Read CSV File as DataFrame in Python
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a fundamental skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:
    # Load the Pandas libraries with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first 5 lines of the loaded data
    data.head()
While this code seems simple, an understanding of four fundamental concepts is required to fully grasp and debug the operation of the data loading process if you run into issues:
- Understanding file extensions and file types – what do the letters CSV really mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers decide how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
- project1.DOCX – a Microsoft Word file called Project1.
- shanes_file.TXT – a simple text file called shanes_file.
- IMG_5673.JPG – an image file called IMG_5673.
- Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete listing of extensions here.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will take on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > In menu, click Finder > Preferences, Click Advanced, Select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / pressing enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data are shown in the diagram below.

Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
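To make the text-to-table mapping concrete, here is a minimal sketch. The column names and values are made up for illustration, and io.StringIO is used only so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first row holds the column names,
# commas separate columns, and each newline starts a new row.
csv_text = """name,age,city
Alice,34,Dublin
Bob,27,Cork
"""

# io.StringIO lets read_csv treat an in-memory string like a file.
people = pd.read_csv(io.StringIO(csv_text))
print(people)
```

Each line of text becomes one row of the DataFrame, and the header line supplies the column labels.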
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
However, the selection of the ',' comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
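As a small sketch of the sep parameter, the snippet below reads the same made-up table once from tab-separated text and once from semicolon-separated text. The data is hypothetical and held in memory only to keep the example self-contained.

```python
import io
import pandas as pd

# The same hypothetical table, stored with two different delimiters.
tsv_text = "name\tage\nAlice\t34\nBob\t27\n"
ssv_text = "name;age\nAlice;34\nBob;27\n"

from_tabs = pd.read_csv(io.StringIO(tsv_text), sep="\t")
from_semicolons = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both reads should produce the same two-column DataFrame.
print(from_tabs.equals(from_semicolons))
```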
Delimiters in Text Fields – Quotechar
One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.
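A minimal sketch of the quotechar behaviour, assuming a semicolon-delimited file whose "NickName" field itself contains semicolons. The names and values here are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data; the NickName values contain
# semicolons, so they are wrapped in double quotes.
ssv_text = (
    'Name;NickName;Age\n'
    'Robert;"Bob; Bobby";41\n'
    'Elizabeth;"Liz; Beth";38\n'
)

# quotechar='"' is the default; it is written out here for clarity.
df = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')
print(df["NickName"].tolist())
```

Without the quotes, each of those rows would appear to have four fields instead of three.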
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module brings operating system dependent functionality into Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
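The following sketch shows the os.chdir() approach end to end. It assumes a throwaway temporary folder and a made-up file name (test_data.csv) so that it can run anywhere; in practice you would pass the folder that holds your own files.

```python
import os
import tempfile
import pandas as pd

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as data_dir:
    # Write a small hypothetical CSV file into the temporary folder.
    with open(os.path.join(data_dir, "test_data.csv"), "w") as f:
        f.write("id,value\n1,10\n2,20\n")

    os.chdir(data_dir)  # make the data folder the working directory
    try:
        print(os.listdir("."))  # the file is now visible by its bare name
        data = pd.read_csv("test_data.csv")
    finally:
        os.chdir(original_dir)  # always switch back afterwards

print(data)
```

Switching back in a finally block is a design choice: it keeps later relative paths in the same session from silently pointing at the wrong folder.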
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
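The difference between the two kinds of path can be sketched with the standard pathlib module. The data/test_file.csv location is hypothetical and does not need to exist for this snippet to run; read_csv accepts Path objects as well as plain strings.

```python
from pathlib import Path

# A relative path: interpreted from the current working directory.
relative_path = Path("data") / "test_file.csv"

# The matching absolute path: the working directory joined with the relative part.
absolute_path = Path.cwd() / relative_path

# '..' steps up to the parent of the working directory.
parent_relative = Path("..") / "data" / "test_file.csv"

print(relative_path.is_absolute())
print(absolute_path.is_absolute())
print(absolute_path)
```

The printed absolute path will differ from machine to machine, which is exactly why relative paths travel better between computers.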

4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!).
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors (reported as pandas.errors.ParserError in recent Pandas versions) can be caused in unusual circumstances to do with your data format. Try adding the parameter engine='python' to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
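A hedged sketch of how these situations can be handled in code. The file name and the sample bytes are invented for illustration, and the encoding argument of read_csv is used as an alternative to re-saving the file; "latin-1" is only an assumption about how this particular sample was encoded.

```python
import io
import pandas as pd

# 1. A missing file raises FileNotFoundError, which can be caught and reported.
try:
    pd.read_csv("this_file_does_not_exist.csv")
except FileNotFoundError as err:
    print("Missing file:", err)

# 2. Bytes saved in a non-UTF-8 encoding typically fail with the default settings.
raw_bytes = "name,city\nZoë,Málaga\n".encode("latin-1")
try:
    pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError as err:
    print("Decode problem:", err)

# Passing the encoding the file was actually written in avoids the error.
cities = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(cities)

# 3. The slower Python parsing engine is selected with engine="python".
simple = pd.read_csv(io.StringIO("a,b\n1,2\n"), engine="python")
print(simple)
```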
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
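A minimal sketch of dtype and parse_dates working together, using made-up column names (name, age, joined) and in-memory text so the example is self-contained.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data with a text column, an integer column and a date column.
csv_text = "name,age,joined\nAlice,34,2019-03-01\nBob,27,2020-11-15\n"

typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # explicit types instead of inference
    parse_dates=["joined"],                # read this column as datetimes
)
print(typed.dtypes)
```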
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Finally, the usecols parameter can be used to specify which columns in the data to load.
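The three parameters can be compared side by side on a small made-up table. The column names and values are hypothetical.

```python
import io
import pandas as pd

csv_text = "id,name,score\n1,Ann,90\n2,Ben,85\n3,Cat,70\n4,Dan,60\n"

# Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip specific lines: indices are zero-based and the header is line 0,
# so [1, 3] drops the rows with id 1 and id 3.
without_rows = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# Load only the named columns.
two_columns = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"])

print(first_two)
print(without_rows)
print(two_columns)
```

Note that an integer skiprows counts from the very first line of the file, so skiprows=1 on this data would skip the header line itself.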
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
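As a small illustration of na_values, the sketch below reads the same invented data twice: once with the defaults, and once treating '.' and '??' as missing as well.

```python
import io
import pandas as pd

# Hypothetical export where '.' and '??' were used for unknown salaries.
csv_text = "name,salary\nAlice,52000\nBob,.\nCara,??\nDev,NA\n"

default_read = pd.read_csv(io.StringIO(csv_text))
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

# Count the missing salaries under each setting.
print(default_read["salary"].isna().sum())
print(custom_read["salary"].isna().sum())
```

With the defaults only the 'NA' entry is treated as missing and the column stays as text; with the custom tokens the remaining value is numeric, so the column can be used in calculations.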
    # Advanced CSV loading example
    import pandas as pd

    data = pd.read_csv(
        "data/files/complex_data_example.tsv",     # relative python path to subdirectory
        sep='\t',                                   # Tab-separated value file.
        quotechar="'",                              # single quote allowed as quote character
        dtype={"salary": int},                      # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
        parse_dates=['birth_date'],                 # Interpret the birth_date column as a date
        skiprows=10,                                # Skip the first 10 rows of the file
        na_values=['.', '??']                       # Take any '.' or '??' values as NA
    )
CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will see as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor.
- CSV files are quick to create and load into memory before analysis.
However, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find however that your CSV data compresses well using zip compression.
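The loss of type information can be seen in a short round trip: write a DataFrame out with to_csv, then read it back. This is a sketch with invented column names (id, when), using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Hypothetical table: zero-padded text IDs and real datetime values.
original = pd.DataFrame({
    "id": ["001", "002"],
    "when": pd.to_datetime(["2021-01-05", "2021-02-10"]),
})

# Write the DataFrame back to CSV text (index=False leaves out the row labels).
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Reading with the defaults: types are re-inferred from the text alone.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)

# Supplying the types again on load restores the intended columns.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"id": str}, parse_dates=["when"])
print(restored.dtypes)
```

In the default read, the IDs come back as integers without their leading zeros and the dates come back as plain text, which is why the types have to be stated again on every load.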
As an aside, in an attempt to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the os module.
- Datacamp Tutorial on loading CSV files, including some additional OS commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/