I have thousands of CSV files that each contain hundreds of columns and hundreds of thousands of rows. For speed, I want to import only the data I need into pandas DataFrames. I can filter out the CSV files that I do not need using a separate metadata file, but I am having trouble figuring out how to drop the columns that I do not need during the import. (I know how to filter the columns of a DataFrame after it's been imported, but as I said, I am trying to avoid importing unnecessary data.)
So let's say I have the following csv file:
Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00       2           4           7              6               5
2:00       3           5           4              5               8
3:00       1           4           7              4               4
I want to import only the columns whose names include the text "Pie", as well as the "Date/Time" column. Also note that the column names and the number of columns differ across my CSV files, so the "usecols" argument has not worked for me as-is, since I do not know the specific column names to enter.
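You can specify the columns to load as a list through the usecols parameter of read_csv(), for example:
df = pd.read_csv('fila.csv', usecols=['columnName#1', 'columnName3'])
Note that I did not include 'columnName2', so that column is not loaded. (The names parameter, by contrast, assigns column labels; it does not select columns.)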
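An explicit list only helps once the column names are known, which is not the case in the question. One possible workaround, sketched below, is to parse just the header with nrows=0, build the list from it, and then read the file again with usecols. The comma-separated sample text and the names csv_text, header and wanted are assumptions made for illustration; a real file path would replace the StringIO objects.

```python
from io import StringIO

import pandas as pd

# Hypothetical comma-separated stand-in for one of the CSV files.
csv_text = """Date/Time,Apple Tart,Cherry Pie,Blueberry Pie,Banana Pudding,Tomato Soup
1:00,2,4,7,6,5
2:00,3,5,4,5,8
3:00,1,4,7,4,4
"""

# Pass 1: nrows=0 parses only the header line, so no data rows are loaded.
header = pd.read_csv(StringIO(csv_text), nrows=0).columns

# Build the explicit list of columns to keep for this particular file.
wanted = [c for c in header if c == "Date/Time" or "Pie" in c]

# Pass 2: read only the selected columns.
df = pd.read_csv(StringIO(csv_text), usecols=wanted)
print(df)
```

This costs one extra open of each file, but it leaves you with the list of selected names, which can be handy for logging or for checking against the metadata file.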
The usecols parameter in pandas read_csv accepts a function to filter for the columns you are interested in:
import pandas as pd
from io import StringIO
# columns are separated by two or more spaces, because the names themselves contain single spaces
data = """Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00       2           4           7              6               5
2:00       3           5           4              5               8
3:00       1           4           7              4               4"""
df = pd.read_csv(StringIO(data),
                 sep=r'\s{2,}',    # raw string: split on runs of two or more whitespace characters
                 engine='python',  # regex separators require the python engine
                 # this is the key part of the code for your use case:
                 # it keeps only the columns whose name contains "Pie" or "Date/Time".
                 # usecols accepts any callable, so the filter is easy to extend
                 usecols=lambda x: ("Pie" in x) or ("Date/Time" in x))
df
  Date/Time  Cherry Pie  Blueberry Pie
0      1:00           4              7
1      2:00           5              4
2      3:00           4              7
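Because the filter is just a callable, the same one can be reused for every file, whatever columns each file has. The following is a minimal sketch under some assumptions: the file names, the sample contents and the helper keep_column are all hypothetical, and comma-separated text is used so that the default parser applies.

```python
from io import StringIO

import pandas as pd


def keep_column(name):
    """Return True for the timestamp column and for any column mentioning 'Pie'."""
    return name == "Date/Time" or "Pie" in name


# Hypothetical stand-ins for two files with different sets of columns.
sources = {
    "a.csv": "Date/Time,Apple Tart,Cherry Pie\n1:00,2,4\n2:00,3,5\n",
    "b.csv": "Date/Time,Pecan Pie,Tomato Soup,Key Lime Pie\n1:00,9,5,1\n",
}

# The same callable is applied to each file's own header.
frames = {
    name: pd.read_csv(StringIO(text), usecols=keep_column)
    for name, text in sources.items()
}

for name, frame in frames.items():
    print(name, list(frame.columns))
```

With real files, each StringIO(text) would be replaced by the path of a file that passed the metadata filter.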