When importing a CSV into pandas, how can you import only the columns whose names contain a specified string?

Question

I have thousands of CSV files, each containing hundreds of columns and hundreds of thousands of rows. For speed, I want to import only the data I need into pandas DataFrames. I can filter out the CSV files that I do not need using a separate metadata file, but I am having trouble figuring out how to drop the columns that I do not need during the import. I know how to filter the columns of a DataFrame after it has been imported, but I am trying to avoid importing unnecessary data in the first place.

So let's say I have the following csv file:

Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00       2           4           7              6               5
2:00       3           5           4              5               8
3:00       1           4           7              4               4

I want to import only the columns whose names include the text "Pie", as well as the "Date/Time" column. Note that the column names and the number of columns differ across my CSV files, so the "usecols" argument has not worked for me as-is, since I do not know the specific column names to enter.

Answer 1

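You can pass the columns you want to read_csv() as a list via the usecols parameter, for example:

df = pd.read_csv('file.csv', usecols=['columnName1', 'columnName3'])

Note that 'columnName2' is left out, so it is not loaded. (The names parameter, by contrast, only assigns labels to the columns and does not filter them.)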
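Since the question says the column names differ from file to file, a hard-coded list is not always available. One possible workaround, sketched here under the assumption that the files are comma-separated, is to parse only the header row with nrows=0, build the list from it, and then read the file again with that list. The sample text and the variable names (csv_text, header, wanted) are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Hypothetical comma-separated sample standing in for one of the files.
csv_text = (
    "Date/Time,Apple Tart,Cherry Pie,Blueberry Pie\n"
    "1:00,2,4,7\n"
    "2:00,3,5,4\n"
)

# nrows=0 parses just the header line, so no data rows are loaded.
header = pd.read_csv(StringIO(csv_text), nrows=0)

# Keep the timestamp column plus every column whose name contains "Pie".
wanted = [c for c in header.columns if "Pie" in c or c == "Date/Time"]

# Second pass: load only the selected columns.
df = pd.read_csv(StringIO(csv_text), usecols=wanted)
```

This opens each file twice, but the first pass touches only one line, so the extra cost should be small compared with loading unneeded columns.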

Answer 2

The usecols parameter in pandas read_csv accepts a function to filter for the columns you are interested in:

import pandas as pd
from io import StringIO

data = """Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00       2           4           7              6               5
2:00       3           5           4              5               8
3:00       1           4           7              4               4"""


df = pd.read_csv(StringIO(data),
                 sep=r'\s{2,}',    # columns are separated by two or more spaces
                 engine='python',  # regex separators need the python engine
                 # This is the key part for your use case: usecols accepts a
                 # callable that is called with each column name and keeps the
                 # columns for which it returns True, here the names that
                 # contain "Pie" or "Date/Time".
                 usecols=lambda x: ("Pie" in x) or ("Date/Time" in x))
df


  Date/Time  Cherry Pie  Blueberry Pie
0      1:00           4              7
1      2:00           5              4
2      3:00           4              7
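Because the same filter has to be applied to thousands of files, it may be convenient to wrap the callable in a small helper. The following is only a sketch: read_matching, keyword, and always_keep are hypothetical names, and the sample assumes comma-separated input rather than the space-aligned text used above.

```python
import pandas as pd
from io import StringIO

def read_matching(source, keyword, always_keep=("Date/Time",), **kwargs):
    """Load only columns whose name contains `keyword`, plus `always_keep`."""
    def keep(name):
        # Called by read_csv once per column name found in the header.
        return keyword in name or name in always_keep
    # Extra keyword arguments (sep, engine, ...) are passed through unchanged.
    return pd.read_csv(source, usecols=keep, **kwargs)

# Hypothetical comma-separated sample standing in for one of the files.
sample = StringIO(
    "Date/Time,Apple Tart,Cherry Pie,Blueberry Pie,Tomato Soup\n"
    "1:00,2,4,7,5\n"
    "2:00,3,5,4,8\n"
)
pies = read_matching(sample, "Pie")
```

For real files, source would be a path, and the helper could be called in a loop over the file list taken from the metadata file.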
