Filter out rows from CSV before loading to pandas dataframe

I have a large CSV file that I cannot load into a DataFrame using read_csv() due to memory issues. However, the first column of the CSV holds a {0,1} flag, and I only need the rows flagged '1', which will easily be small enough to fit in a DataFrame. Is there any way to load the data with a condition, or to manipulate the CSV prior to loading it (similar to grep)?

You can use the comment parameter of pd.read_csv and set it to '0'. pandas skips any line that begins with the comment character, so the rows flagged 0 are dropped while the file is being parsed.

import pandas as pd
from io import StringIO

txt = """col1,col2
1,a
0,b
1,c
0,d"""

pd.read_csv(StringIO(txt), comment='0')

   col1 col2
0     1    a
1     1    c
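
One caveat worth checking before relying on this: the comment character is not limited to the first column. When it appears later in a line, the rest of that line is not parsed. The sketch below uses made-up sample data (txt_with_zero is a hypothetical variable, not part of the original answer) in which a kept row contains a '0' in its second column; the value a0b is read back as a.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data: the first data row has a '0' inside col2.
txt_with_zero = """col1,col2
1,a0b
0,b
1,c"""

# Lines starting with '0' are skipped entirely, but the '0' inside
# 'a0b' also starts a comment, so everything from it onward is dropped.
df = pd.read_csv(StringIO(txt_with_zero), comment='0')
print(df)
```

If the remaining columns can contain the digit 0, this trick is probably unsafe and one of the other approaches is the better fit.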

You can also use chunksize to turn pd.read_csv into an iterator, filter each chunk with query, and combine the results with pd.concat.
NOTE: As the OP pointed out, a chunk size of 1 isn't realistic. I used it for demonstration purposes only; increase it to suit your needs.

pd.concat([df.query('col1 == 1') for df in pd.read_csv(StringIO(txt), chunksize=1)])
# Equivalent to, but slower than, the boolean-mask version below; use that one for better performance
# pd.concat([df[df.col1 == 1] for df in pd.read_csv(StringIO(txt), chunksize=1)])

   col1 col2
0     1    a
2     1    c
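
For the grep-like route the question mentions, a minimal sketch is to filter the raw lines in Python and hand only the kept lines to pd.read_csv. The helper name read_flagged_rows and its flag and sep parameters are illustrative assumptions, not pandas API. It also assumes the flag is the unquoted first field and that no quoted field spans multiple lines.

```python
import io
import pandas as pd

def read_flagged_rows(lines, flag="1", sep=","):
    """Keep the header plus rows whose first field equals `flag`, then parse them.

    `lines` is any iterable of text lines, such as an open file object.
    Hypothetical helper; assumes a simple CSV with no multi-line quoted fields.
    """
    lines = iter(lines)
    kept = [next(lines)]  # always keep the header line
    # Only the text before the first separator is compared, one line at a time,
    # so the full file is never held in memory.
    kept.extend(line for line in lines if line.split(sep, 1)[0] == flag)
    return pd.read_csv(io.StringIO("".join(kept)), sep=sep)

# Demonstration on in-memory data shaped like the example above.
sample = io.StringIO("col1,col2\n1,a\n0,b\n1,c\n0,d\n")
df = read_flagged_rows(sample)
print(df)
```

With a real file, the same function could be called as read_flagged_rows(open_file) inside a with open(...) block. Only the kept rows are buffered as text before parsing, which fits the case where the flagged subset is small.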
