Conditional row read of csv in pandas
I have large CSVs where I'm only interested in a subset of the rows. In particular, I'd like to read in all the rows which occur before a particular condition is met.
For example, if read_csv would yield the dataframe:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
5 35 1.12 'a'
...
1e9 42 2.15 'd'
Is there some way to read all the rows in the csv until col B exceeds 10? In the above example, I'd like to read in:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
I know how to throw these rows out once I've read the dataframe in, but at that point I've already spent all that computation reading them. I do not have access to the index of the final row before reading the csv (no skipfooter, please).
You could read the csv in chunks. Since pd.read_csv returns an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.
import itertools as IT
import pandas as pd
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)  # the csv has a header row, so header=None would mislabel the columns
chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df = pd.concat(chunks)
mask = df['B'] < 10
df = df.loc[mask]
Or, to avoid having to use df.loc[mask] to remove unwanted rows from the last chunk, perhaps a cleaner solution would be to define a custom generator:
import itertools as IT
import pandas as pd
def valid(chunks):
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            yield chunk
        else:
            yield chunk.loc[mask]
            break
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)  # the csv has a header row, so header=None would mislabel the columns
df = pd.concat(valid(chunks))
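As a quick check, the generator approach can be exercised on an in-memory stand-in for the large CSV (the tiny chunksize here is only to force several chunks; note that this variant keeps only the rows with B < 10, so unlike the asker's expected output it drops the first offending row itself):

```python
from io import StringIO

import pandas as pd

# Small in-memory stand-in for the large CSV (illustration only;
# the real use case reads from a file on disk).
data = StringIO(
    "A,B,C\n"
    "34,3.20,b\n"
    "24,9.21,b\n"
    "34,3.32,c\n"
    "24,24.3,c\n"
    "35,1.12,a\n"
    "42,2.15,d\n"
)

def valid(chunks):
    # Yield whole chunks while every row satisfies B < 10; once a chunk
    # contains an offending row, yield only the rows that precede it
    # and stop iterating, so later chunks are never read.
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            yield chunk
        else:
            yield chunk.loc[mask]
            break

chunks = pd.read_csv(data, chunksize=2)  # tiny chunksize to force several chunks
df = pd.concat(valid(chunks))
print(df)  # three rows: B = 3.20, 9.21, 3.32
```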
Building on @joanwa's answer:
df = (pd.read_csv("filename.csv")
[lambda x: x['B'] > 10])
From Wes McKinney's "Python for Data Analysis" chapter on "Advanced pandas":

We cannot refer to the result of load_data until it has been assigned to the temporary variable df. To help with this, assign and many other pandas functions accept function-like arguments, also known as callables. To show callables in action, consider ...
df = load_data()
df2 = df[df['col2'] < 0]
Can be rewritten as:
df = (load_data()
[lambda x: x['col2'] < 0])
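A minimal self-contained sketch of the callable pattern (the col2 data here is hypothetical, just to show that the two forms select the same rows):

```python
import pandas as pd

# Hypothetical data to illustrate callable-based selection.
df = pd.DataFrame({'col2': [-3, 5, -1, 8]})

# Temporary-variable form and callable form are equivalent:
via_temp = df[df['col2'] < 0]
via_callable = df[lambda x: x['col2'] < 0]

print(via_temp.equals(via_callable))  # True
```

The callable form is what makes method chaining possible: the lambda receives the intermediate dataframe, so no temporary variable is needed.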
You can use the built-in csv module to calculate the appropriate row number, then use pd.read_csv with the nrows argument:
from io import StringIO
import pandas as pd
import csv, copy
mycsv = StringIO(""" A B C
34 3.20 'b'
24 9.21 'b'
34 3.32 'c'
24 24.3 'c'
35 1.12 'a'""")
mycsv2 = copy.copy(mycsv) # copying StringIO object [for demonstration purposes]
with mycsv as fin:
    reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
    header = next(reader)
    counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

df = pd.read_csv(mycsv2, delim_whitespace=True, nrows=counter+1)
print(df)
A B C
0 34 3.20 'b'
1 24 9.21 'b'
2 34 3.32 'c'
3 24 24.30 'c'
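For a file-like buffer, an alternative to copy.copy is simply to rewind with seek(0) after counting (with a real file on disk you would just open it twice). A sketch under that assumption, using sep=r'\s+' since delim_whitespace is deprecated in recent pandas versions:

```python
import csv
from io import StringIO

import pandas as pd

mycsv = StringIO(
    "A B C\n"
    "34 3.20 'b'\n"
    "24 9.21 'b'\n"
    "34 3.32 'c'\n"
    "24 24.3 'c'\n"
    "35 1.12 'a'\n"
)

reader = csv.reader(mycsv, delimiter=' ', skipinitialspace=True)
next(reader)  # skip the header row
counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

mycsv.seek(0)  # rewind the buffer instead of copying it
df = pd.read_csv(mycsv, sep=r'\s+', nrows=counter + 1)
print(len(df))  # 4: everything up to and including the first row with B > 10
```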
I would go the easy route described here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
df[df['B'] > 10]