Conditional row read of csv in pandas
I have large CSVs where I'm only interested in a subset of the rows. In particular, I'd like to read in all the rows which occur before a particular condition is met.
For example, if read_csv would yield the dataframe:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
5 35 1.12 'a'
...
1e9 42 2.15 'd'
Is there some way to read all the rows in the csv until col B exceeds 10? In the above example, I'd like to read in:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
I know how to throw these rows out once I've read the dataframe in, but at that point I've already spent all that computation reading them. I do not have access to the index of the final row before reading the csv (no skipfooter, please).
You could read the csv in chunks. Since pd.read_csv returns an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.
import itertools as IT
import pandas as pd
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)  # the csv has a header row, so header=None would mislabel the columns
chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df = pd.concat(chunks)
mask = df['B'] < 10
df = df.loc[mask]
Or, to avoid having to use df.loc[mask] to remove unwanted rows from the last chunk, perhaps a cleaner solution would be to define a custom generator:
import itertools as IT
import pandas as pd
def valid(chunks):
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            yield chunk
        else:
            yield chunk.loc[mask]
            break
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)  # the csv has a header row, so header=None would mislabel the columns
df = pd.concat(valid(chunks))
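As a quick check, the generator approach can be exercised on an in-memory stand-in for the large CSV (the tiny chunksize here is only to force several chunks; note that this variant keeps only the rows with B < 10, so unlike the asker's expected output it drops the first offending row itself):

```python
from io import StringIO

import pandas as pd

# Small in-memory stand-in for the large CSV (illustration only;
# the real use case reads from a file on disk).
data = StringIO(
    "A,B,C\n"
    "34,3.20,b\n"
    "24,9.21,b\n"
    "34,3.32,c\n"
    "24,24.3,c\n"
    "35,1.12,a\n"
    "42,2.15,d\n"
)

def valid(chunks):
    # Yield whole chunks while every row satisfies B < 10; once a chunk
    # contains an offending row, yield only the rows that precede it
    # and stop iterating, so later chunks are never read.
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            yield chunk
        else:
            yield chunk.loc[mask]
            break

chunks = pd.read_csv(data, chunksize=2)  # tiny chunksize to force several chunks
df = pd.concat(valid(chunks))
print(df)  # three rows: B = 3.20, 9.21, 3.32
```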
Building on @joanwa's answer:
df = (pd.read_csv("filename.csv")
[lambda x: x['B'] > 10])
From Wes McKinney's "Python for Data Analysis" chapter on "Advanced pandas":

We cannot refer to the result of load_data until it has been assigned to the temporary variable df. To help with this, assign and many other pandas functions accept function-like arguments, also known as callables. To show callables in action, consider ...
df = load_data()
df2 = df[df['col2'] < 0]
Can be rewritten as:
df = (load_data()
[lambda x: x['col2'] < 0])
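A minimal self-contained sketch of the callable pattern (the col2 data here is hypothetical, just to show that the two forms select the same rows):

```python
import pandas as pd

# Hypothetical data to illustrate callable-based selection.
df = pd.DataFrame({'col2': [-3, 5, -1, 8]})

# Temporary-variable form and callable form are equivalent:
via_temp = df[df['col2'] < 0]
via_callable = df[lambda x: x['col2'] < 0]

print(via_temp.equals(via_callable))  # True
```

The callable form is what makes method chaining possible: the lambda receives the intermediate dataframe, so no temporary variable is needed.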
You can use the built-in csv module to calculate the appropriate row number, then use pd.read_csv with the nrows argument:
from io import StringIO
import pandas as pd
import csv, copy
mycsv = StringIO(""" A B C
34 3.20 'b'
24 9.21 'b'
34 3.32 'c'
24 24.3 'c'
35 1.12 'a'""")
mycsv2 = copy.copy(mycsv) # copying StringIO object [for demonstration purposes]
with mycsv as fin:
    reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
    header = next(reader)
    counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

df = pd.read_csv(mycsv2, delim_whitespace=True, nrows=counter+1)
print(df)
A B C
0 34 3.20 'b'
1 24 9.21 'b'
2 34 3.32 'c'
3 24 24.30 'c'
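For a file-like buffer, an alternative to copy.copy is simply to rewind with seek(0) after counting (with a real file on disk you would just open it twice). A sketch under that assumption, using sep=r'\s+' since delim_whitespace is deprecated in recent pandas versions:

```python
import csv
from io import StringIO

import pandas as pd

mycsv = StringIO(
    "A B C\n"
    "34 3.20 'b'\n"
    "24 9.21 'b'\n"
    "34 3.32 'c'\n"
    "24 24.3 'c'\n"
    "35 1.12 'a'\n"
)

reader = csv.reader(mycsv, delimiter=' ', skipinitialspace=True)
next(reader)  # skip the header row
counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

mycsv.seek(0)  # rewind the buffer instead of copying it
df = pd.read_csv(mycsv, sep=r'\s+', nrows=counter + 1)
print(len(df))  # 4: everything up to and including the first row with B > 10
```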
I would go the easy route described here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
df[df['B'] > 10]