
Pandas read_csv with a filter function to choose specific rows

As the world develops, data keeps getting bigger. Under normal circumstances, without a cluster, a single MacBook Pro cannot handle large amounts of data.

So does the Pandas read_csv function have a filter? With one, instead of loading all the data into memory at once, I could load only the specified rows each time (similar to processing grouped data after a groupby). For example:

Table:

A  B  C  D  E  F
a1 b1 0  1  1  0
a2 b1 1  0  0  1
...
an bm 0  1  1  0

Expected:

# The kind of function I would expect to write
def my_filter(line):
    line_list = line.strip().split(Delimiter)  # Delimiter: the file's field separator
    if len(line_list) != 6:
        return None
    if line_list[0] != 'a1' and line_list[1] != 'b1':
        return None
    return line

# This way only a little data is read at once, which even a MacBook Pro can handle,
# if read_csv supported such a parameter (hypothetical: it has no `filter` argument).
df = pd.read_csv(table_name, filter=my_filter)

I searched a lot of documentation but could not find such a feature. If it really does not exist, I hope Pandas will support it someday.

You can use the built-in open to filter the lines you need, then read the filtered result with read_csv:

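from io import StringIO
import pandas as pd

out = StringIO()  # you can also save the output to a different file
with open('file.csv') as f_in:
    for line in f_in:
        line_list = line.strip().split(",")
        # As written, this keeps 6-field lines whose first field is not 'a1'
        # and whose second field is not 'b1' (the header row passes this test too).
        if len(line_list) == 6 and (line_list[0] != 'a1' and line_list[1] != 'b1'):
            out.write(line)
out.seek(0)  # rewind so read_csv starts at the beginning of the buffer
print(pd.read_csv(out))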
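The loop above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original answer: the names `filter_lines` and `row_matches` are hypothetical, it assumes a comma-delimited file whose first line is the header, and `row_matches` follows the rule from the question's `my_filter` (keep 6-field lines where the first field is 'a1' or the second is 'b1').

```python
import io


def filter_lines(path, keep, keep_header=True, encoding="utf-8"):
    """Copy the lines of `path` for which keep(line) is true into a text buffer.

    Hypothetical helper: if keep_header is true, the first line is always
    copied so that read_csv can still pick up the column names.
    """
    out = io.StringIO()
    with open(path, encoding=encoding) as f_in:
        for line_number, line in enumerate(f_in):
            if (keep_header and line_number == 0) or keep(line):
                out.write(line)
    out.seek(0)  # rewind so the caller reads from the start
    return out


def row_matches(line, delimiter=","):
    """Same rule as the question's my_filter, returning a bool instead of the line."""
    fields = line.strip().split(delimiter)
    if len(fields) != 6:
        return False
    return fields[0] == "a1" or fields[1] == "b1"
```

The returned buffer can be passed straight to read_csv, for example `pd.read_csv(filter_lines('file.csv', row_matches))`, so only the kept lines are ever parsed into a DataFrame.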

If you would rather test for membership anywhere in the line instead of checking by position, you can use a set:

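from io import StringIO
import pandas as pd

out = StringIO()
with open('file.csv') as f_in:
    for line in f_in:
        line_list = line.strip().split(",")
        # As written, this keeps 6-field lines that contain neither 'a1' nor 'b1'
        # in any field (the header row passes this test too).
        if len(line_list) == 6 and {'a1', 'b1'}.isdisjoint(line_list):
            out.write(line)
out.seek(0)  # rewind so read_csv starts at the beginning of the buffer
print(pd.read_csv(out))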
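A different approach, not mentioned in the answer above, is to let read_csv itself read the file piece by piece through its `chunksize` parameter and filter each piece as a DataFrame. The sketch below assumes the column names A and B from the question's table; the function names `read_csv_in_chunks` and `a1_or_b1` are hypothetical. Unlike the line-based loop, it does not skip rows with the wrong number of fields.

```python
import pandas as pd


def read_csv_in_chunks(path, row_mask, chunksize=100_000, **read_csv_kwargs):
    """Read `path` chunk by chunk, keeping only rows where row_mask(chunk) is True.

    Only one chunk plus the rows kept so far are held in memory at a time.
    """
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs):
        kept.append(chunk[row_mask(chunk)])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


def a1_or_b1(chunk):
    """Boolean mask: rows whose column A is 'a1' or whose column B is 'b1'."""
    return (chunk["A"] == "a1") | (chunk["B"] == "b1")
```

A call such as `read_csv_in_chunks('file.csv', a1_or_b1)` returns a single DataFrame of the matching rows. The chunk size is a trade-off: smaller chunks lower peak memory use but add per-chunk overhead.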


 