
Pandas read_csv with a filter function to select specific rows

Datasets keep getting bigger. Without a cluster, a single machine such as a MacBook Pro usually cannot load a large dataset into memory.

So does pandas' read_csv function support a filter? That way, instead of loading all the data into memory at once, I could load only the rows I specify (similar to processing one group at a time after a groupby). For example:

Table:

A  B  C  D  E  F
a1 b1 0  1  1  0
a2 b1 1  0  0  1
...
an bm 0  1  1  0

Expected:

# Expected: a filter function like this
DELIMITER = ','  # assumed separator

def my_filter(line):
    line_list = line.strip().split(DELIMITER)
    if len(line_list) != 6:
        return None
    if line_list[0] != 'a1' and line_list[1] != 'b1':
        return None
    return line

# If read_csv supported such a parameter, only a small amount of data
# would be read at once, which even a MacBook Pro can handle.
df = pd.read_csv(table_name, filter=my_filter)

I searched the documentation but could not find such an option. If it really does not exist, I hope pandas will support it someday.

You can use the built-in open to keep only the lines you need, then pass the filtered text to read_csv:

from io import StringIO
import pandas as pd

out = StringIO()  # you can also write the output to a different file
with open('file.csv') as f_in:
    out.write(next(f_in))  # always keep the header row
    for line in f_in:
        line_list = line.strip().split(",")
        # keep rows whose first field is 'a1' or whose second field is 'b1'
        if len(line_list) == 6 and (line_list[0] == 'a1' or line_list[1] == 'b1'):
            out.write(line)
out.seek(0)
print(pd.read_csv(out))
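
The loop above can be wrapped in a reusable function that takes a callback, which comes close to the filter= parameter the question asks for. This is only a minimal sketch: read_csv_filtered is a hypothetical helper, not part of pandas, and the comma separator, the boolean-returning my_filter, and the in-memory sample data are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd


def my_filter(line, delimiter=","):
    """Return True for rows whose first field is 'a1' or second field is 'b1'."""
    fields = line.strip().split(delimiter)
    return len(fields) == 6 and (fields[0] == "a1" or fields[1] == "b1")


def read_csv_filtered(lines, keep_line, **read_csv_kwargs):
    """Hypothetical helper (not a pandas API): parse only lines passing keep_line.

    `lines` is any iterable of text lines, such as an open file object.
    """
    lines = iter(lines)
    out = StringIO()
    out.write(next(lines))  # always keep the header row
    out.writelines(line for line in lines if keep_line(line))
    out.seek(0)
    return pd.read_csv(out, **read_csv_kwargs)


# Usage with in-memory sample data (assumed); for a real file, pass the
# object returned by open('file.csv') instead.
sample = StringIO(
    "A,B,C,D,E,F\n"
    "a1,b1,0,1,1,0\n"
    "a2,b1,1,0,0,1\n"
    "a3,b2,0,1,1,0\n"
)
df = read_csv_filtered(sample, my_filter)
print(df)
```

Only the lines that pass the callback are ever handed to the parser, so memory use scales with the filtered rows rather than with the whole file.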

Or, if you want to check for membership anywhere in the line rather than by position, you can use a set:

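from io import StringIO
import pandas as pd

wanted = {'a1', 'b1'}
out = StringIO()
with open('file.csv') as f_in:
    out.write(next(f_in))  # always keep the header row
    for line in f_in:
        line_list = line.strip().split(",")
        # keep rows that contain 'a1' or 'b1' in any field
        if len(line_list) == 6 and not wanted.isdisjoint(line_list):
            out.write(line)
out.seek(0)
print(pd.read_csv(out))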
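
If the condition can be expressed on parsed columns rather than raw lines, a different technique is the chunksize parameter of read_csv, which yields DataFrames piece by piece instead of loading the whole file. The sketch below assumes the column names A and B from the question's table. read_csv_in_chunks and the sample data are hypothetical, and the sketch assumes at least one chunk is read.

```python
from io import StringIO

import pandas as pd


def read_csv_in_chunks(source, chunksize=100_000, **read_csv_kwargs):
    """Hypothetical helper: read `source` in chunks and keep matching rows only."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        # Same condition as the question: column A is 'a1' or column B is 'b1'.
        mask = (chunk["A"] == "a1") | (chunk["B"] == "b1")
        kept.append(chunk[mask])
    # Only the filtered pieces are held in memory at the end.
    return pd.concat(kept, ignore_index=True)


# In-memory sample data (assumed); a file path works the same way.
sample = StringIO(
    "A,B,C,D,E,F\n"
    "a1,b1,0,1,1,0\n"
    "a2,b1,1,0,0,1\n"
    "a3,b2,0,1,1,0\n"
)
df_chunks = read_csv_in_chunks(sample, chunksize=2)
print(df_chunks)
```

Each chunk is fully parsed before it is filtered, so peak memory depends on chunksize. The line-based approaches above skip parsing for rejected rows entirely.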
