
Get subset of rows where any column contains a particular value

I have a very large data file (foo.sas7bdat) that I would like to filter rows from without loading the whole file into memory. For example, I can print the first 20 rows without reading in the entire file by doing the following:

import pandas
import itertools

# chunksize=1 makes read_sas return an iterator that yields one row
# at a time (as a one-row DataFrame) instead of the whole file
with pandas.read_sas('foo.sas7bdat', chunksize=1) as f:
    for row in itertools.islice(f, 20):
        print(row)

However, I am unclear on how to print (or, preferably, write to a new file) only those rows where any column contains the number 123.1. How can I do this?

Pandas can pull a data file in one chunk at a time. Following the trail from the read_sas() documentation to the chunksize parameter, I came across this:

http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk

import pandas as pd

for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)

This reads the file in chunks of 100,000 rows at a time. For the filtering itself, you would run a query against each chunk as it arrives. I don't know the constraints of the problem, but be aware that if you accumulate every matching row into a single DataFrame you might still overflow memory; a more memory-efficient route is to collect the matching row indexes into a set, sort them, and use .iloc to pull just those entries into a DataFrame if you need one. A sketch of the chunk-and-filter approach follows below.
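Here is a minimal sketch of that idea, writing matches out as it goes rather than accumulating them in memory. The chunk size, the output name matches.csv, and the choice of CSV output are arbitrary illustrations, not anything from the original post:

import pandas as pd

# Stream foo.sas7bdat in chunks, keep only rows where any column
# equals 123.1, and append those rows to matches.csv (a file name
# chosen here for illustration)
with pd.read_sas('foo.sas7bdat', chunksize=100000) as reader:
    write_header = True
    for chunk in reader:
        # Row-wise test: True where any cell in the row equals 123.1
        matches = chunk[chunk.isin([123.1]).any(axis=1)]
        if not matches.empty:
            matches.to_csv('matches.csv', mode='a',
                           header=write_header, index=False)
            write_header = False

Two caveats: mode='a' appends, so delete any stale matches.csv before re-running; and exact equality tests on floats can silently miss values, so a tolerance-based check such as numpy.isclose may be safer if 123.1 comes from a computation.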

If even the filtered result is too large to handle comfortably, you may need tools built with out-of-core computation in mind. Dask is a good alternative, and it also scales out to clusters.
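For completeness, here is a hedged sketch of what the same filter might look like in Dask. Dask has no native SAS reader, so this assumes the file has first been converted to Parquet; foo.parquet and matches.parquet are hypothetical names:

import dask.dataframe as dd

# Assumes foo.sas7bdat was converted to Parquet beforehand, since
# Dask cannot read SAS files directly
df = dd.read_parquet('foo.parquet')

# Same row-wise test as above, evaluated lazily across partitions
matches = df[df.isin([123.1]).any(axis=1)]
matches.to_parquet('matches.parquet')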
