
Very Big CSV File - How to Read Only Certain Rows into Data Frame

I have a very large CSV file that is too big to load into memory with pandas read_csv.

I am looking at dask.dataframe (imported as dd) instead.

I need to use dask to read only certain rows and certain columns from that CSV file and store the result as a pandas DataFrame.

For example:

User  ProductA  ProductB
A     1         2
B     2         3
C     3         1

How can I only read the row for user C and column ProductA using dask?

Required output as data frame:

User  ProductA
C     3

You can use the read_csv function of dask.dataframe, filter the rows, and then convert the result to a pandas DataFrame:

import dask.dataframe as dd

path2file = "yourpath.csv"
cols = ["User", "ProductA"]
# Check the separator: if the file uses ";" or something else, pass it to
# read_csv below via the sep parameter.
# usecols is forwarded to pandas.read_csv, so only these columns are parsed.
dataset = dd.read_csv(path2file, usecols=cols)
# Keep only the rows for user "C" (lazy; nothing is read yet).
dataset = dataset.loc[dataset["User"] == "C"]
# compute() runs the work and returns a regular pandas DataFrame.
dataset = dataset.compute()
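
If dask is not available, a similar result can be sketched with plain pandas by reading the file in chunks. This is an alternative technique, not the dask approach above. It assumes a comma-separated file, and the temporary file, the `matches` list and the tiny `chunksize=2` are illustrative choices made only so the example runs on the sample table from the question.

```python
import os
import tempfile

import pandas as pd

# Write the question's sample table to a temporary CSV so the sketch is runnable.
sample = "User,ProductA,ProductB\nA,1,2\nB,2,3\nC,3,1\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path2file = tmp.name

cols = ["User", "ProductA"]
matches = []
# With chunksize set, read_csv yields DataFrames of at most that many rows,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(path2file, usecols=cols, chunksize=2):
    hit = chunk[chunk["User"] == "C"]
    if not hit.empty:
        matches.append(hit)

# Combine the matching rows; fall back to an empty frame if nothing matched.
if matches:
    result = pd.concat(matches, ignore_index=True)
else:
    result = pd.DataFrame(columns=cols)

os.remove(path2file)
print(result)
```

For a real file, a much larger chunksize (for example, hundreds of thousands of rows) would normally be used; the right value depends on the available memory.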
