Very Big CSV File - How to Read Only Certain Rows into Data Frame

Question

I have a very large CSV file that is too big to load into memory with pandas read_csv.

I have been looking at dask.dataframe (imported as dd).

I need to use dask to read only certain rows and certain columns from that CSV file and store the result as a pandas DataFrame.

For example:

User  ProductA  ProductB
A     1         2
B     2         3
C     3         1

How can I only read the row for user C and column ProductA using dask?

Required output as data frame:

User  ProductA
C     3

Answer 1

You can use the read_csv function of dask.dataframe, filter the rows you need, and then convert the result to a pandas DataFrame:

import dask.dataframe as dd
import pandas as pd

path2file = "yourpath.csv"
cols = ["User", "ProductA"]
# Check the delimiter: if the file uses ";" or anything other than ",",
# pass it to read_csv through the sep parameter.
dataset = dd.read_csv(path2file, usecols=cols)
# Filter lazily: keep only the rows where User is "C"
dataset = dataset.loc[dataset["User"] == "C", :]
# compute() runs the work and returns a pandas DataFrame
dataset = dataset.compute()
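
If dask is not available, a similar result can be had with plain pandas by reading the file in chunks. This is a different technique from the one above, shown only as a minimal sketch: the helper name read_filtered, the chunksize value, and the in-memory sample built from the question's table are all assumptions for illustration. With a real file you would pass the path instead of the StringIO object and use a much larger chunksize.

```python
import io

import pandas as pd

# Sample data taken from the question; stands in for the large file on disk.
csv_text = "User,ProductA,ProductB\nA,1,2\nB,2,3\nC,3,1\n"


def read_filtered(source, cols, user, chunksize=2):
    """Hypothetical helper: load only `cols`, keeping rows where User == user.

    The file is read `chunksize` rows at a time, so only one chunk plus the
    matching rows are held in memory at once.
    """
    pieces = []
    for chunk in pd.read_csv(source, usecols=cols, chunksize=chunksize):
        matched = chunk[chunk["User"] == user]
        if not matched.empty:
            pieces.append(matched)
    if not pieces:
        # No matching rows: return an empty frame with the requested columns.
        return pd.DataFrame(columns=cols)
    return pd.concat(pieces, ignore_index=True)


result = read_filtered(io.StringIO(csv_text), ["User", "ProductA"], "C")
print(result)
```

The filter runs on each chunk before anything is kept, so memory use depends on the chunk size and the number of matching rows, not on the size of the whole file.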

solution1
2 2020-04-04 06:32:03