I have a very large CSV file that I cannot load into memory with pandas read_csv.
I am looking at dask (import dask.dataframe as dd).
I need to use dask to read only certain rows of certain columns from that CSV file and store the result as a pandas DataFrame.
For example:
User ProductA ProductB
A 1 2
B 2 3
C 3 1
How can I only read the row for user C and column ProductA using dask?
Required output as data frame:
User ProductA
C 3
You can use the read_csv function of dask.dataframe, filter, and then transform your df to a pandas dataframe:
import dask.dataframe as dd
import pandas as pd
path2file = "yourpath.csv"
cols = ["User", "ProductA"]
# Be careful about the separator: if the file uses ";" or something else
# instead of ",", pass it to the function below as the sep parameter
dataset = dd.read_csv(path2file, usecols=cols)
# Filter
dataset = dataset.loc[dataset["User"] == "C"]
# compute() runs the lazy dask operations and returns a pandas DataFrame
dataset = dataset.compute()