I have folders in a Google Cloud Storage bucket that contain CSVs, which I'm trying to read into dask.dataframes in order to normalize the files in parallel. For example, some of the dataframes may be missing a column that the rest of them have, so I want to insert the missing column into each dataframe that lacks it.
When using a globstring, such as ddfs = ddf.read_csv(f"gs://bucket/{folder}/*.csv"), I will, as expected, receive pandas.errors.ParserError, because not only are some files' headers missing, some files' header row may not begin on the first line. I could loop through the directory and analyze each file before using the globstring with dask.dataframe. Below is the logic I would use in that case:
import re

import pandas as pd

nrows = 10  # number of sample rows to read from each file
file_analysis = dict()
for filepath in files:  # `files` is the list of CSV paths in the folder
    skiprows = None
    while True:
        try:
            df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
            break
        except pd.errors.ParserError as e:
            try:
                start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                skiprows = int(start_row_index) - 1
            except IndexError:
                print("Could not locate start_row_index in pandas ParserError message")
                raise  # re-raise rather than retrying with the same params forever
    headers = df.columns.values.tolist()  # noqa
    skiprows = skiprows + 1 if skiprows else 1  # also skip the header row itself
    # store dictionary of pandas params that correspond to each file for `.read_csv()` calls
    file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
But this would increase execution time, especially when some directories contain thousands of files. And even then, I'm not sure how I would pass the dictionary values to dask's dataframe.read_csv.

Is there a way for me to pass a function to dask.dataframe.read_csv that would allow a dynamic skiprows, and maybe also dynamic columns, for each CSV file in the Google bucket folder matched by the globstring?
AFAIK this is not possible via dd.read_csv, but you can construct a dask.dataframe with .from_delayed, where each delayed object wraps a function that normalizes one CSV file and returns a pandas dataframe.

Note that from_delayed expects consistent column names and dtypes across partitions, so this is something that should be handled inside the function.