
How can I utilize dask's dataframe.read_csv with a google storage globstring while using different skiprows values for each file?

I have folders in a Google bucket that contain CSVs I'm trying to read into dask.DataFrames so I can normalize the files in parallel. For example, some of those dataframes may be missing a column that the rest have, so I want to insert the missing column into each dataframe that lacks it.

My problem

When using a globstring, such as ddfs = ddf.read_csv(f"gs://bucket/{folder}/*.csv"), I (expectedly) get pandas.errors.ParserError, because not only are some files missing their headers, some files' header row does not begin on the first line. I can loop through the directory and analyze each file before handing the globstring to dask.dataframe. Below is the logic I would use in that case:

import re

import pandas as pd

nrows = 10  # sample size used only to detect each file's header position
file_analysis = dict()
for filepath in files:  # `files` is the list of CSV paths in the bucket folder
    skiprows = None
    while True:
        try:
            df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
            break
        except pd.errors.ParserError as e:
            try:
                # pandas reports the first offending line number; the header sits just above it
                start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                skiprows = int(start_row_index) - 1
            except IndexError:
                print("Could not locate start_row_index in pandas ParserError message")
                raise
    headers = df.columns.values.tolist()  # noqa
    # also skip the header row itself, since `names` is passed explicitly on the real read
    skiprows = skiprows + 1 if skiprows else 1
    # store dictionary of pandas params that correspond to each file for `.read_csv()` calls
    file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
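
For reference, the `files` list above could be built with gcsfs, the filesystem backend dask already uses for gs:// paths; this is a minimal sketch, assuming gcsfs is installed and reusing the bucket/folder placeholders from the globstring above:

import gcsfs

fs = gcsfs.GCSFileSystem()
# glob returns paths without the protocol prefix, so add it back
files = [f"gs://{path}" for path in fs.glob(f"gs://bucket/{folder}/*.csv")]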

But this would increase the execution time, especially when some directories contain thousands of files. And even then, I'm not sure how I would pass the per-file dictionary values to dask's dataframe.read_csv.

My question

Is there a way for me to pass a function to dask.dataframe.read_csv that would allow dynamic skiprows, and perhaps also dynamic columns, for each CSV file in the Google bucket folder matched by the globstring?

AFAIK this is not possible via dd.read_csv, but you can construct a dask.dataframe with .from_delayed, where each delayed object wraps a function that normalizes one CSV file and returns a pandas dataframe.

Note that from_delayed will expect consistent column names and dtypes, so this is something that should be handled inside the function.
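
A minimal sketch of that approach, assuming the file_analysis dict built in the question; the normalize_csv helper, the string dtype, and filling missing columns with empty strings are illustrative assumptions, not the only choice:

import dask
import dask.dataframe as dd
import pandas as pd

# union of column names seen across all files
all_columns = sorted({col for params in file_analysis.values() for col in params["names"]})

@dask.delayed
def normalize_csv(filepath, params, columns):
    # read one file with its own skiprows/names/dtype
    df = pd.read_csv(filepath, **params)
    # insert any missing columns so every partition has the same schema
    for col in columns:
        if col not in df.columns:
            df[col] = ""
    # enforce a consistent column order and dtype
    return df[columns].astype(str)

delayed_frames = [
    normalize_csv(path, params, all_columns)
    for path, params in file_analysis.items()
]

# `meta` describes the column names/dtypes so dask does not have to guess them
meta = pd.DataFrame({col: pd.Series(dtype=str) for col in all_columns})
ddf = dd.from_delayed(delayed_frames, meta=meta)

Each file is only read when the graph is computed, so the normalization runs in parallel across the cluster rather than in a serial loop on the client.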
