
How can I utilize dask's dataframe.read_csv with a google storage globstring while using different skiprows values for each file?

I have folders in a Google bucket that contain CSVs I'm trying to read into dask.DataFrames so I can normalize the files in parallel. For example, some of those dataframes may be missing a column that the rest have, so I want to insert the missing column into each dataframe that lacks it.

My problem

When using a globstring, such as ddfs = ddf.read_csv(f"gs://bucket/{folder}/*.csv"), I (expectedly) get pandas.errors.ParserError, because not only are some files missing their headers, some files' header row does not begin on the first line. I can loop through the directory and analyze each file before handing the globstring to dask.dataframe. Below is the logic I would use in that case:

import re

import pandas as pd

nrows = 10  # sample size used only to detect each file's header position
file_analysis = dict()
for filepath in files:  # `files` is the list of CSV paths in the bucket folder
    skiprows = None
    while True:
        try:
            df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
            break
        except pd.errors.ParserError as e:
            try:
                # pandas reports the first offending line number; the header sits just above it
                start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                skiprows = int(start_row_index) - 1
            except IndexError:
                print("Could not locate start_row_index in pandas ParserError message")
                raise
    headers = df.columns.values.tolist()  # noqa
    # also skip the header row itself, since `names` is passed explicitly on the real read
    skiprows = skiprows + 1 if skiprows else 1
    # store dictionary of pandas params that correspond to each file for `.read_csv()` calls
    file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
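
For reference, the `files` list above could be built with gcsfs, the filesystem backend dask already uses for gs:// paths; this is a minimal sketch, assuming gcsfs is installed and reusing the bucket/folder placeholders from the globstring above:

import gcsfs

fs = gcsfs.GCSFileSystem()
# glob returns paths without the protocol prefix, so add it back
files = [f"gs://{path}" for path in fs.glob(f"gs://bucket/{folder}/*.csv")]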

But this would increase the execution time, especially when some directories contain thousands of files. And even then, I'm not sure how I would pass the per-file dictionary values to dask's dataframe.read_csv.

My question

Is there a way for me to pass a function to dask.dataframe.read_csv that would allow dynamic skiprows, and perhaps also dynamic columns, for each CSV file in the Google bucket folder matched by the globstring?

AFAIK this is not possible via dd.read_csv, but you can construct a dask.dataframe with .from_delayed, where each delayed object wraps a function that normalizes one CSV file and returns a pandas dataframe.

Note that from_delayed will expect consistent column names and dtypes, so this is something that should be handled inside the function.
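
A minimal sketch of that approach, assuming the file_analysis dict built in the question; the normalize_csv helper, the string dtype, and filling missing columns with empty strings are illustrative assumptions, not the only choice:

import dask
import dask.dataframe as dd
import pandas as pd

# union of column names seen across all files
all_columns = sorted({col for params in file_analysis.values() for col in params["names"]})

@dask.delayed
def normalize_csv(filepath, params, columns):
    # read one file with its own skiprows/names/dtype
    df = pd.read_csv(filepath, **params)
    # insert any missing columns so every partition has the same schema
    for col in columns:
        if col not in df.columns:
            df[col] = ""
    # enforce a consistent column order and dtype
    return df[columns].astype(str)

delayed_frames = [
    normalize_csv(path, params, all_columns)
    for path, params in file_analysis.items()
]

# `meta` describes the column names/dtypes so dask does not have to guess them
meta = pd.DataFrame({col: pd.Series(dtype=str) for col in all_columns})
ddf = dd.from_delayed(delayed_frames, meta=meta)

Each file is only read when the graph is computed, so the normalization runs in parallel across the cluster rather than in a serial loop on the client.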
