简体   繁体   中英

Convert dask to pandas dataframe

I have a quite similiar question to this one: Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

I am running the following script:

import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df.compute()

And i get the following errror: ValueError: Mismatched dtypes found in pd.read_csv / pd.read_table .

How can I avoid that? I have over 32 columns, so i can't setup the dtypes upfront. As a hint it is also written: Specify dtype option on import or set low_memory=False

When Dask loads your CSV, it tries to derive the dtypes from the header of the file, and then assumes that the rest of the parts of the files have the same dtypes for each column. Sine pandas types from csv depend on the set of values seen, this is where the error comes from.

To fix, you either have to explicitly tell dask what types to expect, or increase the size of the portion dask tries to guess types from ( sample= ).

The error message should have told you which columns were not matching and the types found, so you only need to specify those to get things working.

Maybe try this:

df = pd.DataFrame()
df = df2.compute()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM