Merge dataframe with dask and convert it to pandas

Question

I have two dataframes.

dataframe1:

    > df_case = dd.read_csv('s3://../.../df_case.csv')
    > df_case.head(1)
                 sacc_id$                 id$        creation_date
    0  001A000000hwvV0IAI  5001200000ZnfUgAAJ  2016-06-07 14:38:02

dataframe2:

    > df_limdata = dd.read_csv('s3://../.../df_limdata.csv')
    > df_limdata.head(1)
                 sacc_id$        opp_line_id$ oppline_creation_date
    0  001A000000hAUn8IAG  a0W1200000G0i3UEAR            2015-06-10

First, I merged the two dataframes:

    > case = dd.merge(df_limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
    > case

    Dask DataFrame Structure:
                  Unnamed: 0_x sacc_id$ opp_line_id$_x oppline_creation_date_x Unnamed: 0_y opp_line_id$_y oppline_creation_date_y
    npartitions=5
                         int64   object         object                  object        int64         object                  object
                           ...      ...            ...                     ...          ...            ...                     ...
    ...                    ...      ...            ...                     ...          ...            ...                     ...
                           ...      ...            ...                     ...          ...            ...                     ...
                           ...      ...            ...                     ...          ...            ...                     ...
    Dask Name: hash-join, 78 tasks

Then I tried to convert the merged dask dataframe to a pandas dataframe:

    > # conversion to pandas
    > df = case.compute()

I get this error:

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+------------+---------+----------+
| Column     | Found   | Expected |
+------------+---------+----------+
| Unnamed: 0 | float64 | int64    |
+------------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Unnamed: 0': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

Can you help me resolve this problem, please?

Thank you.

Answer 1

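When reading the file, dask inferred the dtypes from a sample at the start of the data and concluded that the column "Unnamed: 0" is int64. Later, when computing the full result, it found a partition where that column is float64, which typically happens when an integer column contains missing values.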
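To illustrate the mismatch, here is a small pandas-only sketch. The two CSV strings are made-up stand-ins for two partitions of the same file (they are not the asker's data); the second one has a missing value in "Unnamed: 0", so pandas infers a different dtype for it.

```python
import io
import pandas as pd

# Hypothetical data: two "partitions" of one CSV file, each with the header repeated.
partition_1 = "Unnamed: 0,sacc_id$\n0,a\n1,b\n"
partition_2 = "Unnamed: 0,sacc_id$\n,c\n3,d\n"  # first value is missing

# Without an explicit dtype, each partition is inferred on its own.
inferred = [
    str(pd.read_csv(io.StringIO(text))["Unnamed: 0"].dtype)
    for text in (partition_1, partition_2)
]
print(inferred)  # ['int64', 'float64']

# With an explicit dtype, both partitions agree.
fixed_dtypes = [
    str(pd.read_csv(io.StringIO(text), dtype={"Unnamed: 0": "float64"})["Unnamed: 0"].dtype)
    for text in (partition_1, partition_2)
]
print(fixed_dtypes)  # ['float64', 'float64']
```

The missing value becomes NaN, which cannot be stored in an int64 column, so pandas falls back to float64 for that partition only.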

To fix this, specify the dtype explicitly when reading the files:

    import dask.dataframe as dd

    # Declare the column as float64 up front so every partition agrees
    df_case = dd.read_csv('s3://../.../df_case.csv', dtype={'Unnamed: 0': 'float64'})
    df_limdata = dd.read_csv('s3://../.../df_limdata.csv', dtype={'Unnamed: 0': 'float64'})

    # Both frames share the key name, so a single `on` is enough
    case = dd.merge(df_limdata, df_case, on='sacc_id$')

    # conversion to pandas
    df = case.compute()
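As a side note, a column named "Unnamed: 0" is usually a saved pandas index. The sketch below assumes the CSV files were written with `DataFrame.to_csv` and its default `index=True`, which the question does not confirm; the sample frame and its values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the data behind df_case.csv.
df = pd.DataFrame({"sacc_id$": ["001A", "001B"], "id$": ["500x", "500y"]})

# Default to_csv writes the index as a first column with an empty header,
# which read_csv then names "Unnamed: 0".
with_index = df.to_csv()
cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)

# Writing with index=False avoids the extra column entirely.
without_index = df.to_csv(index=False)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

# If the files cannot be rewritten, the column can be skipped while reading.
cols_skipped = list(
    pd.read_csv(
        io.StringIO(with_index),
        usecols=lambda name: not name.startswith("Unnamed"),
    ).columns
)

print(cols_with_index)
print(cols_without_index)
print(cols_skipped)
```

If the column is not needed for the merge, leaving it out may also sidestep the dtype mismatch. `dd.read_csv` forwards extra keyword arguments to `pandas.read_csv`, so the same `usecols` argument should be usable there as well.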

solution1
1 ACCEPTED 2019-01-31 18:59:11