Merge dataframe with dask and convert it to pandas
I have two dataframes:
dataframe1:
>df_case = dd.read_csv('s3://../.../df_case.csv')
>df_case.head(1)
sacc_id$ id$ creation_date
0 001A000000hwvV0IAI 5001200000ZnfUgAAJ 2016-06-07 14:38:02
dataframe2:
>df_limdata = dd.read_csv('s3://../.../df_limdata.csv')
>df_limdata.head(1)
sacc_id$ opp_line_id$ oppline_creation_date
0 001A000000hAUn8IAG a0W1200000G0i3UEAR 2015-06-10
First, I merged the two dataframes:
> case = dd.merge(df_limdata, df_case, left_on='sacc_id$',right_on='sacc_id$')
>case
Dask DataFrame Structure:
Unnamed: 0_x sacc_id$ opp_line_id$_x oppline_creation_date_x Unnamed: 0_y opp_line_id$_y oppline_creation_date_y
npartitions=5
int64 object object object int64 object object
... ... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ... ...
Dask Name: hash-join, 78 tasks
Then I tried to convert this merged case dataframe to a pandas dataframe:
> # conversion to pandas
df = case.compute()
I got this error:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+---------+----------+
| Column | Found | Expected |
+------------+---------+----------+
| Unnamed: 0 | float64 | int64 |
+------------+---------+----------+
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Unnamed: 0': 'float64'}
to the call to `read_csv`/`read_table`.
Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.
Could you help me solve this problem?
Thanks.
When reading the file, dask inferred from a sample that the column `Unnamed: 0` has dtype int64, but later, during the actual computation, it turned out to be float64.
So you need to specify the dtype explicitly when reading the files:
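The underlying cause is a pandas behavior: an integer column that contains missing values gets promoted to float64, so a dtype inferred from the first rows of a file can disagree with the full read. A minimal sketch with plain pandas (hypothetical in-memory data, not the asker's files) shows the mismatch:

```python
import io
import pandas as pd

# A small CSV whose "Unnamed: 0" column has a missing value further down.
csv = "Unnamed: 0,val\n0,a\n1,b\n,c\n"

# Reading only the first rows (as dask does when sampling) infers int64...
sample = pd.read_csv(io.StringIO(csv), nrows=2)
# ...but reading the whole file yields float64 because of the NaN.
full = pd.read_csv(io.StringIO(csv))

print(sample["Unnamed: 0"].dtype)  # int64
print(full["Unnamed: 0"].dtype)    # float64
```

Dask only reads a sample of each CSV up front to guess dtypes, which is why the error surfaces at `compute()` time rather than at `read_csv` time.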
df_case = dd.read_csv('s3://../.../df_case.csv', dtype={'Unnamed: 0': 'float64'})
df_limdata = dd.read_csv('s3://../.../df_limdata.csv', dtype={'Unnamed: 0': 'float64'})
case = dd.merge(df_limdata, df_case, left_on='sacc_id$',right_on='sacc_id$')
# conversion to pandas
df = case.compute()
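The effect of passing `dtype=` can be seen in miniature with plain pandas (same hypothetical data as above): once the dtype is given explicitly, a sampled read and a full read agree, which is exactly what the fix achieves for dask's sampling-based inference.

```python
import io
import pandas as pd

csv = "Unnamed: 0,val\n0,a\n1,b\n,c\n"

# With an explicit dtype, a partial read and a full read now agree.
sample = pd.read_csv(io.StringIO(csv), nrows=2, dtype={"Unnamed: 0": "float64"})
full = pd.read_csv(io.StringIO(csv), dtype={"Unnamed: 0": "float64"})

print(sample["Unnamed: 0"].dtype == full["Unnamed: 0"].dtype)  # True
```

Alternatively, as the error message itself suggests, `assume_missing=True` in `dd.read_csv` treats all unspecified integer columns as floats, which avoids having to list each problematic column by name.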