How to join two dataframes with different key column names in pydatatable?
I have a dataframe X:
import datatable as dt
from datatable import f, join

DT_X = dt.Frame({
    'date': ['2020-09-01', '2020-09-02', '2020-09-03'],
    'temp': [35.3, 32.9, 43.2]
})
Out[4]:
| date temp
-- + ---------- ----
0 | 2020-09-01 35.3
1 | 2020-09-02 32.9
2 | 2020-09-03 43.2
[3 rows x 2 columns]
I have another dataframe Y:
DT_Y = dt.Frame({
'stop_date' : ['2020-08-01','2020-09-01','2020-09-03','2020-09-07'],
'is_arrested':[True,False,False,True]
})
Out[6]:
| stop_date is_arrested
-- + ---------- -----------
0 | 2020-08-01 1
1 | 2020-09-01 0
2 | 2020-09-03 0
3 | 2020-09-07 1
[4 rows x 2 columns]
Now I would like to join X and Y. For that, I need to set a key on the X dataframe:
DT_X.key='date'
Out[8]:
date | temp
---------- + ----
2020-09-01 | 35.3
2020-09-02 | 32.9
2020-09-03 | 43.2
[3 rows x 2 columns]
Next, I join X and Y:
DT_Y[:,:,join(DT_X)]
This raises an error:
In [9]: DT_Y[:,:,join(DT_X)]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-a3bc1690fb98> in <module>
----> 1 DT_Y[:,:,join(DT_X)]
ValueError: Key column `date` does not exist in the left Frame
Of course, date does not exist in DT_Y; the corresponding column there is named stop_date.
How can I perform the join in this scenario, i.e. when the key column names do not match?
Note: a workaround is to rename the column in DT_Y:
DT_Y.names = {'stop_date':'date'}
DT_Y[:,:,join(DT_X)]
The joined frame then looks like this:
Out[11]:
| date is_arrested temp
-- + ---------- ----------- ----
0 | 2020-08-01 1 NA
1 | 2020-09-01 0 35.3
2 | 2020-09-03 0 43.2
3 | 2020-09-07 1 NA
[4 rows x 3 columns]
Here is the expected output (note that the key column keeps the name stop_date):
Out[13]:
| stop_date is_arrested temp
-- + ---------- ----------- ----
0 | 2020-08-01 1 NA
1 | 2020-09-01 0 35.3
2 | 2020-09-03 0 43.2
3 | 2020-09-07 1 NA
[4 rows x 3 columns]
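To pin down the semantics being asked for, here is a plain-Python sketch of the same left join. It does not use datatable at all; the names x_rows, y_rows, temp_by_date and expected are made up for illustration, and None stands in for NA.

```python
# Plain-Python sketch of the desired left join (no datatable involved).
# Rows of X as (date, temp) and rows of Y as (stop_date, is_arrested).
x_rows = [("2020-09-01", 35.3), ("2020-09-02", 32.9), ("2020-09-03", 43.2)]
y_rows = [
    ("2020-08-01", True),
    ("2020-09-01", False),
    ("2020-09-03", False),
    ("2020-09-07", True),
]

# Index X by its key column, which mirrors DT_X.key = 'date'.
temp_by_date = {date: temp for date, temp in x_rows}

# Keep every row of Y and look up temp by matching stop_date against date.
# Rows of Y without a match in X get None (NA in the datatable output).
expected = [
    (stop_date, is_arrested, temp_by_date.get(stop_date))
    for stop_date, is_arrested in y_rows
]

for row in expected:
    print(row)
```

Every row of Y survives, the key column keeps Y's own name, and only the dates present in X pick up a temp value.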
Right now, join() only supports key columns that have the same name in both frames; please refer to the documentation for more details. However, there is an open issue to improve the join functionality/API.
Meanwhile, if you prefer not to rename the columns of DT_Y, you can do the following:
DT_Y_date = DT_Y[:, {"date":f[0], "is_arrested":f[1]}]
DT_YX_joined = DT_Y_date[:, :, join(DT_X)]
Then DT_YX_joined will have the data you are looking for:
| date is_arrested temp
-- + ---------- ----------- ----
0 | 2020-08-01 1 NA
1 | 2020-09-01 0 35.3
2 | 2020-09-03 0 43.2
3 | 2020-09-07 1 NA
You can even write it as a one-liner:
DT_YX_joined = DT_Y[:, {"date":f[0], "is_arrested":f[1]}][:, :, join(DT_X)]
but it may not be readable enough. Also note that no data copies are created here; only the column name changes.