Python - How to compare two columns with mixed character strings that still represent the same value?
I have two dataframes like this:
codename =
id   code   region
1    AAA    Alpha
2    BBB    Beta
3    CCC    Gamma
4    DDD    Delta
...  ...    ...
list =
id   region   code
1             BBB
2             DDD1
3             AAA
4             CCC10
5             AAA2
...           ...
I want to fill the region column in the second dataframe using the codes in the first dataframe. How do I compare these two code columns, given that in the second dataframe the codes carry trailing numbers but still represent the same region as the three-letter code at the start?
Both of my datasets are quite big, so what is the fastest way to insert the values? Thank you in advance!
What you want to do is called a join, i.e., filling in values in one table from another table by matching on a key column. pandas knows how to do this (doc).
First, you need to clean up the column you're joining on:
# create a new column with the first 3 letters of the values in the 'code' column
list['code_clean'] = list['code'].str.slice(0, 3)  # the stop index is exclusive, so 0:3 keeps 3 letters
# drop the empty column from the list df so there's no overlap in the target column
list.drop('region', axis=1, inplace=True)
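If the letter prefix is not always exactly three characters long, a regular expression may be more robust than a fixed-width slice. The following is only a sketch: it assumes each code starts with uppercase letters followed by optional digits, and the `codes` Series is a hypothetical sample built from the values in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the 'code' column from the question.
codes = pd.Series(["BBB", "DDD1", "AAA", "CCC10", "AAA2"], name="code")

# Capture the leading run of uppercase letters.
# expand=False returns a Series instead of a one-column DataFrame.
code_clean = codes.str.extract(r"^([A-Z]+)", expand=False)

print(code_clean.tolist())
```

Values that do not start with an uppercase letter would come back as NaN under this pattern, so it is worth checking for missing values before joining.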
Now we can join on the key column (in your case it's the 'code' column). pandas requires that the key column be the index of the 'other' dataframe:
list = list.join(codename.set_index('code'), on='code_clean')
list
out:
id region code code_clean
1 Beta BBB BBB
2 Delta DDD1 DDD
3 Alpha AAA AAA
4 Gamma CCC10 CCC
5 Alpha AAA2 AAA
Also, never use a Python built-in name as a variable name (here, the "list" dataframe). Doing so shadows the built-in and can lead to unexpected behavior.
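As a rough end-to-end sketch of the same idea with a name that does not shadow a built-in, the lookup can also be written with `Series.map` rather than `DataFrame.join`. This is a swapped-in alternative, not the join shown earlier. The sample data is rebuilt from the question, `records` is a hypothetical replacement name for the "list" dataframe, and the sketch assumes each code appears only once in `codename`.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
codename = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "region": ["Alpha", "Beta", "Gamma", "Delta"],
})

# 'records' is a hypothetical name standing in for the 'list' dataframe.
records = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "code": ["BBB", "DDD1", "AAA", "CCC10", "AAA2"],
})

# Build a code -> region lookup once (assumes 'code' is unique in codename).
region_by_code = codename.set_index("code")["region"]

# Map the three-letter prefix of each code to its region.
records["region"] = records["code"].str.slice(0, 3).map(region_by_code)

print(records)
```

Prefixes with no match in `codename` end up as NaN in `region`, which makes unmatched rows easy to spot afterwards.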