Python - How to compare two columns that contain mixed strings but still represent the same value?

Question

I have two dataframes like this:

codename = 

id       code       region
1        AAA        Alpha
2        BBB        Beta
3        CCC        Gamma
4        DDD        Delta
...      ...        ...

list = 

id       region     code
1                   BBB
2                   DDD1
3                   AAA
4                   CCC10
5                   AAA2
...                 ...

I want to fill the region column in the second dataframe using the code in the first dataframe. How do I compare these two code columns, given that in the second dataframe the codes have numbers appended but still represent the same region as the three-letter code in the first?

Both of my datasets are quite big, so is there a way to fill in the values as fast as possible? Thank you in advance!

Answer 1

What you want to do is called a join, i.e., filling in values in one table from another table according to agreement on a key column. pandas knows how to do this (doc).

First, you need to clean up the column you're joining on:

# create a new column with the first 3 letters of values in the 'code' column
list['code_clean'] = list['code'].str.slice(0, 3)  # keep first 3 letters (the stop index is exclusive)
# drop the empty column from the list df so there's no overlap in the target column
list.drop('region', axis=1, inplace=True)
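If the codes are not always exactly three letters long, a fixed-width slice would no longer be enough. A minimal sketch of one alternative, assuming the region code is always the leading run of letters and anything after it is a numeric suffix (an assumption, not something the question states); `codes` is a hypothetical stand-in for the 'code' column:

```python
import pandas as pd

# Hypothetical sample standing in for the 'code' column of the second dataframe
codes = pd.Series(["BBB", "DDD1", "AAA", "CCC10", "AAA2"])

# Capture the leading run of letters; expand=False returns a Series
# instead of a one-column DataFrame
code_clean = codes.str.extract(r"^([A-Za-z]+)", expand=False)
print(code_clean.tolist())
```

Values that do not start with a letter would come back as NaN, so they are easy to spot before joining.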

Now we can join on the key column (in your case it's the 'code' column of codename, matched against the new 'code_clean' column). pandas requires that the column be the index for the 'other' dataframe:

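# select only 'region' from codename so that its 'id' column cannot clash with the 'id' column in list
list = list.join(codename.set_index('code')[['region']], on='code_clean')
list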

out:

id       code     code_clean     region
1        BBB      BBB            Beta
2        DDD1     DDD            Delta
3        AAA      AAA            Alpha
4        CCC10    CCC            Gamma
5        AAA2     AAA            Alpha
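For reference, the whole lookup can also be written as one self-contained sketch. It rebuilds the sample data from the question, names the second dataframe `code_list` (a hypothetical name, chosen so the built-in is not shadowed), and swaps `join` for `Series.map` with a lookup Series. This is an alternative rather than the method above, and it assumes each code appears only once in `codename`:

```python
import pandas as pd

# Sample data from the question
codename = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "region": ["Alpha", "Beta", "Gamma", "Delta"],
})
code_list = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "code": ["BBB", "DDD1", "AAA", "CCC10", "AAA2"],
})

# Lookup Series: index is the three-letter code, values are the regions
region_by_code = codename.set_index("code")["region"]

# Vectorized: take the first three characters and look each one up
code_list["region"] = code_list["code"].str[:3].map(region_by_code)
print(code_list)
```

Both the string slice and the lookup are vectorized, which matters for large datasets. Codes with no match in `codename` end up as NaN in the 'region' column.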

Also, never use a Python built-in name as a variable name (the "list" dataframe). It can and will lead to unexpected behavior.
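A small illustration of what goes wrong when a built-in is shadowed. The function name `shadowing_demo` is hypothetical, and the shadowing is kept inside the function so it does not leak into the rest of the program:

```python
def shadowing_demo():
    # Rebinding the name 'list' hides the built-in within this scope
    list = ["BBB", "DDD1", "AAA"]
    try:
        # This would normally build ['a', 'b', 'c'], but 'list' is now
        # an ordinary list object, which is not callable
        list("abc")
    except TypeError as exc:
        return type(exc).__name__
    return "no error"

result = shadowing_demo()
print(result)
```

The same thing happens when the name is bound to a dataframe: any later call to `list(...)` in that scope fails.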

Solution 1
2 Accepted 2021-11-08 14:38:12