Python - How to compare two columns with mixed character strings but still represent the same value?

I have two dataframes like this:

codename = 

id       code       region
1        AAA        Alpha
2        BBB        Beta
3        CCC        Gamma
4        DDD        Delta
...      ...        ...

list = 

id       region     code
1                   BBB
2                   DDD1
3                   AAA
4                   CCC10
5                   AAA2
...                 ...

I want to fill the region column in the second dataframe using the code column of the first dataframe. How do I compare these two code columns, given that in the second dataframe the codes can carry a trailing number but still represent the same region as the three-letter code?

Both of my datasets are quite big, so is there a way to fill in the values as fast as possible? Thank you in advance!

What you want to do is called a join, i.e., filling in values in one table from another table by matching rows on a key column. pandas knows how to do this (doc).
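To illustrate why the key needs cleaning first, here is a minimal sketch of a plain left join on the raw codes. The sample data is modeled on the tables in the question, and the name `orders` is a hypothetical stand-in for the second dataframe.

```python
import pandas as pd

# Hypothetical sample data modeled on the tables in the question.
codename = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "region": ["Alpha", "Beta", "Gamma", "Delta"],
})
orders = pd.DataFrame({"code": ["BBB", "DDD1", "AAA", "CCC10", "AAA2"]})

# A left join keeps every row of `orders` and looks up `region` by exact key match.
joined = orders.merge(codename, on="code", how="left")

# Codes with a numeric suffix have no exact match, so their region is NaN.
unmatched = joined["region"].isna().tolist()
print(joined)
```

Only `BBB` and `AAA` find a region here; the suffixed codes come back empty, which is the problem the cleaning step addresses.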

First, you need to clean up the column you're joining on:

# create a new column with the first 3 letters of the values in the 'code' column
list['code_clean'] = list['code'].str.slice(0, 3)  # end index is exclusive, so 0:3 keeps 3 letters
# drop the empty column from the list df so there's no overlap in the target column
list.drop('region', axis=1, inplace=True)
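If the letter part is not always exactly three characters long, a regular expression may be a safer way to build the cleaned key. This is a sketch of an alternative, not part of the original answer, and it assumes the codes start with uppercase ASCII letters followed by optional digits.

```python
import pandas as pd

codes = pd.Series(["BBB", "DDD1", "AAA", "CCC10", "AAA2"], name="code")

# Capture the leading run of uppercase letters; expand=False returns a Series.
code_clean = codes.str.extract(r"^([A-Z]+)", expand=False)
print(code_clean.tolist())  # ['BBB', 'DDD', 'AAA', 'CCC', 'AAA']
```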

Now we can join on the key column (in your case, the 'code' column). pandas requires that the key column be the index of the 'other' dataframe:

list = list.join(codename.set_index('code'), on='code_clean')
list

out:

id       region     code     code_clean
1        Beta       BBB      BBB
2        Delta      DDD1     DDD
3        Alpha      AAA      AAA
4        Gamma      CCC10    CCC
5        Alpha      AAA2     AAA
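Since only one column is being filled in, another option is to look the region up with `Series.map` instead of a full join. The following is a self-contained sketch under the assumption that each code appears only once in `codename`; `records` is a hypothetical name for the second dataframe, and whether this is faster than `join` on your data is something to measure rather than assume.

```python
import pandas as pd

# Hypothetical sample data modeled on the tables in the question.
codename = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "region": ["Alpha", "Beta", "Gamma", "Delta"],
})
records = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "code": ["BBB", "DDD1", "AAA", "CCC10", "AAA2"],
})

# Build a code -> region lookup Series (assumes codes in `codename` are unique).
lookup = codename.set_index("code")["region"]

# Normalize the key, then map each cleaned code to its region.
records["code_clean"] = records["code"].str.slice(0, 3)
records["region"] = records["code_clean"].map(lookup)
print(records)
```

Selecting only the `region` column for the lookup also sidesteps any clash between the `id` columns of the two dataframes.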

Also, never use a Python built-in name as a variable name (the "list" dataframe). It can and will lead to unexpected behavior.
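A small sketch of the kind of failure this causes. A plain dict stands in for the dataframe here so the example needs no extra libraries, and the function name `shadowed` is made up for illustration.

```python
def shadowed():
    # Rebinding `list` hides the built-in type for the rest of this scope.
    list = {"code": ["BBB", "DDD1"]}  # stand-in for the dataframe named `list`
    try:
        # Intended as the built-in constructor, but `list` is now a dict.
        return list("abc")
    except TypeError as exc:
        return f"TypeError: {exc}"


result = shadowed()
print(result)  # reports that the object is not callable
```

A name such as `records` or `df_list` avoids the problem entirely.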

