
Cleaning country code cells with more than 2 characters in pandas

I'm searching for some advice.

I have a data frame with 67,000 records. The problem is that my "Country" column was previously a free-text field (it is now a dropdown selection), so there are various values for the same country. For example, there are DE, Germany, Alemania, etc., which means I cannot just take the first 2 characters of each string: in the example above, German sales would be moved into Georgia.

I was wondering if anyone has run into this problem before and has a solution. I'm thinking I should change all the strings with more than 2 characters to "unlisted" and carry out a separate analysis on them. I am not sure how to go about selecting the bad cells.

Would this be done with a regex, or with a df.query?

Thanks in advance!

Use:

import pandas as pd

df = pd.DataFrame({'cc':['Germany', 'GE', 'IR']})
# keep only the rows whose value is exactly two characters long
df[df['cc'].str.len()==2]

Result:

   cc
1  GE
2  IR
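
The same length test can also be used as a mask to do what the question proposes, that is, label everything that is not a two-character code as "unlisted" instead of dropping it. This is only a minimal sketch: the column name cc follows the example above, and the cc_clean column and the sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'cc': ['Germany', 'GE', 'IR', 'Alemania']})

# True only for values that are exactly two characters long
is_code = df['cc'].str.len() == 2

# keep the two-character codes, replace everything else with a placeholder
df['cc_clean'] = df['cc'].where(is_code, 'unlisted')

# the rows set aside for the separate analysis
unlisted_rows = df[~is_code]
```

Note that this only checks the length; a two-character value that is not a real country code would still pass.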

For the main problem, I think you should keep a list of country names, compare each value longer than two characters against that list, and select the most similar name.
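
One possible way to do the "most similar" comparison is the standard library's difflib.get_close_matches. This is a sketch under assumptions, not a complete solution: the name_to_code table, the to_code helper, and the 0.8 cutoff are hypothetical, and a real table would need every country and every spelling or language you expect to see.

```python
import difflib
import pandas as pd

# Hypothetical reference table: lower-cased country name -> two-letter code
name_to_code = {
    'germany': 'DE',
    'alemania': 'DE',
    'deutschland': 'DE',
    'georgia': 'GE',
    'iran': 'IR',
}

def to_code(value, cutoff=0.8):
    """Return a two-letter code for value, or 'unlisted' if no name is close enough."""
    text = str(value).strip()
    if len(text) == 2:
        # already looks like a code; leave it as it is (upper-cased)
        return text.upper()
    matches = difflib.get_close_matches(
        text.lower(), list(name_to_code), n=1, cutoff=cutoff
    )
    return name_to_code[matches[0]] if matches else 'unlisted'

df = pd.DataFrame({'cc': ['Germany', 'GE', 'Alemania', 'Germny', 'Atlantis']})
df['cc_clean'] = df['cc'].map(to_code)
```

Values that stay "unlisted" can then be reviewed by hand, and the cutoff can be raised or lowered depending on how many wrong matches turn up.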

