Cleaning country code cells with more than 2 characters in pandas

Question

I'm searching for some advice.

I have a data frame with 67,000 records. The problem is that my "Country" column was previously a free-text field (it is now a dropdown selection), so there are various values for the same country. For example, there is DE, Germany, Alemania, etc., which means that I cannot just take the first 2 characters of each string because, in the example above, German sales would be moved into Georgia.

I was wondering if anyone has had experience with this problem before and has a solution? I'm thinking I should change all the strings with more than 2 characters to "unlisted" and carry out a separate analysis there. I am not too sure how to go about selecting the bad cells.

Would this be done with regex, or with a df.query?

Thanks in advance!

Answer 1

Use:

import pandas as pd

df = pd.DataFrame({'cc': ['Germany', 'GE', 'IR']})
# keep only the rows whose value is exactly two characters long
df[df['cc'].str.len() == 2]

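Result: only the rows holding the two-character values GE and IR are kept; the Germany row is filtered out.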

For the main problem, I think you should have a list of country names, compare each value longer than two characters against that list, and select the most similar name.
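The comparison against a list of names could be sketched as below. This is only a rough illustration, not a complete solution: the NAME_TO_CODE table and the to_country_code helper are hypothetical names made up for this example, the table is deliberately tiny, and the 0.8 similarity cutoff is an assumption you would need to tune on your own data. It uses difflib.get_close_matches from the standard library to pick the most similar name, and falls back to "unlisted" (as suggested in the question) when nothing is close enough.

```python
import difflib

import pandas as pd

# Hypothetical lookup table: known name variants -> two-letter code.
# A real table would need to cover every spelling you expect to see.
NAME_TO_CODE = {
    "germany": "DE",
    "alemania": "DE",
    "deutschland": "DE",
    "georgia": "GE",
    "iran": "IR",
}


def to_country_code(value, cutoff=0.8):
    """Return a two-letter code, or "unlisted" if no known name is similar enough."""
    text = str(value).strip()
    if len(text) == 2:
        # Already looks like a code; keep it (upper-cased).
        return text.upper()
    # Most similar known name, if any scores at or above the cutoff.
    matches = difflib.get_close_matches(
        text.lower(), list(NAME_TO_CODE), n=1, cutoff=cutoff
    )
    return NAME_TO_CODE[matches[0]] if matches else "unlisted"


df = pd.DataFrame({"cc": ["Germany", "GE", "IR", "Alemania", "Germny", "???"]})
df["cc_clean"] = df["cc"].map(to_country_code)
print(df)
```

Rows that end up as "unlisted" can then be selected with df[df["cc_clean"] == "unlisted"] and reviewed separately. Note that the sketch keeps any two-character value as it is, without checking that it is a valid country code.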
