
Get country from data in column in dataframe python

I have an excel.csv file that contains a column called Made From that identifies the country or region, e.g. Made in Australia, Made in HK, or a region only, as in Made in APAC.

So my goal is to identify the country or region, no matter whether it is given as an ISO country code or a region name, and put it into a new column called Country and another column called Region respectively.

Currently, I am trying this code to no avail: df["Country"] = df["Made From"].apply(lambda x: x if x in countries else "Global"). Here countries is a list, countries = ["Australia", "Mexico"...], that I have partly written, but I am not sure whether there is a better way, or a solution out there that has a full list of all ISO codes and region names. If there are no standard region names, I could always make up a list, as only a few regions are named.

Please help me on this as I am stuck here. Please let me know if there is any more clarification needed. Thank you.

I am coding in Python

UPDATE: As requested, input data and expected output.

input

|Made From        |
-------------------
|Made in Australia|
|Made in HK       |
|Made in APAC     |
|UK Made          |

Expected Output

|Made From        |Country       |Region|
------------------------------------------
|Made in Australia|Australia     |APAC  |
|Made in HK       |Hong Kong     |APAC  |
|Made in APAC     |              |APAC  |
|UK Made          |United Kingdom|Europe|

I think your approach is ok if you can manage to create a complete countries list. Here is a small example:

import pandas as pd
import pycountry_convert as pc
import pycountry
from itertools import compress

continents = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'APAC',
    'OC': 'APAC',
    'AF': 'Africa',
    'EU': 'Europe'
}

regions = ["APAC", "EMEA", "NA", "SA"]

lc = list(pycountry.countries)

# "UK" is not the ISO 3166-1 code for the United Kingdom (that is "GB"), so handle it separately
exceptions = {"UK": "EMEA"}
# Accept full names, alpha-2 codes and alpha-3 codes, plus the "UK" exception
countries = [c.name for c in lc] + \
            [c.alpha_2 for c in lc] + \
            [c.alpha_3 for c in lc] + ["UK"]
        
dataset = pd.DataFrame([[1,"Made in Australia"],[1,"Made in HK"],[1,"Made in APAC"], [1,"UK Made"]], 
                       columns = ["Other Columns", "Made From"])

# Assuming words are separated by single spaces: take the first word that is a
# known country (or region), falling back to "Global" (or "Local") if none matches
dataset["Countries"] = dataset["Made From"].apply(lambda x: next((w for w in x.split(" ") if w in countries), "Global"))
dataset["Regions"] = dataset["Made From"].apply(lambda x: next((w for w in x.split(" ") if w in regions), "Local"))

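# For rows with no explicit region word, derive the region from the country
# (column 2 is "Countries", column 3 is "Regions")
for i in range(len(dataset)):
    if dataset.iloc[i,3] == "Local":
        try:
            # First try the value as a full country name
            c = pc.country_name_to_country_alpha2(dataset.iloc[i,2], cn_name_format="default")
            c = pc.country_alpha2_to_continent_code(c)
            dataset.iat[i,3] = c
        except Exception:
            try:
                # Then as an alpha-2 code
                c = pc.country_alpha2_to_continent_code(dataset.iloc[i,2])
                dataset.iat[i,3] = c
            except Exception:
                try:
                    # Then as an alpha-3 code
                    c = pc.country_alpha3_to_continent_code(dataset.iloc[i,2])
                    dataset.iat[i,3] = c
                except Exception:
                    # Finally fall back to the manual exceptions
                    if dataset.iloc[i,2] in exceptions:
                        dataset.iat[i,3] = exceptions[dataset.iloc[i,2]]
        try:
            # Translate the continent code into a region name where one is defined
            dataset.iat[i,3] = continents[dataset.iloc[i,3]]
        except KeyError:
            pass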
dataset

    Other Columns   Made From           Countries   Regions
0   1               Made in Australia   Australia   APAC
1   1               Made in HK          HK          APAC
2   1               Made in APAC        Global      APAC
3   1               UK Made             UK          EMEA

Alternatively, if there are only a small number of non-country words in the column, such as 'made' and 'in', you can use .str.replace to get rid of those words and then copy the result to the country column.
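A minimal sketch of that idea, written in plain Python so it can be tried without a dataframe. The filler-word list and the name strip_filler are assumptions for illustration and are not taken from the question's data.

```python
import re

# Hypothetical filler words; extend this to match your own data.
FILLER_WORDS = ("made", "in", "from")

# \b keeps "in" from matching inside words such as "India" or "China".
# IGNORECASE also strips "IN" (India's alpha-2 code); drop the flag if
# bare codes like that can appear in the column.
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLER_WORDS) + r")\b", flags=re.IGNORECASE)

def strip_filler(text):
    """Remove filler words and collapse the leftover whitespace."""
    cleaned = _FILLER_RE.sub(" ", text)
    return " ".join(cleaned.split())

samples = ["Made in Australia", "Made in HK", "Made in APAC", "UK Made"]
cleaned = [strip_filler(s) for s in samples]
```

On a dataframe the same function could be applied with df["Made From"].apply(strip_filler). What is left is the raw country or region token, which still needs standardising.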

The second part is then to standardise the country names. That requires a list or dict to compare against.
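One way to sketch the comparison step is a hand-written dict from cleaned token to a (country, region) pair. The entries below are only the four values from the question's expected output. LOOKUP and standardise are hypothetical names, and the "Global" default is borrowed from the question's own code.

```python
# Hypothetical lookup table: cleaned token -> (standard country name, region).
# A region-only token such as "APAC" maps to an empty country.
LOOKUP = {
    "Australia": ("Australia", "APAC"),
    "HK": ("Hong Kong", "APAC"),
    "APAC": ("", "APAC"),
    "UK": ("United Kingdom", "Europe"),
}

def standardise(token, default=("", "Global")):
    """Return (country, region) for a cleaned token, or default if unknown."""
    return LOOKUP.get(token.strip(), default)

rows = [standardise(t) for t in ["Australia", "HK", "APAC", "UK", "Atlantis"]]
```

With the tokens in a dataframe column, the two tuple elements can then be assigned to the Country and Region columns. For a full country list, the dict could be built from pycountry instead of being typed by hand.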
