I have an excel.csv file with a column called Made From that identifies the country or region, e.g. Made in Australia, Made in HK, or even a region such as Made in APAC.
My goal is to identify the country or region regardless of whether the value uses an ISO country code, a country name, or a region name, and to put the results into two new columns called Country and Region respectively.
Currently I am trying this code, to no avail: df["Country"] = df["Made From"].apply(lambda x: x if x in countries else "Global"). Here countries is a list, countries = ["Australia", "Mexico"...], that I have partly written by hand, but I am not sure whether there is a better way, or an existing solution with a full list of all ISO codes and region names. If there are no standard region names, I could always write my own list, since only a few regions are used.
Please help, as I am stuck here, and let me know if any further clarification is needed. Thank you.
I am coding in Python.
UPDATE: As requested, input data and expected output.
Input
|Made From        |
-------------------
|Made in Australia|
|Made in HK       |
|Made in APAC     |
|UK Made          |
Expected Output
|Made From        |Country       |Region|
-----------------------------------------
|Made in Australia|Australia     |APAC  |
|Made in HK       |Hong Kong     |APAC  |
|Made in APAC     |              |APAC  |
|UK Made          |United Kingdom|Europe|
I think your approach is ok if you can manage to create a complete countries list. Here is a small example:
import pandas as pd
import pycountry_convert as pc
import pycountry

# Map the continent codes returned by pycountry_convert to the region names used here
continents = {
    'NA': 'North America',
    'SA': 'South America',
    'AS': 'APAC',
    'OC': 'APAC',
    'AF': 'Africa',
    'EU': 'Europe'
}
regions = ["APAC", "EMEA", "NA", "SA"]
lc = list(pycountry.countries)

# "UK" is not an official ISO 3166-1 code (the United Kingdom is "GB"), so handle it separately
exceptions = {"UK": "EMEA"}

# Every ISO country name, alpha-2 code and alpha-3 code, plus the non-ISO "UK"
countries = [c.name for c in lc] + \
            [c.alpha_2 for c in lc] + \
            [c.alpha_3 for c in lc] + ["UK"]

dataset = pd.DataFrame([[1, "Made in Australia"], [1, "Made in HK"], [1, "Made in APAC"], [1, "UK Made"]],
                       columns=["Other Columns", "Made From"])

# Assuming the words are properly separated by single spaces:
# take the first word that is a known country (or region), otherwise a default
dataset["Countries"] = dataset["Made From"].apply(
    lambda x: next((w for w in x.split(" ") if w in countries), "Global"))
dataset["Regions"] = dataset["Made From"].apply(
    lambda x: next((w for w in x.split(" ") if w in regions), "Local"))

# Column 2 is "Countries", column 3 is "Regions"
for i in range(len(dataset)):
    if dataset.iloc[i, 3] == "Local":
        # No region in the text: derive it from the country, trying
        # the full name first, then the alpha-2 code, then the alpha-3 code
        try:
            c = pc.country_name_to_country_alpha2(dataset.iloc[i, 2], cn_name_format="default")
            c = pc.country_alpha2_to_continent_code(c)
            dataset.iat[i, 3] = c
        except Exception:
            try:
                c = pc.country_alpha2_to_continent_code(dataset.iloc[i, 2])
                dataset.iat[i, 3] = c
            except Exception:
                try:
                    c = pc.country_alpha3_to_continent_code(dataset.iloc[i, 2])
                    dataset.iat[i, 3] = c
                except Exception:
                    # Fall back to the hand-written exceptions
                    if dataset.iloc[i, 2] in exceptions:
                        dataset.iat[i, 3] = exceptions[dataset.iloc[i, 2]]
    # Translate a continent code into a region name; leave other values unchanged
    try:
        dataset.iat[i, 3] = continents[dataset.iloc[i, 3]]
    except KeyError:
        pass

dataset
   Other Columns          Made From  Countries Regions
0              1  Made in Australia  Australia    APAC
1              1         Made in HK         HK    APAC
2              1       Made in APAC     Global    APAC
3              1            UK Made         UK    EMEA
Alternatively, if the column contains only a small number of non-country words such as 'made' and 'in', you can use .str.replace to remove those words and then copy the result to the country column.
The second part is then to standardise the country names, which requires a list or dict to compare against.
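As a rough illustration of those two steps, here is a minimal plain-Python sketch. The filler-word list, the LOOKUP dict and the function names extract_place and standardise are assumptions made up for this example, not part of any library, and the dict only covers the four sample rows from the question; a real one would need many more entries.

```python
import re

# Filler words to strip. This list is an assumption; extend it for your data.
FILLER = re.compile(r"\b(?:made|in|from)\b", flags=re.IGNORECASE)

# Hypothetical hand-written lookup: cleaned token -> (country name, region).
# An empty country marks a region-only value such as "APAC".
LOOKUP = {
    "australia": ("Australia", "APAC"),
    "hk": ("Hong Kong", "APAC"),
    "apac": ("", "APAC"),
    "uk": ("United Kingdom", "Europe"),
}


def extract_place(text):
    """Remove the filler words and collapse the remaining whitespace."""
    return " ".join(FILLER.sub(" ", text).split())


def standardise(text, default=("", "Global")):
    """Return (country, region) for a raw 'Made From' value."""
    return LOOKUP.get(extract_place(text).lower(), default)


rows = ["Made in Australia", "Made in HK", "Made in APAC", "UK Made"]
results = [standardise(r) for r in rows]
for raw, (country, region) in zip(rows, results):
    print(f"{raw!r}: country={country!r}, region={region!r}")
```

With pandas, the same function could be applied to the column with df["Made From"].apply(standardise) and the resulting pairs split into the Country and Region columns. Values that are not in the dict fall back to the default, so it is worth checking which rows end up as "Global".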