
Get full language name from iso-639 code using pycountry

I am trying to solve the following problem. I have a list of ISO 639 language codes that I retrieved using langdetect with the following code:

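from langdetect import detect

def try_detect(cell):
    # Return the detected language code, or None when detection fails.
    try:
        detected_lang = detect(cell)
    except Exception:
        detected_lang = None
    return detected_lang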

Spotify['language'] = Spotify['artists'].apply(try_detect)
Spotify['language'] = Spotify['language'].str.upper()
Spotify['language'].unique()

which returned

array(['DE', 'PL', 'ES', 'EN', 'NL', 'TR', 'FR', 'IT', 'SK', 'RO', 'SW',
       'FI', 'AF', 'EL', 'ID', 'LT', 'CA', 'TL', 'PT', 'HR', 'RU', 'NO',
       'DA', 'SL', 'CY', 'SQ', 'KO', 'SO', 'CS', 'ET', 'ZH-CN', 'SV',
       'HU', 'LV', 'VI', 'JA', None, 'AR', 'TH', 'BG'], dtype=object)

That would be sufficient, but I'd love to have the full language name in another column, and I cannot seem to get this right. I know that

pycountry.languages.get(alpha_2='FR').name

returns French. I then tried:

Languages = Spotify['language'].unique()
LANG = []
for lang in Languages:
    Lang = pycountry.languages.get(alpha_2=lang).name
    LANG.append(Lang)

but I keep getting the error:

AttributeError: 'NoneType' object has no attribute 'name'

I'm at a loss here. Any help to put me on the right track would be greatly appreciated.

I managed to answer this question myself after noticing that not all values returned by Spotify['language'].unique() are ISO 639 language codes. For None and ZH-CN, pycountry.languages.get returns None, which has no name attribute. I therefore replaced

Languages = Spotify['language'].unique()
LANG = []
for lang in Languages:
    Lang = pycountry.languages.get(alpha_2=lang).name
    LANG.append(Lang)

by

LANG = []
for lang in Languages:
    try:
        Lang = pycountry.languages.get(alpha_2=lang).name
    except:
        Lang = None
    LANG.append(Lang)

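An alternative was offered by @cs95 (thanks a lot) in the comment above, which drops the missing values before looping:

Languages = Spotify['language'].dropna().unique()

This removes the None entry, but ZH-CN still has no match, so the try/except is still needed. Without dropna(), the loop returns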

['German',
 'Polish',
 'Spanish',
 'English',
 'Dutch',
 'Turkish',
 'French',
 'Italian',
 'Slovak',
 'Romanian',
 'Swahili (macrolanguage)',
 'Finnish',
 'Afrikaans',
 'Modern Greek (1453-)',
 'Indonesian',
 'Lithuanian',
 'Catalan',
 'Tagalog',
 'Portuguese',
 'Croatian',
 'Russian',
 'Norwegian',
 'Danish',
 'Slovenian',
 'Welsh',
 'Albanian',
 'Korean',
 'Somali',
 'Czech',
 'Estonian',
 None,
 'Swedish',
 'Hungarian',
 'Latvian',
 'Vietnamese',
 'Japanese',
 None,
 'Arabic',
 'Thai',
 'Bulgarian']
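
Since the codes and the names line up position by position, they can be paired into a lookup table and used to fill the new column. The sketch below uses short stand-in lists (hypothetical sample data) in place of Languages, LANG and the Spotify['language'] column, so that it runs on its own.

```python
# Hypothetical sample of the two parallel lists built above.
codes = ["DE", "ZH-CN", None, "FR"]
names = ["German", None, None, "French"]

# Pair each code with its name, skipping the missing code itself.
code_to_name = {code: name for code, name in zip(codes, names) if code is not None}

# Stand-in for the values of the Spotify['language'] column.
column = ["FR", "DE", None, "ZH-CN", "FR"]

# Codes without an entry (or without a name) come back as None.
full_names = [code_to_name.get(code) for code in column]
print(full_names)  # ['French', 'German', None, None, 'French']
```

With pandas, Spotify['language'].map(code_to_name) should do the equivalent on the real column, leaving unmatched entries missing.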

Note that ZH-CN is not found, because it is a language code with a region suffix rather than a plain ISO 639 code. This has to be fixed manually:

import numpy as np
import pandas as pd

d = {'Language': Languages, 'Language_name': LANG}
LANGUAGE_NAMES = pd.DataFrame(d)
LANGUAGE_NAMES['Language_name'] = np.where(LANGUAGE_NAMES['Language'] == 'ZH-CN', 'Chinese', LANGUAGE_NAMES['Language_name'])

which gives

  Language            Language_name
0        DE                   German
1        PL                   Polish
2        ES                  Spanish
3        EN                  English
4        NL                    Dutch
5        TR                  Turkish
6        FR                   French
7        IT                  Italian
8        SK                   Slovak
9        RO                 Romanian
10       SW  Swahili (macrolanguage)
11       FI                  Finnish
12       AF                Afrikaans
13       EL     Modern Greek (1453-)
14       ID               Indonesian
15       LT               Lithuanian
16       CA                  Catalan
17       TL                  Tagalog
18       PT               Portuguese
19       HR                 Croatian
20       RU                  Russian
21       NO                Norwegian
22       DA                   Danish
23       SL                Slovenian
24       CY                    Welsh
25       SQ                 Albanian
26       KO                   Korean
27       SO                   Somali
28       CS                    Czech
29       ET                 Estonian
30    ZH-CN                  Chinese
31       SV                  Swedish
32       HU                Hungarian
33       LV                  Latvian
34       VI               Vietnamese
35       JA                 Japanese
36     None                     None
37       AR                   Arabic
38       TH                     Thai
39       BG                Bulgarian
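
Instead of patching ZH-CN by hand, the region suffix could be stripped before the lookup. The sketch below is only an illustration: the helper name primary_subtag is hypothetical, and it assumes that the part before the hyphen (ZH) is a code pycountry can resolve, which was not verified here.

```python
def primary_subtag(code):
    """Return the language part of a tag such as 'ZH-CN', or None for None."""
    if code is None:
        return None
    # Split on the first hyphen and keep only the language part.
    return code.split("-", 1)[0]


codes = ["DE", "ZH-CN", None, "EN"]
stripped = [primary_subtag(code) for code in codes]
print(stripped)  # ['DE', 'ZH', None, 'EN']
```

The stripped codes can then be passed to pycountry.languages.get(alpha_2=...) in the same way as the original ones.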


 