
Pandas DataFrame: Converting Column of String into Column of Lists

I have a DataFrame with several columns, one of which looks like this:

print(df.WIN_COUNTRY_CODE[180:200])

           WIN_COUNTRY_CODE
180                        IT
181                        IT
182                        ES
183    DE---UK---UK---UK---UK
184         UK---UK---UK---UK
185         DE---UK---UK---UK
186    UK---UK---DE---UK---UK
187                        SI
188                        UK
189                        FR

Each cell of the column contains one or more country codes per record. I would like to convert the country codes from 2-letter to 3-letter ISO codes and also count how often each country appears, so I applied the following code:

1. I split each string on the three-dash separator ("---") between the country codes to turn the string into a list:

df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")

After this, the column looks like this:

print(df.WIN_COUNTRY_CODE[180:200])

           WIN_COUNTRY_CODE
180                            ['IT']
181                            ['IT']
182                            ['ES']
183    ['DE', 'UK', 'UK', 'UK', 'UK']
184          ['UK', 'UK', 'UK', 'UK']
185          ['DE', 'UK', 'UK', 'UK']
186    ['UK', 'UK', 'DE', 'UK', 'UK']
187                            ['SI']
188                            ['UK']
189                            ['FR']

2. I map the 2-letter codes to 3-letter country codes using a conversion table (cattable), which I first turn into a dictionary (catdict):

catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

However, whenever I apply the mapping, it always returns this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-df7aad8ca868> in <module>
      1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

<ipython-input-13-df7aad8ca868> in <listcomp>(.0)
      1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

TypeError: 'float' object is not iterable


It seems likely that the code raises an error because the entries in the WIN_COUNTRY_CODE column are still strings rather than lists of strings. I learned this by inspecting an element of one of the lists with this code:

df.WIN_COUNTRY_CODE_2[183][0]

It always returns a single character instead of the 2-letter code as a string:

'['

whereas I expected it to return 'DE'.
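
A result of '[' is what indexing returns when a cell holds the text of a list rather than a real list. The sketch below uses made-up data and assumes the column was stored as text at some point (for example after a round trip through a CSV file, which the post does not confirm). It shows how to check the cell type and, if needed, parse the text back into a list with ast.literal_eval.

```python
import ast
import pandas as pd

# Made-up data: each cell is the *text* of a list, not a list object.
s = pd.Series(["['DE', 'UK', 'UK']", "['IT']"])

print(type(s[0]))  # the cell is a str, not a list
print(s[0][0])     # first character of the string, i.e. '['

# ast.literal_eval safely parses the text back into a Python list.
parsed = s.apply(ast.literal_eval)
print(parsed[0][0])  # first element of the real list
```

If type() already reports a list, this step is unnecessary and the problem lies elsewhere, for instance in missing values.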

Question:

How can I convert the WIN_COUNTRY_CODE column from a column of strings into a column of lists? And how can I find the most frequent country in the entire column? Thank you.

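# Work on a copy so the original column is left untouched.
df1 = df.copy()
df1["WIN_COUNTRY_CODE"] = df["WIN_COUNTRY_CODE"].str.split("---")
# Most frequent code within each row (assumes the column has no missing values).
df1["Max_code"] = df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key=x.count))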

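
The snippet above finds the most frequent code within each row. For the most frequent country across the entire column, one possible approach is to flatten the lists with explode and count with value_counts. This is a minimal sketch on a few sample rows taken from the question, not the full dataset, and the variable names are illustrative.

```python
import pandas as pd

# A few sample rows in the same format as the question's column.
df = pd.DataFrame({
    "WIN_COUNTRY_CODE": [
        "IT", "IT", "ES",
        "DE---UK---UK---UK---UK",
        "UK---UK---UK---UK",
        "SI",
    ]
})

# Split each string into a list, then put every code on its own row.
codes = df["WIN_COUNTRY_CODE"].str.split("---").explode()

# Count occurrences over the whole column and pick the top one.
counts = codes.value_counts()
most_frequent = counts.idxmax()

print(counts)
print(most_frequent)
```

Series.explode requires pandas 0.25 or later.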

This might help.

df['new_WIN_COUNTRY_CODE']=df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])

print(df)
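
The "'float' object is not iterable" error in the question usually points to missing values (NaN is a float), and the lambda above would also fail on such cells. A hedged sketch of splitting and mapping that tolerates them follows. The small catdict here is a hypothetical stand-in for the asker's conversion table, and to_iso3 is an illustrative helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"WIN_COUNTRY_CODE": ["IT", np.nan, "DE---UK---UK"]})

# Hypothetical lookup; the real one is built from the asker's spreadsheet.
catdict = {"IT": "ITA", "DE": "DEU", "UK": "GBR"}

def to_iso3(cell):
    # Missing values arrive as float NaN, so return an empty list for them.
    if not isinstance(cell, str):
        return []
    # Split on the separator and keep only codes present in the lookup.
    return [catdict[code] for code in cell.split("---") if code in catdict]

df["mapped"] = df["WIN_COUNTRY_CODE"].apply(to_iso3)
print(df)
```

Codes missing from the lookup are silently dropped here, mirroring the filter in the question's list comprehension.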
