I am trying to do some string grouping on a CSV file using Python pandas DataFrames. My input CSV file looks like this:
my_file.csv:
country_code,zipcode,company_counts
CA,653681,{KFC: 1}
CA,66936,
CA,66936,{Pizza Hut: 1}
CA,66936,{KYD: 1}
CA,66936,{Taco: 1}
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}
What I have so far:
import pandas as pd
dataframes = []
df = pd.read_csv("my_file.csv")
df.dropna().groupby(['country_code','zipcode'],as_index=False)['company_counts'].agg(', '.join)
dataframes.append(df)
print(dataframes[0].head())
What it prints:
  country_code  zipcode  company_counts
0           CA   653681        {KFC: 1}
1           CA    66936             NaN
2           CA    66936  {Pizza Hut: 1}
3           CA    66936        {KYD: 1}
4           CA    66936       {Taco: 1}
What I want (ideal solution or close enough):
country_code,zipcode,company_counts
CA,653681,{KFC: 1},{MCD: 2}
CA,66936,{Pizza Hut: 1},{KYD: 1},{Taco: 1}
CA,722373,{Royal Bank: 1}
As mentioned in the comments:
In [88]: df.dropna().groupby(['country_code','zipcode'])['company_counts'].agg(', '.join)
Out[88]:
country_code  zipcode
CA            66936      {Pizza Hut: 1}, {KYD: 1}, {Taco: 1}
              653681                      {KFC: 1}, {MCD: 2}
              722373                         {Royal Bank: 1}
Name: company_counts, dtype: object
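The code in the question prints the original rows because the result of groupby/agg is never assigned; pandas returns a new object and leaves df untouched. Below is a minimal sketch of keeping and exporting the grouped result. It assumes the sample data from the question (inlined through io.StringIO so it runs on its own); the names CSV_TEXT and grouped are only illustrative.

```python
import io

import pandas as pd

# Inline copy of the sample data from the question, so the sketch is self-contained.
CSV_TEXT = """country_code,zipcode,company_counts
CA,653681,{KFC: 1}
CA,66936,
CA,66936,{Pizza Hut: 1}
CA,66936,{KYD: 1}
CA,66936,{Taco: 1}
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# groupby/agg returns a new object; df itself is unchanged, so keep the result.
grouped = (
    df.dropna(subset=["company_counts"])          # drop rows with an empty company_counts cell
      .groupby(["country_code", "zipcode"])["company_counts"]
      .agg(",".join)                              # concatenate the strings within each group
      .reset_index()                              # turn the group keys back into columns
)

# Write the result as CSV text (pass a file path to to_csv to save it instead).
print(grouped.to_csv(index=False))
```

Because the joined values contain commas, to_csv wraps them in double quotes, so the file will not match the desired output character for character. Joining with a different separator, such as ";", avoids the quoting.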
However, this does not let you group by company (e.g. to total all companies in a country) or sum the counts of different companies, because the company_counts values are read in as strings. To split them into separate columns, you could do something like:
In [91]: df.dropna().apply(lambda s: s[['country_code', 'zipcode']].append(pd.Series(s['company_counts'].strip(' {}').split(':'), index=['company', 'count'])), axis=1)
Out[91]:
  country_code  zipcode     company  count
0           CA   653681         KFC      1
2           CA    66936   Pizza Hut      1
3           CA    66936         KYD      1
4           CA    66936        Taco      1
5           CA   653681         MCD      2
6           CA   722373  Royal Bank      1
This is far easier to work with. Pandas is not designed to manage something like lists or dicts inside a single data cell.
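The row-wise apply above relies on Series.append, which was removed in pandas 2.0, and it leaves the counts as strings. As an alternative, here is a rough sketch that does the same split with the vectorized Series.str.extract and converts the counts to integers. It assumes every non-empty cell has exactly the form "{company: integer}"; the regular expression and the names CSV_TEXT, parts, tidy and totals are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the sample data from the question, so the sketch is self-contained.
CSV_TEXT = """country_code,zipcode,company_counts
CA,653681,{KFC: 1}
CA,66936,
CA,66936,{Pizza Hut: 1}
CA,66936,{KYD: 1}
CA,66936,{Taco: 1}
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}
"""

df = pd.read_csv(io.StringIO(CSV_TEXT)).dropna(subset=["company_counts"])

# Assumed cell format: "{<company>: <integer>}" with exactly one pair per cell.
# Named groups become the column names of the extracted DataFrame.
parts = df["company_counts"].str.extract(
    r"^\{\s*(?P<company>.+?)\s*:\s*(?P<count>\d+)\s*\}$"
)

# Attach the parsed columns to the key columns (aligned on the row index).
tidy = df[["country_code", "zipcode"]].join(parts)

# Cells that do not match the assumed format would yield NaN and make this cast fail.
tidy["count"] = tidy["count"].astype(int)

# With one company per row and numeric counts, ordinary aggregations work.
totals = tidy.groupby(["country_code", "zipcode"])["count"].sum()

print(tidy)
print(totals)
```

With the data in this shape, grouping by company instead of zipcode is just a matter of changing the groupby keys.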