
How to group string values in a Python pandas DataFrame?

I am trying to group string values from a CSV file using a pandas DataFrame. My input CSV file looks like this:

my_file.csv:

country_code,zipcode,company_counts
CA,653681,{KFC: 1} 
CA,66936,
CA,66936,{Pizza Hut: 1} 
CA,66936,{KYD: 1} 
CA,66936,{Taco: 1} 
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}

What I have so far:

import pandas as pd

dataframes = []
df = pd.read_csv("my_file.csv")
df.dropna().groupby(['country_code','zipcode'],as_index=False)['company_counts'].agg(', '.join)
dataframes.append(df)
print(dataframes[0].head())

What it prints:

  country_code  zipcode  company_counts
0           CA   653681        {KFC: 1}
1           CA    66936             NaN
2           CA    66936  {Pizza Hut: 1}
3           CA    66936        {KYD: 1}
4           CA    66936       {Taco: 1}

What I want (ideal solution or close enough):

country_code,zipcode,company_counts
CA,653681,{KFC: 1},{MCD: 2}
CA,66936,{Pizza Hut: 1},{KYD: 1},{Taco: 1}
CA,722373,{Royal Bank: 1}

As mentioned in the comments:

In [88]: df.dropna().groupby(['country_code','zipcode'])['company_counts'].agg(', '.join)
Out[88]: 
country_code  zipcode
CA            66936      {Pizza Hut: 1}, {KYD: 1}, {Taco: 1}
              653681                      {KFC: 1}, {MCD: 2}
              722373                         {Royal Bank: 1}
Name: company_counts, dtype: object
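The code in the question prints the original rows because `groupby(...).agg(...)` returns a new object and leaves `df` unchanged, so the result has to be assigned before it is used. Below is a minimal sketch of that fix. It assumes an inline copy of the sample data (with the last row taken as `{Royal Bank: 1}`, as in the output above); the names `csv_text` and `grouped` are illustrative, not from the original post.

```python
import io
import pandas as pd

# Inline copy of the sample data so the sketch runs on its own.
csv_text = """country_code,zipcode,company_counts
CA,653681,{KFC: 1}
CA,66936,
CA,66936,{Pizza Hut: 1}
CA,66936,{KYD: 1}
CA,66936,{Taco: 1}
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}
"""

df = pd.read_csv(io.StringIO(csv_text))

# groupby/agg does not modify df in place, so keep the returned object.
# sort=False keeps groups in order of first appearance, as in the desired output.
grouped = (
    df.dropna(subset=["company_counts"])
      .groupby(["country_code", "zipcode"], sort=False)["company_counts"]
      .agg(",".join)
      .reset_index()
)

print(grouped)
```

If the result is written back out with `grouped.to_csv(index=False)`, the joined field will be wrapped in double quotes, because it contains the delimiter; that is standard CSV quoting rather than an error.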

However, this does not let you group by company (e.g., to total all companies in a country) or sum the counts of different companies, because the company_counts values are read in as strings. To convert them, you could do something like

In [91]: df.dropna().apply(lambda s: s[['country_code', 'zipcode']].append(pd.Series(s['company_counts'].strip(' {}').split(':'), index=['company', 'count'])), axis=1)
Out[91]: 
  country_code  zipcode     company count
0           CA   653681         KFC     1
2           CA    66936   Pizza Hut     1
3           CA    66936         KYD     1
4           CA    66936        Taco     1
5           CA   653681         MCD     2
6           CA   722373  Royal Bank     1

This is far easier to work with. Pandas is not designed to manage something like lists inside a data cell.
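Note that `Series.append` was removed in pandas 2.0, so the row-wise `apply` shown above only runs on older versions. The following is a sketch of one vectorised alternative using `Series.str.extract`, not the answer's original method. It assumes every non-empty cell holds exactly one `{name: count}` pair; the regular expression and the names `pattern`, `tidy`, `totals`, and `per_zip` are illustrative.

```python
import io
import pandas as pd

# Inline copy of the sample data so the sketch runs on its own.
csv_text = """country_code,zipcode,company_counts
CA,653681,{KFC: 1}
CA,66936,
CA,66936,{Pizza Hut: 1}
CA,66936,{KYD: 1}
CA,66936,{Taco: 1}
CA,653681,{MCD: 2}
CA,722373,{Royal Bank: 1}
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumed cell format: "{<company>: <count>}", one pair per cell.
pattern = r"\{\s*(?P<company>.+?)\s*:\s*(?P<count>\d+)\s*\}"

# Named groups become the columns "company" and "count"; empty cells give NaN.
parsed = df["company_counts"].str.extract(pattern)

tidy = (
    pd.concat([df[["country_code", "zipcode"]], parsed], axis=1)
      .dropna(subset=["company"])   # drop rows that had no company entry
      .astype({"count": int})       # counts become real integers
)

# With a numeric count column, ordinary aggregations work.
totals = tidy.groupby("country_code")["count"].sum()
per_zip = tidy.groupby(["country_code", "zipcode"])["count"].sum()

print(tidy)
print(per_zip)
```

Because `count` is now an integer column rather than part of a string, sums per country, per zipcode, or per company are a single `groupby` away.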
