简体   繁体   中英

pandas.DataFrame: aggregate rows based on regex

What I want to do

I have a trouble to clean my data because some values were not input correctly.

import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)

print(df)
## Current output!!!!
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#102: WWW            4        8
#101: foo foo        5       10

## DO SOMETHING!!!!

print(df)
## Expected output!!!!
#           column1  column2
#100: Test        2        4
#101: FOO         8       16
#102: WWW         4        8

My DataFrame.index consists of "ID" + "Name". However, names are not correct, so one ID may show up in more than one row.

Two requests

  • Sum up rows with the same ID.
  • Choose one name for the result. (For example, I can use either "Test" or "test" for ID=100.)

What I tried

I tried to use groupby function, but it doesn't seem to have regex compatibility.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

df2 = df.groupby(level=0).sum()
print(df2)
## Output
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#101: foo foo        5       10
#102: WWW            4        8

Environment

Python 3.10.5 Pandas 1.4.3

Your expected output for Test does not reflect that you are trying to do a summation, but from what I can gather this is what you want. groupby can take a function or a mapping or even a series as the by argument. Here, you just want the lowercase version of the index:

df.groupby(df.index.str.lower()).sum()

which gives

           column1  column2
100: test        3        6
101: foo         8       16
102: www         4        8

Here, what I've done is passed it the lowercase index, and it simply groups the rows based on matching elements in the series.

Edit

Based on the updated question, to match the numbers, you can use regular expressions:

df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()

which gives

     column1  column2
100        3        6
101        8       16
102        4        8

It isn't clear what would take precedence 101: foo foo or 101: FOO , it seems like the numbers here are the important part regardless.

import numpy as np
import pandas as pd

# Data Import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)

# Data Pre-process
df.reset_index(inplace=True)
df.rename(columns={'index':'ID_Name'},inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID','ID_Name'],inplace=True)
df_group = df.groupby(['ID'])[['column1','column2']].sum().reset_index()
df_group
df = pd.merge(df,df_group,how='left',left_on='ID',right_on='ID')
df_final = df.groupby(['ID']).first()

# Data Clean Process
df_final.rename(columns={'column1_y':'column1','column2_y':'column2'},inplace= True)
df_final.drop(['column1_x','column2_x'],axis = 1 , inplace=True)

# Output Display
df_final

Hi Dmjy,

I have attached the code for you, please try from your side, and if you still have any question please let me know

Thanks Leon

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM