pandas.DataFrame: Aggregate rows based on a regular expression

Question

What I want to do

I'm having trouble cleaning my data because some values were not entered correctly.

import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)

print(df)
## Current output!!!!
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#102: WWW            4        8
#101: foo foo        5       10

## DO SOMETHING!!!!

print(df)
## Expected output!!!!
#           column1  column2
#100: Test        2        4
#101: FOO         8       16
#102: WWW         4        8

My DataFrame.index consists of "ID" + "Name". However, the names are not consistent, so one ID may show up in more than one row.

Two requests

Sum up rows with the same ID.
Choose one name for the result. (For example, I can use either "Test" or "test" for ID=100.)

What I tried

I tried to use the groupby function, but it doesn't seem to support regex.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

df2 = df.groupby(level=0).sum()
print(df2)
## Output
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#101: foo foo        5       10
#102: WWW            4        8

Environment

Python 3.10.5, Pandas 1.4.3

Answer 1

Your expected output for Test does not reflect a summation (column1 would be 1 + 2 = 3, not 2), but from what I can gather this is what you want. groupby can take a function, a mapping, or even a Series as the by argument. Here, you just want the lowercase version of the index:

df.groupby(df.index.str.lower()).sum()

which gives

           column1  column2
100: test        3        6
101: foo         8       16
102: www         4        8

Here, I passed it the lowercased index, and it simply groups the rows based on matching elements in the series.

Edit

Based on the updated question, to match the numbers, you can use regular expressions:
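
As a rough illustration of the three forms of the by argument mentioned above, the sketch below groups the question's current data (including the '101: foo foo' label) by lowercased label in three ways. The variable names are illustrative and not part of the original answer.

```python
import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
df = pd.DataFrame(data, index=index, columns=['column1', 'column2'])

# 1. A function: pandas calls it on every index label to get the group key.
by_function = df.groupby(str.lower).sum()

# 2. A mapping: index label -> group key.
mapping = {label: label.lower() for label in df.index}
by_mapping = df.groupby(mapping).sum()

# 3. An array-like with one key per row (here, the lowercased index itself).
by_index = df.groupby(df.index.str.lower()).sum()

print(by_index)
```

With this data, lowercasing merges the two ID 100 rows but leaves '101: foo' and '101: foo foo' as separate groups, so grouping on the label text alone is not enough once the names differ by more than case.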

df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()

which gives

     column1  column2
100        3        6
101        8       16
102        4        8

It isn't clear which name should take precedence, 101: foo foo or 101: FOO, but it seems the numbers are the important part here regardless.
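
If a name is still wanted in the result, one possible approach is to sum by the extracted ID and then relabel each group with the first label seen for that ID. This is only a sketch: the "first label wins" rule is an assumption (the question allows any of the names), and the names ids, summed, and first_label are illustrative.

```python
import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
df = pd.DataFrame(data, index=index, columns=['column1', 'column2'])

# Extract the digits before the colon from each index label.
ids = df.index.str.extract(r"(\d+):", expand=False)

# Request 1: sum the rows that share an ID.
summed = df.groupby(ids).sum()

# Request 2: choose one name per ID (assumption: the first label that appears).
first_label = pd.Series(df.index.to_numpy(), index=ids).groupby(level=0).first()

# Replace the bare IDs in the result index with the chosen full labels.
summed.index = first_label.reindex(summed.index).to_numpy()

print(summed)
```

Swapping first() for last() or another aggregation changes which name is kept without affecting the sums.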

Answer 2

import pandas as pd

# Data import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)

# Pre-processing: move the index into a column and split off the ID
df.reset_index(inplace=True)
df.rename(columns={'index': 'ID_Name'}, inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID', 'ID_Name'], inplace=True)

# Sum the value columns per ID
df_group = df.groupby(['ID'])[['column1', 'column2']].sum().reset_index()

# Attach the per-ID sums to every row, then keep the first row of each ID
df = pd.merge(df, df_group, how='left', left_on='ID', right_on='ID')
df_final = df.groupby(['ID']).first()

# Clean-up: keep the summed columns (suffix _y), drop the originals (suffix _x)
df_final.rename(columns={'column1_y': 'column1', 'column2_y': 'column2'}, inplace=True)
df_final.drop(['column1_x', 'column2_x'], axis=1, inplace=True)

# Output display
print(df_final)

Hi Dmjy,

I have attached the code for you. Please try it on your side, and let me know if you still have any questions.

Thanks, Leon

Answer 1: 2022-08-31 15:15:23
Answer 2: 2022-08-31 15:39:57
