pandas.DataFrame: aggregate rows based on regex
I am having trouble cleaning my data because some values were not entered correctly.
import pandas as pd
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Current output!!!!
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#102: WWW            4        8
#101: foo foo        5       10
## DO SOMETHING!!!!
print(df)
## Expected output!!!!
#           column1  column2
#100: Test        2        4
#101: FOO         8       16
#102: WWW         4        8
My DataFrame.index consists of "ID" + "Name". However, the names are inconsistent, so one ID may show up in more than one row.
Two requests
I tried to use the groupby function, but it doesn't seem to support regex.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
df2 = df.groupby(level=0).sum()
print(df2)
## Output
#              column1  column2
#100: Test           1        2
#100: test           2        4
#101: FOO            3        6
#101: foo foo        5       10
#102: WWW            4        8
Python 3.10.5 Pandas 1.4.3
The expected output for the Test row (2, 4) is not the sum of the two matching rows (which would be 3, 6), but from what I can gather, a summation is what you want. groupby can take a function, a mapping, or even a Series as the by argument. Here, you just want the lowercase version of the index:
df.groupby(df.index.str.lower()).sum()
which gives
           column1  column2
100: test        3        6
101: foo         8       16
102: www         4        8
Here, I passed groupby the lowercased index, and it simply groups the rows whose lowercased labels match.
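To make the point about the by argument concrete, here is a minimal sketch of the three forms (a function, a mapping, and an array-like of keys) applied to the question's data. The variable names are only illustrative, and the three calls are expected to produce the same grouping.

```python
import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
df = pd.DataFrame(data, index=index, columns=['column1', 'column2'])

# 1) A function: called once per index label; the return value is the group key.
by_function = df.groupby(lambda label: label.lower()).sum()

# 2) A mapping: index label -> group key.
label_to_key = {label: label.lower() for label in df.index}
by_mapping = df.groupby(label_to_key).sum()

# 3) An array-like of keys, one entry per row.
by_keys = df.groupby(df.index.str.lower()).sum()

print(by_function)
```

Note that with the index exactly as posted in the question, lowercasing alone still leaves '101: foo foo' in its own group, since it differs from '101: foo' by more than case.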
Based on the updated question, to match on the numbers, you can use regular expressions:
df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()
which gives
     column1  column2
100        3        6
101        8       16
102        4        8
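Before grouping on a regex, it can help to look at the extracted keys themselves. The sketch below does that. The extra 'no id here' row is a hypothetical malformed label added only for illustration. Labels that do not match the pattern get a missing key, and groupby leaves such rows out by default (dropna=True, available since pandas 1.1).

```python
import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10], [6, 12]]
# 'no id here' is a hypothetical malformed label, not part of the original data.
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo', 'no id here']
df = pd.DataFrame(data, index=index, columns=['column1', 'column2'])

# One key per row: the digits before the colon, or a missing value if no match.
keys = df.index.str.extract(r"(\d+):", expand=False)
print(keys)

# Labels the pattern could not parse; worth checking before aggregating.
unmatched = df.index[keys.isna()]
print(unmatched.tolist())

# Rows with a missing key are dropped from the result by default.
result = df.groupby(keys).sum()
print(result)
```

Checking unmatched first avoids silently losing rows when some labels do not follow the "ID: Name" pattern.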
It isn't clear which label should take precedence, 101: foo foo or 101: FOO, but the numbers seem to be the important part here regardless.
import numpy as np
import pandas as pd
# Data Import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
# Data Pre-process
df.reset_index(inplace=True)
df.rename(columns={'index':'ID_Name'},inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID','ID_Name'],inplace=True)
df_group = df.groupby(['ID'])[['column1','column2']].sum().reset_index()
df_group
df = pd.merge(df,df_group,how='left',left_on='ID',right_on='ID')
df_final = df.groupby(['ID']).first()
# Data Clean Process
df_final.rename(columns={'column1_y':'column1','column2_y':'column2'},inplace= True)
df_final.drop(['column1_x','column2_x'],axis = 1 , inplace=True)
# Output Display
df_final
Hi Dmjy,
I have attached the code for you. Please try it on your side, and let me know if you still have any questions.
Thanks, Leon