How to group Pandas data frame by column with regex match
I have the following data frame:
import pandas as pd
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
                   })
df.set_index('id',inplace=True)
df
Which looks like this:
Out[6]:
XX_111_S5_R12_001_Mobile_05 YY_222_S00_R12_001_1-999_13 ZZ_111_S00_R12_001_1-999_13
id
a -14 -103 1.0
b -90 0 2.3
c -90 -110 3.0
d -96 -114 5.0
e -91 -114 6.0
What I want to do is group the columns based on the following regex:
\w+_\w+_\w+_\d+_([\w\d-]+)_\d+
So that in the end the columns are grouped by Mobile and 1-999.
What is the way to do it? I tried this, but it fails to group them:
import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)
Which prints:
XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13
What we want is for name to print:
Mobile
1-999
1-999
And group prints the corresponding data frame.
You can use .str.extract on the columns to extract the substrings for your groupby:
# Performing the groupby.
# Note: axis=1 in groupby is deprecated since pandas 2.1.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)
# Showing group information.
for name, group in grouped:
    print(name)
    print(group, '\n')
Which returns the expected groups:
1-999
YY_222_S00_R12_001_1-999_13 ZZ_111_S00_R12_001_1-999_13
id
a -103 1.0
b 0 2.3
c -110 3.0
d -114 5.0
e -114 6.0
Mobile
XX_111_S5_R12_001_Mobile_05
id
a -14
b -90
c -90
d -96
e -91
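On newer pandas releases, where groupby(..., axis=1) is deprecated, the same grouping can probably be obtained by grouping the transposed frame instead. The following is only a sketch of that variant, not part of the original answer; the dictionary name groups is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
    'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
    'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
}).set_index('id')

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
# One key per column, taken from the capture group of the pattern
keys = df.columns.str.extract(pat, expand=False)

# Group the rows of the transposed frame (the original columns) by key,
# then transpose each piece back so it looks like a slice of df.
# Caveat: transposing mixed int/float columns upcasts the values to float.
groups = {name: part.T for name, part in df.T.groupby(keys)}

for name, group in groups.items():
    print(name)
    print(group, '\n')
```

The keys come out in sorted order, so 1-999 is listed before Mobile, as in the output above.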
After grouping, set the index of the new data frame to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns] (['Mobile', '1-999', '1-999']).
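The keys can be checked with the standard re module alone, which also suggests why the attempt in the question did not group anything: calling .group() with no argument returns the whole matched text, whereas .group(1) returns only the part captured by the parentheses. A small sketch, with the list names whole and keys chosen here for illustration:

```python
import re

columns = [
    'XX_111_S5_R12_001_Mobile_05',
    'YY_222_S00_R12_001_1-999_13',
    'ZZ_111_S00_R12_001_1-999_13',
]
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'

# .group() is the same as .group(0): the entire matched text
whole = [re.search(pat, col).group() for col in columns]
# .group(1) is only the text captured by the first pair of parentheses
keys = [re.search(pat, col).group(1) for col in columns]

print(whole)
print(keys)
```

Because the whole match is the full column name here, every column ends up in its own group when .group() is used as the key.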
There are some issues with your regex: \w matches word characters, which include the underscore, and that doesn't seem to be what you want. If you just want to match letters and digits (plus the hyphen), using A-Za-z0-9- would be better:
df.groupby(df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False), axis=1).sum()
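For pandas versions that no longer accept axis=1 in groupby, one possible equivalent is to sum over the transposed frame and transpose the result back. This is a hedged sketch rather than the answer's own code; the name summed is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
    'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
    'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
}).set_index('id')

# Key = the letters/digits/hyphen token just before the trailing "_<digits>"
keys = df.columns.str.extract(r'([A-Za-z0-9-]+)_\d+$', expand=False)

# Transpose so the original columns become rows, sum per key, transpose back.
# The result has one column per key and one row per id.
summed = df.T.groupby(keys).sum().T
print(summed)
```

Each resulting column holds the row-wise sum of all original columns that share a key, so the 1-999 column adds the YY and ZZ columns together.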