How to group Pandas data by column using regex matching

Question

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
                   'XX_111_S5_R12_001_Mobile_05':[-14,-90,-90,-96,-91],
                   'YY_222_S00_R12_001_1-999_13':[-103,0,-110,-114,-114],
                   'ZZ_111_S00_R12_001_1-999_13':[1,2.3,3,5,6],
})

df.set_index('id',inplace=True)
df

Which looks like this:

Out[6]:
    XX_111_S5_R12_001_Mobile_05  YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id
a                           -14                         -103                          1.0
b                           -90                            0                          2.3
c                           -90                         -110                          3.0
d                           -96                         -114                          5.0
e                           -91                         -114                          6.0

What I want to do is group the columns based on the following regex:

\w+_\w+_\w+_\d+_([\w\d-]+)_\d+

so that in the end the columns are grouped by Mobile and 1-999.

How can I do this? I tried the following, but it fails to group them:

import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)

Which prints the following for name:

XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13

What I want is for name to print:

Mobile
1-999
1-999

and for group to print the corresponding data frame.

Answer 1

You can use .str.extract on the columns to extract the substrings for your groupby:

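# Performing the groupby. The raw string keeps the regex escapes intact,
# and expand=False makes extract return an Index of group labels.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)

# Showing group information.
for name, group in grouped:
    print(name)
    print(group, '\n')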

Which returns the expected groups:

1-999
    YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id                                                          
a                          -103                          1.0
b                             0                          2.3
c                          -110                          3.0
d                          -114                          5.0
e                          -114                          6.0 

Mobile
    XX_111_S5_R12_001_Mobile_05
id                             
a                           -14
b                           -90
c                           -90
d                           -96
e                           -91
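
For readers on newer pandas, where the axis argument of DataFrame.groupby is deprecated (as of pandas 2.1), the same grouping can be sketched by transposing first. This is a minimal sketch, not part of the original answer: the names keys and groups are illustrative, and note that transposing a frame with mixed int and float columns upcasts the values to float.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
    'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
    'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
}).set_index('id')

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'

# One label per column, taken from the single capture group.
keys = df.columns.str.extract(pat, expand=False)

# Group the rows of the transposed frame (the original columns),
# then transpose each piece back so it looks like a slice of df.
# "groups" is an illustrative name, not from the original answer.
groups = {name: part.T for name, part in df.T.groupby(keys)}

for name, group in groups.items():
    print(name)
    print(group, '\n')
```

Each value in groups has the original id index and only the columns that share that label.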

Answer 2

After grouping, set the index of the new data frame to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns], which evaluates to ['Mobile', '1-999', '1-999'].
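
The list comprehension above can be checked on its own, and doing so also shows why the attempt in the question returned the full column names: Match.group() with no argument returns the entire match, while group(1) returns only the first capture group. A small sketch using only the standard re module (the variable names are illustrative):

```python
import re

columns = [
    'XX_111_S5_R12_001_Mobile_05',
    'YY_222_S00_R12_001_1-999_13',
    'ZZ_111_S00_R12_001_1-999_13',
]
pat = re.compile(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+')

# group() is group(0): the whole matched text, here the full column name.
whole = [pat.search(c).group() for c in columns]

# group(1) is the text captured by the first pair of parentheses.
captured = [pat.search(c).group(1) for c in columns]

# With exactly one capture group, findall returns the captured strings.
found = [pat.findall(c)[0] for c in columns]

print(whole)
print(captured)
print(found)
```

Either captured or found can be passed to groupby as the list of group labels.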

Answer 3

Your regex has a problem: \w matches word characters, which include the underscore, and that doesn't seem to be what you want. If you only want to match letters, digits and hyphens, a character class such as [A-Za-z0-9-] is a better fit:

df.groupby(df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False), axis=1).sum()
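
As a hedged illustration of the shorter pattern, the sketch below sums the columns within each group. It assumes a newer pandas where axis=1 in groupby is deprecated, so it groups the transposed frame and transposes the result back. The names keys and summed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
    'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
    'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
}).set_index('id')

# The class excludes "_", so the capture is exactly the segment
# that sits before the final "_<digits>" suffix.
keys = df.columns.str.extract(r'([A-Za-z0-9-]+)_\d+$', expand=False)

# Sum the original columns that share a label: one output column per group.
summed = df.T.groupby(keys).sum().T

print(summed)
```

The result has one column per label, with the two 1-999 columns added together row by row.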

3 answers

Answer 1: score 6, accepted, 2017-03-27 02:08:30
Answer 2: score 1, 2017-03-27 01:48:30
Answer 3: score 1, 2017-03-27 02:10:35
