How to group Pandas data frame by column with regex match

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
                   'XX_111_S5_R12_001_Mobile_05':[-14,-90,-90,-96,-91],
                   'YY_222_S00_R12_001_1-999_13':[-103,0,-110,-114,-114],
                   'ZZ_111_S00_R12_001_1-999_13':[1,2.3,3,5,6],
})

df.set_index('id',inplace=True)
df

Which looks like this:

Out[6]:
    XX_111_S5_R12_001_Mobile_05  YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id
a                           -14                         -103                          1.0
b                           -90                            0                          2.3
c                           -90                         -110                          3.0
d                           -96                         -114                          5.0
e                           -91                         -114                          6.0

What I want to do is group the columns based on the following regex:

\w+_\w+_\w+_\d+_([\w\d-]+)_\d+

So that in the end the columns are grouped by Mobile and 1-999.

What is the way to do it? I tried this, but it fails to group them:

import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)

Which prints:

XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13

What I want is for name to print:

Mobile
1-999
1-999

And group prints the corresponding data frame.

You can use .str.extract on the columns to extract the substrings for your groupby:

# Performing the groupby.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)

# Showing group information.
for name, group in grouped:
    print(name)
    print(group, '\n')

Which returns the expected groups:

1-999
    YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id                                                          
a                          -103                          1.0
b                             0                          2.3
c                          -110                          3.0
d                          -114                          5.0
e                          -114                          6.0 

Mobile
    XX_111_S5_R12_001_Mobile_05
id                             
a                           -14
b                           -90
c                           -90
d                           -96
e                           -91 
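The attempt in the question printed the full column names because Match.group() with no argument returns the whole match; the captured part is group(1). The sketch below shows that fix in Python 3. It also groups the transposed frame rather than passing axis=1, on the assumption that you are on a recent pandas release where axis=1 in groupby is deprecated. The names key and groups are illustrative, not part of the original code.

```python
import re
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
                   }).set_index('id')

pat = re.compile(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+')

def key(col):
    # group(1) is the captured substring; group() would be the whole match.
    return pat.search(col).group(1)

# After transposing, the original column names are the index labels,
# so the key function is applied to each of them.
grouped = df.T.groupby(key)

# Transpose each group back so it has the original orientation.
# Note: the round trip through .T upcasts the integer columns to float here.
groups = {name: group.T for name, group in grouped}

for name, group in groups.items():
    print(name)
    print(group, '\n')
```
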

After grouping, set the index of the new data frame to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns] (that is, ['Mobile', '1-999', '1-999']).
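One possible reading of this suggestion, sketched under assumptions: build the labels with re.findall, use them as the column labels of a copy, and then group on those labels. The names labels, relabelled and counts are hypothetical, and grouping the transposed frame by level is my choice here, not something the answer specifies.

```python
import re
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
                   }).set_index('id')

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'

# With a single capture group, re.findall returns the captured strings.
labels = [re.findall(pat, col)[0] for col in df.columns]

# Work on a copy so the original column names are preserved.
relabelled = df.copy()
relabelled.columns = labels

# Columns sharing a label now fall into the same group.
counts = relabelled.T.groupby(level=0).size().to_dict()
print(labels)
print(counts)
```
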

Your regex has an issue: \w matches word characters, which include the underscore, and that doesn't seem to be what you want. If you only want to match letters, digits and hyphens, a character class such as A-Za-z0-9- would be better:

df.groupby(df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False), axis=1).sum()
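As a rough check of what the summed result looks like, here is a small sketch using the same shorter pattern. It assumes a pandas version where str.extract accepts expand=False, and it sums over the transposed frame instead of using axis=1; the names keys and summed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
                   }).set_index('id')

# Capture the token that sits just before the trailing "_<digits>".
keys = df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False)

# Sum the columns within each group, then transpose back so that
# rows are ids and columns are the group labels.
summed = df.T.groupby(keys).sum().T
print(summed)
```

The two 1-999 columns are added together row by row (for example -103 + 1 for id a), while Mobile has a single column and keeps its values.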
