Python: data cleaning with Pandas

Question

My dataset looks like this:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
S       S         454
S       SO        54
S       SH        784
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566
L       L         78
L       LP        12 
L       LX        655

I want to keep only those groups of rows whose "Subset" values follow the patterns * , *B , *C (note: * is any single letter or digit).

For example, the output should look like this:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566

P.S. The actual dataset can have many more columns and rows.

Answer 1

According to your definition, the rows with Subset 'S' or 'L' should also be in the output, because each of them matches the single-character pattern.

This code should work:

# Assumes `df` is the DataFrame shown in the question.
# List of allowed regex patterns; easy to extend.
# Note: '.' matches any character, not only letters and digits.
allow = [
    '.',
    '.B',
    '.C'
]

# Wrap each pattern in parentheses (a regex group)
allow = [f'({a})' for a in allow]

# Join the groups with the pipe (alternation) operator, so that any one of them may match
allow = '|'.join(allow)

# Select only the rows whose 'Subset' value matches one of the patterns in full
df = df.loc[df.Subset.str.match(f'^({allow})$')]
print(df)
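
For a quick check, here is a minimal self-contained sketch that applies the same pattern to the sample data from the question. The DataFrame construction and the name row_filtered are my assumptions, not part of the original post. Because the filter works row by row, the single-character S and L rows stay in the result while SO, SH, LP and LX are dropped.

```python
import pandas as pd

# Sample data copied from the question; building it by hand is an assumption
# about how the real data would be loaded.
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

# Same alternation as above: '.', '.B' or '.C', anchored at both ends.
allow = "|".join(f"({p})" for p in [".", ".B", ".C"])

# Row-wise filter: each row is judged on its own, not per Name group.
row_filtered = df.loc[df["Subset"].str.match(f"^({allow})$")]
print(row_filtered)
```

If whole groups have to be dropped whenever one of their rows fails, the check needs to be applied per Name group rather than per row.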

Answer 2

Try this:

new_df = df.groupby('Name').filter(lambda group: group['Subset'].str.match(r'^[a-zA-Z0-9]([BC])?$').all())

Output:

>>> new_df
   Name Subset  Value
0     A      A     14
1     A     AB    187
2     A     AC   5465
6     X      X    455
7     X     XB   4545
8     X     XC    854
9     Y      Y     45
10    Y     YB     98
11    Y     YC   4566

Explanation:

Split the dataframe into groups, one per unique Name, so the first group contains all A rows, and so on.
Keep only the groups in which ALL values of the Subset column (that is what .all() checks) match the regular expression ^[a-zA-Z0-9]([BC])?$ , which reads: one character from a through z , A through Z , or 0 through 9 at the beginning of the string ( ^ ), optionally followed by one of B or C , at the end of the string ( $ ).
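
The two steps above can be reproduced end to end with the sample data from the question. This is only a sketch: the DataFrame construction is an assumption, and the second variant (a boolean mask broadcast per group with transform('all')) is an alternative I am adding for comparison, not part of the original one-liner. The names matches, keep and new_df_mask are illustrative.

```python
import pandas as pd

# Sample data copied from the question (hand-built here as an assumption).
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

pattern = r"^[a-zA-Z0-9]([BC])?$"

# Variant 1: the groupby/filter call from the answer.
# A group survives only if every one of its Subset values matches.
new_df = df.groupby("Name").filter(
    lambda group: group["Subset"].str.match(pattern).all()
)

# Variant 2 (alternative, not from the answer): compute the row-wise match
# once, then mark each row with whether its whole Name group matched.
matches = df["Subset"].str.match(pattern)
keep = matches.groupby(df["Name"]).transform("all")
new_df_mask = df[keep]

print(new_df)
print(new_df.equals(new_df_mask))
```

Both variants keep the A, X and Y groups and drop S and L. The mask variant avoids calling a Python lambda once per group, which may matter when there are many groups.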

Answer 1
1 2022-01-02 17:41:35

Answer 2
0 2022-01-02 18:07:10
