Python Data Cleaning with Pandas

My dataset looks like this:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
S       S         454
S       SO        54
S       SH        784
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566
L       L         78
L       LP        12 
L       LX        655 

I want to keep only those groups of rows whose "Subset" values follow the pattern *, *B, *C (note: * is any single letter or digit).

For example, the output should look like:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566

P.S. The actual dataset can have many more columns and rows.
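For anyone who wants to reproduce this, here is a minimal sketch that builds the sample data above as a DataFrame. The variable name `df` is only an assumption; the real data would normally be loaded from a file instead.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})
print(df)
```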

According to your definition, the rows with Subset 'S' or 'L' should also be in the output, since each of them matches a single character.

This code should work:

# df is assumed to be the DataFrame from the question.
# Create a list of allowed regex patterns. Easy to extend.
allow = [
    '.',
    '.B',
    '.C'
]

# Wrap each pattern in (), i.e. a regex group
allow = [f'({a})' for a in allow]

# Join the groups with the pipe operator, so that any one of them may match
allow = '|'.join(allow)

# Now select only the rows where 'Subset' matches the whole pattern.
df = df.loc[df.Subset.str.match(f'^({allow})$')]
print(df)
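To make the row-by-row behaviour concrete, here is a small self-contained sketch that runs the same filter on the sample data from the question. The DataFrame construction and the name `row_filtered` are assumptions added for illustration. Note that the single-character rows 'S' and 'L' survive, because each row is tested on its own rather than per Name group.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption)
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

# Same pattern construction as in the answer: (.)|(.B)|(.C)
allow = '|'.join(f'({a})' for a in ['.', '.B', '.C'])

# Row-wise filter: each Subset value is checked independently
row_filtered = df.loc[df.Subset.str.match(f'^({allow})$')]
print(row_filtered['Subset'].tolist())
# ['A', 'AB', 'AC', 'S', 'X', 'XB', 'XC', 'Y', 'YB', 'YC', 'L']
```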

Try this:

new_df = df.groupby('Name').filter(lambda group: group['Subset'].str.match(r'^[a-zA-Z0-9]([BC])?$').all())

Output:

>>> new_df
   Name Subset  Value
0     A      A     14
1     A     AB    187
2     A     AC   5465
6     X      X    455
7     X     XB   4545
8     X     XC    854
9     Y      Y     45
10    Y     YB     98
11    Y     YC   4566

Explanation:

  1. Split the dataframe into groups, one per unique Name, so the first group contains all the A rows, and so on.
  2. Keep only the groups where ALL values (hence .all()) of the Subset column match the regular expression ^[a-zA-Z0-9]([BC])?$, which reads: one of a through z, A through Z, or 0 through 9 at the beginning of the string (^), then optionally one of B or C, at the end of the string ($).
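The two steps above can also be spelled out separately, which may make the logic easier to inspect. This is only a sketch of an equivalent approach, not the code from the answer: it uses `transform('all')` instead of `filter`, and the DataFrame construction plus the names `row_ok` and `group_ok` are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption)
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

# Per-row check: does this Subset value match the pattern?
row_ok = df['Subset'].str.match(r'^[a-zA-Z0-9][BC]?$')

# Per-group check: broadcast "all rows in this Name group matched" back to every row
group_ok = row_ok.groupby(df['Name']).transform('all')

# Keep only rows that belong to fully matching groups
new_df = df[group_ok]
print(new_df)
```

Keeping `row_ok` and `group_ok` as separate boolean Series makes it possible to check which individual rows caused a group to be dropped.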
