Python Data Cleaning with Pandas
My dataset looks like this:
Name Subset Value
A A 14
A AB 187
A AC 5465
S S 454
S SO 54
S SH 784
X X 455
X XB 4545
X XC 854
Y Y 45
Y YB 98
Y YC 4566
L L 78
L LP 12
L LX 655
I want to keep only those groups of rows whose "Subset" column values follow the pattern *, *B, *C (note: * is any letter or digit).
For example, the output should look like this:
Name Subset Value
A A 14
A AB 187
A AC 5465
X X 455
X XB 4545
X XC 854
Y Y 45
Y YB 98
Y YC 4566
PS: The actual dataset can have many columns/rows.
According to your definition, rows with Subset 'S' or 'L' should also be in the output, since each of them matches a single character.
This code should work:
# Create a list of allowed regex pattern. Easy to extend
allow = [
'.',
'.B',
'.C'
]
# Put your pattern into (), so called regex match groups
allow = [f'({a})' for a in allow]
# Use the Pipe operator, to check, if one of the pattern matches
allow = '|'.join(allow)
# Now select only lines, where 'Subset' matches the pattern.
df = df.loc[df.Subset.str.match(f'^({allow})$')]
print(df)
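To make the row-wise behaviour concrete, here is a minimal self-contained sketch that runs the same pattern list on the sample data from the question. The DataFrame construction and the name row_filtered are assumptions added for illustration; the snippet above takes df as given.

```python
import pandas as pd

# Sample data copied from the question (the construction itself is an assumption)
df = pd.DataFrame({
    "Name":   list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value":  [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
               45, 98, 4566, 78, 12, 655],
})

# Same allow-list as above: one regex group per pattern, joined with '|'
allow = "|".join(f"({p})" for p in [".", ".B", ".C"])

# Each row is tested on its own, independently of the other rows of its Name
row_filtered = df.loc[df["Subset"].str.match(f"^({allow})$")]
print(row_filtered["Subset"].tolist())
```

Because every row is checked in isolation, the single-character rows S and L survive while SO, SH, LP and LX are dropped, so the result is not the group-level output shown in the question. Note also that '.' matches any character, not only letters and digits.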
Try this:
new_df = df.groupby('Name').filter(lambda group: group['Subset'].str.match(r'^[a-zA-Z0-9]([BC])?$').all())
Output:
>>> new_df
Name Subset Value
0 A A 14
1 A AB 187
2 A AC 5465
6 X X 455
7 X XB 4545
8 X XC 854
9 Y Y 45
10 Y YB 98
11 Y YC 4566
Explanation:
df.groupby('Name') splits the dataframe into groups by Name, so the first group will contain all A rows, etc.
.filter(...) then keeps only the groups in which ALL (denoted by .all()) values of the Subset column match the regular expression ^[a-zA-Z0-9]([BC])?$ (which reads: one of a through z, or A through Z, or 0 through 9, at the beginning of the string (^); then optionally one of B or C, at the end of the string ($)).
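The group-level filter described above can be tried end to end with the sketch below. It rebuilds the sample data from the question (the construction is an assumption), applies the same groupby/filter idea, and also shows an alternative that broadcasts a per-row check to the whole group with transform("all"). The names pattern, filtered, row_ok, group_ok and masked are hypothetical, and str.fullmatch is used here in place of match with ^ and $ anchors.

```python
import pandas as pd

# Sample data copied from the question (the construction itself is an assumption)
df = pd.DataFrame({
    "Name":   list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value":  [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
               45, 98, 4566, 78, 12, 655],
})

# One letter or digit, optionally followed by B or C; fullmatch anchors both ends
pattern = r"[a-zA-Z0-9][BC]?"

# Variant 1: keep a group only if every Subset value in it matches
filtered = df.groupby("Name").filter(
    lambda g: g["Subset"].str.fullmatch(pattern).all()
)

# Variant 2: compute the per-row check once, then spread "all rows match"
# back onto every row of the group and use it as a boolean mask
row_ok = df["Subset"].str.fullmatch(pattern)
group_ok = row_ok.groupby(df["Name"]).transform("all")
masked = df[group_ok]

print(filtered)
print(masked.index.tolist())
```

Both variants keep the A, X and Y groups on this data. Neither one requires that all three forms (*, *B, *C) are actually present in a group, nor that the first character equals Name; they only reject groups containing a value of some other form.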