My dataset looks like this:
Name Subset Value
A A 14
A AB 187
A AC 5465
S S 454
S SO 54
S SH 784
X X 455
X XB 4545
X XC 854
Y Y 45
Y YB 98
Y YC 4566
L L 78
L LP 12
L LX 655
I want to keep only those groups of rows whose "Subset" values follow the patterns *, *B, *C (note: * is any single letter or digit).
For example, the output should look like:
Name Subset Value
A A 14
A AB 187
A AC 5465
X X 455
X XB 4545
X XC 854
Y Y 45
Y YB 98
Y YC 4566
PS: The actual dataset can have many columns and rows.
According to your definition, the rows with Subset 'S' or 'L' should also be in the output, since each of them matches the single-character pattern.
This code should work:
# List of allowed regex patterns ('.' matches any single character). Easy to extend.
allow = [
    '.',
    '.B',
    '.C'
]
# Wrap each pattern in parentheses, i.e. a regex group
allow = [f'({a})' for a in allow]
# Join the groups with the pipe operator so that any one of them may match
allow = '|'.join(allow)
# Select only the rows where 'Subset' matches the whole pattern
df = df.loc[df['Subset'].str.match(f'^({allow})$')]
print(df)
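To see how this row-wise filter behaves, here is a minimal self-contained sketch. The small DataFrame is an assumed excerpt of the question's data (only the A and S groups), and the names pattern and row_filtered are illustrative. Note that it filters individual rows rather than whole groups, so the single-character row S is kept while SO and SH are dropped.

```python
import pandas as pd

# Assumed excerpt of the question's data: the A and S groups only
df = pd.DataFrame({
    "Name":   ["A", "A", "A", "S", "S", "S"],
    "Subset": ["A", "AB", "AC", "S", "SO", "SH"],
    "Value":  [14, 187, 5465, 454, 54, 784],
})

# Build the alternation "(.)|(.B)|(.C)" from the list of allowed patterns
pattern = "|".join(f"({p})" for p in [".", ".B", ".C"])

# Anchor with ^ and $ so the whole Subset value has to match
row_filtered = df.loc[df["Subset"].str.match(f"^({pattern})$")]
print(row_filtered)
```

If whole groups should be dropped whenever any of their rows fails the pattern, the check has to be applied per group instead of per row.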
Try this:
new_df = df.groupby('Name').filter(lambda group: group['Subset'].str.match(r'^[a-zA-Z0-9]([BC])?$').all())
Output:
>>> new_df
Name Subset Value
0 A A 14
1 A AB 187
2 A AC 5465
6 X X 455
7 X XB 4545
8 X XC 854
9 Y Y 45
10 Y YB 98
11 Y YC 4566
Explanation:
The code groups the rows by Name, so the first group will contain all A rows, etc.
It then keeps only the groups in which all (.all()) values of the Subset column match the regular expression ^[a-zA-Z0-9]([BC])?$ (which reads: one of a through z, or A through Z, or 0 through 9, at the beginning of the string (^); and then optionally one of B or C, at the end of the string ($)).
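The same logic can be checked without pandas. The sketch below uses only the standard re module: the groups dictionary is a hand-written stand-in for the question's data, and the names PATTERN, groups and kept are illustrative, not part of the answer above.

```python
import re

# The regular expression from the answer: one letter or digit,
# optionally followed by a single B or C, and nothing else
PATTERN = re.compile(r'^[a-zA-Z0-9]([BC])?$')

# Hand-written stand-in for the Subset values of each Name group
groups = {
    "A": ["A", "AB", "AC"],
    "S": ["S", "SO", "SH"],
    "X": ["X", "XB", "XC"],
    "Y": ["Y", "YB", "YC"],
    "L": ["L", "LP", "LX"],
}

# Keep a group only if every one of its Subset values matches
kept = [
    name
    for name, subsets in groups.items()
    if all(PATTERN.match(s) for s in subsets)
]
print(kept)
```

The S and L groups are dropped because SO, SH, LP and LX end in a letter other than B or C. The pattern does not check that the first character equals the group's Name, nor that all three forms are present in a group.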