
Python Data Cleaning with Pandas

My dataset looks like this:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
S       S         454
S       SO        54
S       SH        784
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566
L       L         78
L       LP        12 
L       LX        655 
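
To make the question reproducible, the sample above could be built as a DataFrame roughly like this. This is only a sketch: the variable name df and the integer type of the Value column are assumptions, not something stated in the question.

```python
import pandas as pd

# Sample data copied from the table above; "df" is an assumed variable name.
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

print(df.head())
```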

I want to keep only those groups of rows (grouped by "Name") in which every "Subset" value matches one of the patterns * , *B , *C (Note: * is any single letter or digit).

For example, the output should look like this:

Name    Subset    Value
A       A         14
A       AB        187
A       AC        5465
X       X         455
X       XB        4545
X       XC        854
Y       Y         45
Y       YB        98
Y       YC        4566

P.S. The actual dataset can have many more columns and rows.

According to your definition, the rows with Subset 'S' or 'L' should also be in the output, since each of them matches the single-character pattern.

This code should work if you want to filter row by row:

# df is the DataFrame from the question.
# Create a list of allowed regex patterns. Easy to extend.
# Note: '.' matches any single character; use '[A-Za-z0-9]' instead
# to allow only letters and digits.
allow = [
    '.', 
    '.B', 
    '.C'
]

# Wrap each pattern in (), i.e. a regex group
allow = [f'({a})' for a in allow]

# Join the groups with the pipe operator, so that any one of them may match
allow = '|'.join(allow)

# Now select only the rows where 'Subset' matches the whole pattern.
df = df.loc[df.Subset.str.match(f'^({allow})$')]
print(df)
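
To see what a row-wise filter actually keeps on the sample data, here is a small self-contained sketch. The DataFrame construction and the names row_mask and rowwise are assumptions made for illustration, and the pattern uses [A-Za-z0-9] rather than '.' to follow the "letter or digit" wording of the question.

```python
import pandas as pd

# Sample data from the question; variable names are assumed.
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

# One letter or digit, optionally followed by B or C.
allow = ["[A-Za-z0-9]", "[A-Za-z0-9]B", "[A-Za-z0-9]C"]
pattern = "^(?:" + "|".join(allow) + ")$"

# Each row is tested on its own, independent of its Name group.
row_mask = df["Subset"].str.match(pattern)
rowwise = df.loc[row_mask]

print(rowwise["Subset"].tolist())
# ['A', 'AB', 'AC', 'S', 'X', 'XB', 'XC', 'Y', 'YB', 'YC', 'L']
```

The single-character rows 'S' and 'L' survive here, so this differs from the expected output in the question, which drops the S and L groups entirely.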

Try this:

new_df = df.groupby('Name').filter(lambda group: group['Subset'].str.match(r'^[a-zA-Z0-9]([BC])?$').all())

Output:

>>> new_df
   Name Subset  Value
0     A      A     14
1     A     AB    187
2     A     AC   5465
6     X      X    455
7     X     XB   4545
8     X     XC    854
9     Y      Y     45
10    Y     YB     98
11    Y     YC   4566

Explanation:

  1. Split the DataFrame into one group per unique Name , so the first group contains all the A rows, and so on.
  2. Keep only the groups in which ALL values of the Subset column (hence .all() ) match the regular expression ^[a-zA-Z0-9]([BC])?$ , which reads: exactly one character from a through z , A through Z , or 0 through 9 at the beginning of the string ( ^ ), optionally followed by one of B or C , and then the end of the string ( $ ).
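
The two steps above can also be written as a boolean mask with groupby(...).transform("all"), which avoids calling a Python lambda per group and may be faster on large frames. This is an alternative sketch, not the code from the answer; the DataFrame construction and the names ok and keep are assumptions.

```python
import pandas as pd

# Sample data from the question; variable names are assumed.
df = pd.DataFrame({
    "Name": list("AAASSSXXXYYYLLL"),
    "Subset": ["A", "AB", "AC", "S", "SO", "SH", "X", "XB", "XC",
               "Y", "YB", "YC", "L", "LP", "LX"],
    "Value": [14, 187, 5465, 454, 54, 784, 455, 4545, 854,
              45, 98, 4566, 78, 12, 655],
})

# Step 1: test every Subset value against the pattern (row-wise booleans).
ok = df["Subset"].str.match(r"^[a-zA-Z0-9][BC]?$")

# Step 2: a row is kept only if every row sharing its Name passed the test.
keep = ok.groupby(df["Name"]).transform("all")

new_df = df[keep]
print(new_df)
```

On the sample data this keeps the A, X and Y groups with their original index labels, the same rows as the groupby/filter version.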


 