简体   繁体   中英

Creating a list of regex matches from a data frame in Python and then counting the number of matches

I'm trying to

-match the number of times a dollar/cent/percent appears in a column in a data frame -create a function that lists all of these matches -create a function that count the number of matches

I'm able to successfully use regex to find the matches, but I'm unable to get a list to appear and count them. I tried using str.extract() to list out all of the matches and str.contains().sum() to count the number of matches. However, I can't get these to work at the same time and I'm not sure if this is even the right approach.

Here's what I have so far:

(?:^|\s)(\$?(?:\d[ ,.]?)+%?)(?=\s|$) 

This will match any dollar/cent or percent, and accounts for a bunch of different typos that I'm trying to include so the regex string is fine.

import re
import pandas as pd

sample_csv = pd.DataFrame({'SMS_text': ['$1', '214-233-4455'],
                           'SMS_number': ['1', '1'],
                           'Amount': ['12.12%', '$10.12'],
                           'Percent': ['10%', '1']})

sample_csv.Amount.str.extract(r'(\$?(?:\d[ ,.]?)+%?)(?=\s|$)')

This will list out the matches from the Amount column.

0    12.12%
1    $10.12
Name: Amount, dtype: object

And if I substitute str.extract() with sample_csv.Amount.str.contains(r'(\\$?(?:\\d[ ,.]?)+%?)(?=\\s|$)').sum() it will tell me how many matches there are (ie 2) but again, I can't get them to work at the same time.

Is this the right approach or am I setting this up wrong? Seems like maybe a findall() or maybe an re.compile() might work better, but I'm not sure how to get these to work properly.

My desired output from above would be something like:

regex_number_matches = ['12.12%', '$10.12']
regex_number_matches_count = 2

If you want total number of times % or $ appears in your dataframe, try

sample_csv.stack().str.contains('%|\$').sum()

It returns 4

If you want all the instance of the cells in which those appear, try

sample_csv[sample_csv.stack().str.contains('%|\$').unstack()]

And you get

    Amount  Percent SMS_number  SMS_text
0   12.12%  10%     NaN         $1
1   $10.12  NaN     NaN         NaN

It seems (to me) you want to find those values in the Amount column that can't be converted to numbers:

Assuming you have the following DF:

In [107]: sample_csv
Out[107]:
   Amount Percent  SMS_number      SMS_text
0  12.12%     10%           1            $1
1  $10.12       1           1  214-233-4455
2    1.23       2           1  214-233-4455
3    33,5       3           1  214-233-4455

first create a boolean mask of those rows where Amount can't be converted to numeric value:

In [108]: mask = pd.to_numeric(sample_csv['Amount'], errors='coerce').isnull()

In [109]: mask
Out[109]:
0     True
1     True
2    False
3     True
Name: Amount, dtype: bool

now you can easily filter DF, using this mask:

In [110]: sample_csv.loc[mask, 'Amount']
Out[110]:
0    12.12%
1    $10.12
3      33,5
Name: Amount, dtype: object

In [111]: sample_csv.loc[mask, 'Amount'].tolist()
Out[111]: ['12.12%', '$10.12', '33,5']

In [112]: len(sample_csv.loc[mask, 'Amount'])
Out[112]: 3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM