How to filter a list by regex patterns that are stored in another list and calculate statistics on the number of regex matches?

Question

I have two lists: the first contains regex patterns and the second contains values. I want to match each pattern against the values and count how many values each pattern matches. Finally, I want to put those counts into a new dataframe.

Here is a breakdown:

List 1:

regex_list = ['Error: Look ','Parking Charge Notice', '^Follow Up$']

List 2:

value_list = ['Follow Up','abc123','abc123', 'Error: Look', 'Follow Up']

I want the new dataframe's output to look like:

pattern, count
'Error: Look', 1
'^Follow Up$', 2
'Parking Charge Notice', 0

As you can see, the new dataframe shows each pattern from list 1 and how many times it matched a value in list 2.

Here is my Python so far:

import re 
regex_list = ['Error: Look ', 'Parking Charge Notice', '^Follow Up$']
value_list = ['Follow Up', 'abc123', 'abc123', 'Error: Look', 'Follow Up']


p = re.compile(r'^Follow Up$')
matches = p.findall(value_list)

Here is my output:

Traceback (most recent call last):
  File "C:/Users/e136320/PycharmProjects/scrape_imsva_v2/working/regex_test.py", line 35, in <module>
    matches = p.findall(value_list)
TypeError: expected string or bytes-like object

I receive the error shown above. Is there a way to loop through my regex list automatically, check value_list against each pattern, and then put each pattern and its count in a dataframe?

I know my code isn't much, but I am new to Python and dataframes and completely lost, so any ideas or suggestions would help.

Answer 1

You can try the following code:

import re
import pandas as pd

regex_list = ['Error: Look', 'Parking Charge Notice', '^Follow Up$']
value_list = ['Follow Up', 'abc123', 'abc123', 'Error: Look', 'Follow Up']
rows = []  # one dict per (pattern, value) pair that matched

for j in regex_list:
  p = re.compile(j)
  for i in value_list:
    # findall expects a single string, so test one value at a time
    matches = p.findall(i)
    if len(matches) != 0:
      rows.append({'regex': j, 'matched': matches})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows)
print(df)
count = df.groupby('regex')['matched'].count().reset_index()
count.columns = ['regex', 'count']
print(count)

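Based on the error message you posted, the problem is that you are passing the whole list to findall, which expects a single string. Looping over value_list and matching one value at a time avoids the TypeError. Note that the trailing space in 'Error: Look ' is removed from regex_list here; with it, the pattern would not match the value 'Error: Look'. Also note that groupby only lists patterns with at least one match, so 'Parking Charge Notice' will not appear with a count of 0.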
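
If you also want patterns with zero matches in the result, as in the expected output from the question, one option is to count per pattern directly instead of grouping the matched rows. The following is only a minimal sketch using the standard library: the helper name count_matches is made up for illustration, and it assumes the trailing space in 'Error: Look ' is unintended and can be dropped.

```python
import re

regex_list = ['Error: Look', 'Parking Charge Notice', '^Follow Up$']
value_list = ['Follow Up', 'abc123', 'abc123', 'Error: Look', 'Follow Up']


def count_matches(patterns, values):
    """Return a list of (pattern, count) tuples, one per pattern, zeros included.

    count_matches is a hypothetical helper name, not part of re or pandas.
    """
    rows = []
    for pattern in patterns:
        compiled = re.compile(pattern)
        # search() takes one string at a time; count the values it matches
        n = sum(1 for value in values if compiled.search(value))
        rows.append((pattern, n))
    return rows


rows = count_matches(regex_list, value_list)
print(rows)
# [('Error: Look', 1), ('Parking Charge Notice', 0), ('^Follow Up$', 2)]
```

Passing the result to pd.DataFrame(rows, columns=['pattern', 'count']) should give a dataframe with one row per pattern. This counts each value at most once per pattern, whereas findall could return several matches within a single value; for the data in the question the two give the same numbers.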

solution1
1 2020-11-19 03:54:11
