Multiple regex patterns to extract data from article using python

Question

New to Python but old to living. I am attempting to use multiple regex patterns, stored in a txt file, to extract data from a news article, which is also a txt file. I have gotten it to the point where I can find matches but cannot save the extracted data. This is what I have so far, in a raw, unhygienic, non-Pythonic script. I appreciate all comments as I am self-learning.

import re

reg_ex = open('APT1.txt', "r", encoding = 'utf-8-sig')
lines = reg_ex.read()
strip = lines.strip()
reggie = strip.split(';') 


reggie_lst = []
match_lst = []

for raw_regex in reggie:
    reggie_lst.append(re.compile(raw_regex))


get_string = open("APT.txt", "r", encoding = 'utf-8-sig')
nystring = get_string.read()


if any(compiled_reg.search(nystring) for compiled_reg in reggie_lst):
    print("Got some Matches")

Answer 1

You can use re.findall() to extract your data into a list, instead of just asking if a regex has matched.

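import re

# Read the ';'-separated patterns. Surrounding whitespace is stripped and
# empty entries (e.g. after a trailing ';') are skipped, because an empty
# pattern matches at every position in the text.
with open('APT1.txt', "r", encoding='utf-8-sig') as reg_ex:
    reggie_lst = [raw.strip() for raw in reg_ex.read().split(';') if raw.strip()]

# Read the article text to search.
with open("APT.txt", "r", encoding='utf-8-sig') as get_string:
    nystring = get_string.read()

# re.findall() returns all non-overlapping matches of a pattern as a list.
for reg in reggie_lst:
    for text_match in re.findall(reg, nystring):
        print("Got match for regex {}: {}".format(reg, text_match))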

Instead of printing each match in the last line, you can of course save it to a new file. In this example I have also dropped the compile step, but only so that the raw pattern string is easy to print for debugging.
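As a minimal sketch of the "save it to a new file" idea: the helper names collect_matches and save_matches, the tab-separated output format, and the sample patterns and text are all assumptions made for illustration, not part of the original script. It uses re.finditer() and match.group(0) so the whole match is stored even when a pattern contains groups.

```python
import re


def collect_matches(raw_patterns, text):
    """Return a list of (pattern, matched_text) pairs, in pattern order."""
    results = []
    for raw in raw_patterns:
        raw = raw.strip()
        if not raw:  # skip empty entries, e.g. after a trailing ';'
            continue
        compiled = re.compile(raw)
        for match in compiled.finditer(text):
            # group(0) is the whole match, regardless of capturing groups
            results.append((compiled.pattern, match.group(0)))
    return results


def save_matches(results, path):
    """Write one 'pattern<TAB>match' line per result (hypothetical format)."""
    with open(path, "w", encoding="utf-8") as out:
        for pattern, matched_text in results:
            out.write("{}\t{}\n".format(pattern, matched_text))


# Illustrative data standing in for the contents of APT1.txt and APT.txt.
sample_patterns = r"\d+\.\d+\.\d+\.\d+;[\w.]+@[\w.]+;".split(";")
sample_text = "Contact admin@example.com from 10.0.0.1 or 192.168.1.20."
found = collect_matches(sample_patterns, sample_text)
```

With the real files, you would pass the split pattern list and the article text instead of the sample values, then call save_matches(found, "matches.txt") with whatever output filename you prefer.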

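Be careful when using parentheses (groups) in your regex. re.findall() behaves a little differently from re.search() or re.match(): if the pattern contains capturing groups, it returns the group contents instead of the whole match. Use non-capturing groups, (?: … ), in that case; see also this post.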
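To make the group caveat concrete, here is a small sketch. The sample text and patterns are made up for illustration; only the re calls themselves come from the standard library.

```python
import re

text = "error 404 at 10:15, error 500 at 11:30"

# No groups: findall returns the full matched text.
no_groups = re.findall(r"error \d+", text)
# ['error 404', 'error 500']

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"error (\d+)", text)
# ['404', '500']

# Several capturing groups: findall returns a tuple per match.
two_groups = re.findall(r"(error|warn) (\d+)", text)
# [('error', '404'), ('error', '500')]

# Non-capturing group (?:...): grouping for alternation, full match returned.
non_capturing = re.findall(r"(?:error|warn) \d+", text)
# ['error 404', 'error 500']

# Alternative: keep the capturing groups and ask each match for group(0).
via_finditer = [m.group(0) for m in re.finditer(r"(error|warn) (\d+)", text)]
# ['error 404', 'error 500']

print(no_groups, one_group, two_groups, non_capturing, via_finditer)
```

If the patterns in your file come from somewhere else and you cannot easily rewrite their groups, the finditer() variant is probably the less fragile choice.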

1 answer

solution1
1 ACCEPTED 2018-10-03 00:39:46
