Multiple regex patterns to extract data from an article using Python

New to Python but old to living. I am attempting to use multiple regex patterns, stored in a txt file, to extract data from a news article that is also a txt file. I have gotten to the point where I can find matches, but I cannot save the extracted data. This is what I have so far, as a raw, unhygienic, non-Pythonic script. I appreciate all comments, as I am learning on my own.

import re

reg_ex = open('APT1.txt', "r", encoding = 'utf-8-sig')
lines = reg_ex.read()
strip = lines.strip()
reggie = strip.split(';') 


reggie_lst = []
match_lst = []

for raw_regex in reggie:
    reggie_lst.append(re.compile(raw_regex))


get_string = open("APT.txt", "r", encoding = 'utf-8-sig')
nystring = get_string.read()


if any(compiled_reg.search(nystring) for compiled_reg in reggie_lst):
    print("Got some Matches")

You can use re.findall() to extract your data into a list, instead of just asking if a regex has matched.

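import re

# Load the patterns: the file holds regexes separated by ';'.
with open('APT1.txt', "r", encoding='utf-8-sig') as reg_ex:
    reggie = reg_ex.read().split(';')

reggie_lst = []
match_lst = []

for raw_regex in reggie:
    # Strip surrounding whitespace (e.g. line breaks between patterns) and
    # skip empty entries: an empty pattern matches at every position.
    raw_regex = raw_regex.strip()
    if raw_regex:
        reggie_lst.append(raw_regex)

# Load the article text to search.
with open("APT.txt", "r", encoding='utf-8-sig') as get_string:
    nystring = get_string.read()

for reg in reggie_lst:
    for text_match in re.findall(reg, nystring):
        match_lst.append(text_match)  # keep the extracted data for later use
        print("Got match for regex {}: {}".format(reg, text_match))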

Instead of printing each match in the last line, you can of course save it to a new file. In this example I have also dropped the re.compile() step, but only so that the raw pattern string is easy to print while debugging; re.findall() accepts a pattern string directly.
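Saving the matches instead of printing them could look roughly like the sketch below. The helper names collect_matches and save_matches, the tab-separated output format, and the sample patterns and text are all assumptions made for illustration; they are not part of the original script.

```python
import re


def collect_matches(patterns, text):
    """Return a list of (pattern, matched_text) pairs, in pattern order."""
    results = []
    for pattern in patterns:
        # Assumes the patterns contain no capturing groups, so that
        # re.findall() returns the full matched strings.
        for matched_text in re.findall(pattern, text):
            results.append((pattern, matched_text))
    return results


def save_matches(results, out_path):
    """Write one 'pattern<TAB>match' line per result (hypothetical format)."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for pattern, matched_text in results:
            out_file.write("{}\t{}\n".format(pattern, matched_text))


# Hypothetical sample data standing in for APT1.txt and APT.txt.
sample_patterns = [r"\d{4}-\d{2}-\d{2}", r"APT\d+"]
sample_text = "Report dated 2020-01-15 mentions APT1 and APT28."

sample_results = collect_matches(sample_patterns, sample_text)
for pattern, matched_text in sample_results:
    print("Got match for regex {}: {}".format(pattern, matched_text))
```

Calling save_matches(sample_results, "matches.txt") would then write the three pairs to a file, where "matches.txt" is just a placeholder name.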

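Be careful when using parentheses (groups) in your regex. re.findall() behaves a little differently from re.search() or re.match(): if the pattern contains capturing groups, it returns the contents of the groups rather than the whole match. To get the whole match, use non-capturing groups, (?:…); see also this post.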
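A small illustration of the group caveat, using a made-up sample string (the text and patterns are assumptions chosen only to show the behaviour):

```python
import re

text = "error 404 and error 500"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"(error|warning) \d+", text)
print(captured)        # ['error', 'error']

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r"(?:error|warning) \d+", text)
print(whole)           # ['error 404', 'error 500']

# re.search() exposes the whole match as group(0) even with capturing groups.
first = re.search(r"(error|warning) \d+", text)
print(first.group(0))  # error 404
```

If the groups cannot be rewritten, iterating with re.finditer() and reading group(0) from each match object is another way to get the full matched text.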


 