New to Python but old to living. I am attempting to use multiple regex patterns from a txt file to extract data from a news article, which is also a txt file. I have gotten it to the point where I can find matches but not save the extracted data. This is what I have so far, in raw, unhygienic, non-Pythonic script. I appreciate all comments, as I am self-learning.
import re
reg_ex = open('APT1.txt', "r", encoding = 'utf-8-sig')
lines = reg_ex.read()
strip = lines.strip()
reggie = strip.split(';')
reggie_lst = []
match_lst = []
for raw_regex in reggie:
    reggie_lst.append(re.compile(raw_regex))
get_string = open("APT.txt", "r", encoding = 'utf-8-sig')
nystring = get_string.read()
if any(compiled_reg.search(nystring) for compiled_reg in reggie_lst):
    print("Got some Matches")
You can use re.findall() to extract your data into a list, instead of just asking whether a regex has matched.
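import re

# Read the ';'-separated patterns. Empty entries (e.g. after a trailing ';')
# are skipped, because an empty pattern matches at every position.
with open('APT1.txt', "r", encoding='utf-8-sig') as reg_ex:
    reggie_lst = [raw_regex for raw_regex in reg_ex.read().strip().split(';') if raw_regex]

# Read the article text to search.
with open("APT.txt", "r", encoding='utf-8-sig') as get_string:
    nystring = get_string.read()

match_lst = []
for reg in reggie_lst:
    # findall() returns all non-overlapping matches of the pattern as a list
    for text_match in re.findall(reg, nystring):
        match_lst.append(text_match)
        print("Got match for regex {}: {}".format(reg, text_match))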
Instead of printing each match in the last line, you can of course save it to a new file. In this example I also dropped re.compile(), purely so that the raw pattern string can be printed for debugging.
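As a rough sketch of the "save it in a new file" idea, the matches could be collected first and then written out one per line. The helper names (load_patterns, collect_matches, save_matches), the sample patterns, the sample article text and the output file name are all made up for illustration and are not part of the original question. The sketch also assumes that surrounding whitespace in a pattern is not significant, and it uses re.finditer() with group(0) so that the whole match is stored even if a pattern contains groups.

```python
import os
import re
import tempfile


def load_patterns(raw):
    """Split a ';'-separated string of regexes, dropping empty entries."""
    # Assumption: leading/trailing whitespace around a pattern is not significant.
    return [p.strip() for p in raw.split(';') if p.strip()]


def collect_matches(patterns, text):
    """Return a list of (pattern, matched_text) pairs for every match."""
    results = []
    for pattern in patterns:
        compiled = re.compile(pattern)
        for m in compiled.finditer(text):
            # group(0) is always the whole match, regardless of groups.
            results.append((compiled.pattern, m.group(0)))
    return results


def save_matches(results, path):
    """Write one tab-separated 'pattern<TAB>match' line per result."""
    with open(path, "w", encoding="utf-8") as out:
        for pattern, matched in results:
            out.write("{}\t{}\n".format(pattern, matched))


# Hypothetical sample data standing in for APT1.txt and APT.txt.
raw_patterns = r"\b\d{4}\b;[A-Z]{3}\d;"
article = "APT1 was first reported in 2013 and again in 2014."

results = collect_matches(load_patterns(raw_patterns), article)

# Written to the temp directory only so the sketch runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "matches.txt")
save_matches(results, out_path)
```

Note that a compiled pattern still exposes its source string through its .pattern attribute, so compiling does not prevent printing or saving the regex next to each match.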
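A word of caution about parentheses (groups) in your regex: re.findall() behaves a little differently from re.search() or re.match(). If the pattern contains capturing groups, findall() returns the contents of the groups rather than the whole match, so use non-capturing groups, (?: …), when you want the whole match; see also this post.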
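To make the difference concrete, here is a small sketch with a made-up sample string and made-up patterns. It compares what re.findall() returns with and without capturing groups, and shows re.finditer() as one alternative that always gives access to the whole match.

```python
import re

text = "error 404 and error 500"  # hypothetical sample text

# One capturing group: findall() returns only the group's content.
with_group = re.findall(r"error (\d+)", text)
# with_group == ['404', '500']

# Non-capturing group: findall() returns the whole match.
non_capturing = re.findall(r"error (?:\d+)", text)
# non_capturing == ['error 404', 'error 500']

# Several capturing groups: findall() returns a list of tuples.
two_groups = re.findall(r"(error) (\d+)", text)
# two_groups == [('error', '404'), ('error', '500')]

# finditer() yields match objects; group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r"error (\d+)", text)]
# whole == ['error 404', 'error 500']
```

If the patterns in the file are written by someone else and may contain groups, the finditer() form is probably the safer default.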