How to extract tagged text from a text file using a regex?
For a class, I need to extract everything that comes between the tags <seg> ... </seg>, and I'm trying to do this with Python instead of wasting hours doing it by hand (the file is well over 400 lines). The code I have right now is this (something I found online and changed a little so that it doesn't print the line number):
import re
err_occur = []
pattern = re.compile(r"<seg>(.*)</seg>")
try:
    with open('corpus.txt', 'rt') as in_file:
        for linenum, line in enumerate(in_file):
            if pattern.search(line) != None:
                err_occur.append((linenum, line.rstrip('\n')))
    for linenum, line in err_occur:
        print(line, sep='')
except FileNotFoundError:
    print("Input file not found.")
The only problem I have with this is that it prints the <seg> and </seg> tags in the results, which I don't want. I've tried to create groups (which you can see in my use of parentheses in the pattern variable), but I have no idea how to change the code to return just group 1 (I've tried many different ways).
You need to use a positive lookbehind and a positive lookahead. The <seg> and </seg> in your regex consume text, so they are part of the match, whereas a lookbehind and a lookahead only check that <seg> and </seg> are present, respectively, without consuming any characters. Only the text between the tags is matched, as long as you print the matched text (for example match.group()) rather than the whole line.
Tl;dr: the lookbehind and lookahead match string in <seg>string</seg> and not the tags.
So your regex should look like (?<=<seg>).*(?=</seg>), and that should be fine.
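A minimal sketch of the lookaround approach follows. The sample strings are made up for illustration and are not from the asker's corpus, and the sketch uses the non-greedy `.*?` instead of `.*`, which is an assumption on my part so that two tag pairs on one line are returned separately.

```python
import re

# Hypothetical sample lines; not taken from the asker's corpus.
single = "id=7 <seg>Hello world</seg> trailing text"
double = "<seg>a</seg> <seg>b</seg>"

# The lookbehind and lookahead assert that the tags are present
# without making them part of the match.
lookaround = re.compile(r"(?<=<seg>).*?(?=</seg>)")

match = lookaround.search(single)
text = match.group(0) if match else None   # whole match, tags excluded
both = lookaround.findall(double)          # one entry per tag pair

print(text)
print(both)
```

Because the tags are never part of the match, `group(0)` is already the bare text and no capturing group is needed.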
Here's something that will print all the tagged text in each line without the tags:
The important modification was changing your regex from r"<seg>(.*)</seg>" to r"<seg>(.*?)</seg>" (note the added ? after the *). This makes the quantifier "non-greedy", so it matches as little text as possible instead of as much of the remaining text as possible (the default "greedy" mode). This is discussed in greater detail in the Regular Expression HOWTO section of Python's online documentation.
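To make the difference concrete, here is a small comparison on a made-up line containing two tag pairs (the sample string is hypothetical):

```python
import re

# Hypothetical line with two tagged segments.
line = "<seg>first</seg> and <seg>second</seg>"

# Greedy: .* runs to the LAST </seg> on the line.
greedy = re.findall(r"<seg>(.*)</seg>", line)

# Non-greedy: .*? stops at the FIRST </seg> after each <seg>.
lazy = re.findall(r"<seg>(.*?)</seg>", line)

print(greedy)  # ['first</seg> and <seg>second']
print(lazy)    # ['first', 'second']
```

If every line holds at most one tag pair, both patterns give the same result; the non-greedy form only matters once a line can contain several.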
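Another significant change, regex-wise, was to use pattern.findall() instead of pattern.search(). Because the pattern contains a single group, findall() returns a list holding that group's text for every match in the line, whereas search() returns only the first match.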
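As a rough illustration of the two calls side by side, and of how group 1 can be read from a `search()` result, consider this sketch (the sample line is invented):

```python
import re

pattern = re.compile(r"<seg>(.*?)</seg>")

# Hypothetical line with two tagged segments.
line = "<seg>one</seg><seg>two</seg>"

# search() returns a match object for the first hit only (or None);
# group(1) is the text captured by the parentheses.
first = pattern.search(line)
first_text = first.group(1) if first else None

# findall() returns the captured text of every hit as a list of strings.
all_texts = pattern.findall(line)

print(first_text)  # one
print(all_texts)   # ['one', 'two']
```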
I also removed all the parts of the code dealing with line numbers, since you mentioned you weren't interested in that information.
import re
err_occur = []
pattern = re.compile(r"<seg>(.*?)</seg>")
input_filename = 'corpus.txt'
try:
    with open(input_filename, 'rt') as in_file:
        for line in in_file:
            matches = pattern.findall(line)
            if matches:
                for match in matches:
                    err_occur.append(match)
except FileNotFoundError:
    print("Input file %r not found." % input_filename)
for tagged in err_occur:
    print(tagged)
You can use BeautifulSoup for this.
from bs4 import BeautifulSoup
# your_input is a placeholder for the text read from the file
soup = BeautifulSoup(your_input, "html.parser")
print(soup.find_all("seg")[0].decode_contents())
The regex can also be:
import re
print(re.findall(r"<seg>(.*?)</seg>", your_input))
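One caveat when running the regex over the whole file contents rather than line by line: by default `.` does not match a newline, so a segment that spans several lines would be skipped. The question does not say whether that happens in the corpus, so the following is only a sketch under that assumption; `extract_segments` is a hypothetical helper name and the sample text is invented.

```python
import re

def extract_segments(text):
    """Return the text inside every <seg>...</seg> pair in `text`.

    re.DOTALL lets '.' match newlines, so segments that span
    several lines are captured as well.
    """
    return re.findall(r"<seg>(.*?)</seg>", text, flags=re.DOTALL)

# Hypothetical input; in practice this would be in_file.read().
sample = "<seg>one line</seg>\n<seg>two\nlines</seg>\n"
segments = extract_segments(sample)
print(segments)  # ['one line', 'two\nlines']
```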