How to extract tagged text from a text file using a regex?

For a class, I need to extract everything that comes between the <seg> ... </seg> tags, and I'm trying to do this with Python instead of wasting hours doing it by hand (the file is well over 400 lines). This is the code I have right now (code I found online and changed a little so that it doesn't print the line number):

import re                           
err_occur = [] 
pattern = re.compile(r"<seg>(.*)</seg>")
try:
    with open ('corpus.txt', 'rt') as in_file:
        for linenum, line in enumerate(in_file):
            if pattern.search(line) != None:
                err_occur.append((linenum, line.rstrip('\n')))
        for linenum, line in err_occur:
            print(line, sep='')
except FileNotFoundError:
    print("Input file not found.")

The only problem I have with this is that it prints the <seg> and </seg> tags in the results, which I don't want. I've tried to create groups (as you can see from the parentheses in the pattern variable), but I have no idea how to change the code to return just group 1 (I've tried many different ways).

You need to use a positive lookbehind and a positive lookahead. The <seg> and </seg> in your regex consume text, so they are part of the match; a lookbehind and a lookahead only check that <seg> and </seg> are present, respectively, without consuming any characters. The match then contains only the text between the tags.

Tl;dr: with a lookbehind and a lookahead, the regex matches the string in <seg>string</seg> and not the tags.

So your regex should look like (?<=<seg>).*(?=</seg>). Note that your loop prints the whole line, so you also need to print the matched text (for example, pattern.search(line).group()) instead of line.

The documentation for Python's re module describes these lookahead and lookbehind assertions.
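As a rough illustration of the lookaround approach, here is a minimal sketch. The sample line is made up for the demonstration, and the pattern uses a non-greedy .*? (a small change from the regex above) so that two tagged spans on the same line are matched separately.

```python
import re

# Lookbehind and lookahead assert that the tags are present
# without including them in the matched text.
# Assumption: .*? (non-greedy) is used here instead of .* so that
# several <seg> pairs on one line produce separate matches.
pattern = re.compile(r"(?<=<seg>).*?(?=</seg>)")

# Hypothetical sample line, not taken from the asker's corpus.
line = "<seg>first</seg> filler <seg>second</seg>"

# With no capturing groups, findall returns the full matches.
segments = pattern.findall(line)
print(segments)
```

Because the pattern has no capturing group, each list element is the whole match, which is already just the text between the tags.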

Here's something that will print all the tagged text in each line without the tags:

The important modification was changing your regex from r"<seg>(.*)</seg>" to r"<seg>(.*?)</seg>". Note the added ? after the *. This makes the quantifier "non-greedy", so it matches as little text as possible instead of as much of the remaining text as possible (the default "greedy" mode). This is discussed in greater detail in the Regular Expression HOWTO section of Python's online documentation.
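The effect of the added ? is easiest to see on a line that contains two tagged spans. The following is a small sketch with a made-up sample line:

```python
import re

# Hypothetical sample line with two tagged spans.
line = "<seg>one</seg> and <seg>two</seg>"

# Greedy: .* runs to the LAST </seg> on the line, so the group
# swallows the inner closing and opening tags.
greedy = re.findall(r"<seg>(.*)</seg>", line)

# Non-greedy: .*? stops at the FIRST </seg> after each <seg>.
lazy = re.findall(r"<seg>(.*?)</seg>", line)

print(greedy)
print(lazy)
```

The greedy version yields a single item that still contains tags, while the non-greedy version yields one item per tagged span.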

Another significant change, regex-wise, was to use pattern.findall() instead of pattern.search().
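To make the difference concrete, this short sketch (using an invented sample line) compares the two calls. search() returns a match object for the first match only, and group(1) on that object gives the text captured by the parentheses, which is what the question was trying to reach; findall() returns the captured text of every match.

```python
import re

pattern = re.compile(r"<seg>(.*?)</seg>")

# Hypothetical sample line, used only for illustration.
line = "<seg>alpha</seg> text <seg>beta</seg>"

# search() stops at the first match; group(1) is the captured text.
first = pattern.search(line)
first_text = first.group(1) if first else None

# findall() returns the captured group of every match on the line.
all_texts = pattern.findall(line)

print(first_text)
print(all_texts)
```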

I also removed all the parts of the code dealing with line numbers, since you mentioned you weren't interested in that information.

import re

err_occur = []
pattern = re.compile(r"<seg>(.*?)</seg>")
input_filename = 'corpus.txt'

try:
    with open(input_filename, 'rt') as in_file:
        for line in in_file:
            matches = pattern.findall(line)
            if matches:
                for match in matches:
                    err_occur.append(match)
except FileNotFoundError:
    print("Input file %r not found." % input_filename)

for tagged in err_occur:
    print(tagged)

You can use BeautifulSoup for this.

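from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
soup = BeautifulSoup(your_input, "html.parser")  # your_input holds the file's text
print(soup.find_all("seg")[0].decode_contents())  # inner content of the first <seg>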

Alternatively, a regex can be used:

import re
print(re.findall(r"<seg>(.*?)</seg>", your_input))  # your_input holds the file's text

http://tpcg.io/B6h2So
