How to extract tagged text from a text file using regular expressions?

Question

For a class, I need to extract everything that comes between the element <seg> ... </seg>, and I'm trying to do this in Python instead of wasting hours doing it by hand (it's well over 400 lines). What I have right now is this (code that I found online and changed a little so that it doesn't print the line number):

import re
err_occur = []
pattern = re.compile(r"<seg>(.*)</seg>")
try:
    with open('corpus.txt', 'rt') as in_file:
        for linenum, line in enumerate(in_file):
            if pattern.search(line) is not None:
                err_occur.append((linenum, line.rstrip('\n')))
        for linenum, line in err_occur:
            print(line, sep='')
except FileNotFoundError:
    print("Input file not found.")

The only problem I have with this is that it prints the <seg> and </seg> in the results, which I don't want. I've tried to create groups (which you can see in my use of parentheses in the pattern variable), but I have no idea how to change the code to return just group 1 (I've tried many different ways).

Answer 1

You need to use a positive lookbehind and a positive lookahead. The <seg> and </seg> in your regex consume text, so you see them in your results. A lookbehind and a lookahead only check that <seg> and </seg>, respectively, are present, without consuming any characters, so the match contains only the text between them.

Tl;dr: with a lookahead and a lookbehind, the regex matches string in <seg>string</seg> and not the tags.

So your regex should look like (?<=<seg>).*(?=</seg>), which should work.

Lookahead and lookbehind assertions are described in the documentation for Python's re module.
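As a rough illustration of the lookaround approach, here is a minimal sketch. The sample line is hypothetical (it is not from the asker's corpus), and the second pattern adds a non-greedy `?`, which is an assumption on my part and not part of the regex given in this answer; it only matters when a line holds more than one <seg> element.

```python
import re

# Hypothetical sample line with two <seg> elements.
line = "<seg>first sentence</seg> filler <seg>second sentence</seg>"

# Regex from the answer: the tags are checked but not consumed.
greedy_lookaround = re.compile(r"(?<=<seg>).*(?=</seg>)")

# Assumed variant: same lookarounds, but with a non-greedy quantifier.
lazy_lookaround = re.compile(r"(?<=<seg>).*?(?=</seg>)")

# With no capturing groups, findall() returns the full matches.
greedy_matches = greedy_lookaround.findall(line)
lazy_matches = lazy_lookaround.findall(line)

print(greedy_matches)
print(lazy_matches)
```

On a line with a single <seg> element both patterns return the same text; with several elements, the greedy form runs from the first <seg> to the last </seg>.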

Answer 2

Here's something that will print all the tagged text in each line without the tags.

The important modification was changing your regex from r"<seg>(.*)</seg>" to r"<seg>(.*?)</seg>" (note the added ? after the *). This makes the quantifier "non-greedy", so it matches as little text as possible instead of as much of the remaining text as possible (the default "greedy" mode). This is discussed in greater detail in the Regular Expression HOWTO section of Python's online documentation.
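To make the greedy versus non-greedy difference concrete, here is a small sketch on a hypothetical line (the sample text is an assumption, not taken from the asker's file):

```python
import re

# Hypothetical sample line containing two <seg> elements.
line = "<seg>one</seg> text <seg>two</seg>"

# Greedy: .* runs to the last </seg> on the line.
greedy = re.findall(r"<seg>(.*)</seg>", line)

# Non-greedy: .*? stops at the first </seg> it can reach.
lazy = re.findall(r"<seg>(.*?)</seg>", line)

print(greedy)
print(lazy)
```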

Another significant change, regex-wise, was to use pattern.findall() instead of pattern.search(). When the pattern contains a single group, findall() returns a list of the text captured by that group for every match in the line.

I also removed all the parts of the code dealing with line numbers, since you mentioned you weren't interested in that information.

import re

err_occur = []
pattern = re.compile(r"<seg>(.*?)</seg>")
input_filename = 'corpus.txt'

try:
    with open(input_filename, 'rt') as in_file:
        for line in in_file:
            matches = pattern.findall(line)
            if matches:
                for match in matches:
                    err_occur.append(match)
except FileNotFoundError:
    print("Input file %r not found." % input_filename)

for tagged in err_occur:
    print(tagged)
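For comparison, the following sketch shows how group 1 could be read from pattern.search() as the question asked, next to what pattern.findall() returns. The sample line is hypothetical.

```python
import re

pattern = re.compile(r"<seg>(.*?)</seg>")

# Hypothetical sample line, not taken from the asker's corpus.
line = "<seg>alpha</seg> and <seg>beta</seg>"

# search() returns a match object for the first match only (or None).
first = pattern.search(line)
whole = first.group(0)   # the entire match, tags included
inner = first.group(1)   # only the text captured by the parentheses

# findall() returns the group-1 text of every match on the line.
all_inner = pattern.findall(line)

print(whole)
print(inner)
print(all_inner)
```

With search(), a line that holds several elements would need a loop or pattern.finditer(); findall() collects them in one call.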

Answer 3

You can use BeautifulSoup for this.

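from bs4 import BeautifulSoup
# your_input is a placeholder for the text read from the file
soup = BeautifulSoup(your_input, "html.parser")
print(soup.find_all("seg")[0].decode_contents())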

Alternatively, the regex can be:

import re
# your_input is a placeholder for the text read from the file
print(re.findall(r"<seg>(.*?)</seg>", your_input))

http://tpcg.io/B6h2So
