Python, using a regex to eliminate lines within angle brackets

Question

I'm writing a Python script to assign grammatical categories to words in several text files. Each text file has file headers within angle brackets <>. Throughout the texts there are also additional lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:

<title      Titipuru Supay>
<speaker    name>
<sex        female>
<dialect    Pastaza>
<register   narrative>
<contributor    name>

chan; payguna serenkya man chiga; 
<ima?> 
payguna kirina man, chiga, mana 
shayachira; ninagunan shi tujsirani nira: 
illaparani nira shi illapay 
<173> 
pasasha, ima shi kasna nin, nisha,

Even though each file has the same number of headers, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple, like a re.sub statement that removes everything in between < and >, including the brackets.

with open(file, encoding='utf-8') as file_in:
    text = file_in.read()
    re.sub(r"<.*>", " ", text)

I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other patterns, such as \\<.*\\>

Am I just not getting the regex right, or is there something deeper going on here?

Answer 1

Strings are immutable, meaning they cannot be modified in place; a name can only be rebound to a new string. The re.sub(...) call is working, but it returns a new string rather than changing text. Try this:

text = re.sub(r"<.*>", " ", text)
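A minimal sketch of the difference, using a short in-memory string instead of a file (the names sample and cleaned are only for illustration):

```python
import re

# Small in-memory sample standing in for the file contents (illustrative only).
sample = (
    "<title      Titipuru Supay>\n"
    "chan; payguna serenkya man chiga;\n"
    "<ima?>\n"
    "payguna kirina man, chiga, mana\n"
)

# Calling re.sub and discarding the return value leaves `sample` untouched.
re.sub(r"<.*>", " ", sample)
print("<" in sample)   # the brackets are still there

# Assigning the return value keeps the substituted text.
cleaned = re.sub(r"<.*>", " ", sample)
print("<" in cleaned)  # the brackets are gone
```

In the original script, the same assignment would go inside the with block, right after file_in.read().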

If this still doesn't work, please give us more information about your problem.

Answer 2

From what I understand, you may have several <...> on the same line. In that case <.*> is greedy and matches from the first < to the last > on the line, so a negated character class is much safer:

text = re.sub(r"<[^>]*>", " ", text)

The text variable, of course, has to be reassigned because Python strings are immutable. The regex now matches <, then zero or more characters other than >, and then >.
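To see how the two patterns differ, here is a small sketch on a made-up line with two bracketed notes (the line is hypothetical, not taken from the asker's files):

```python
import re

# Hypothetical line containing two bracketed notes.
line = "payguna <ima?> kirina man <173> chiga"

# Greedy: .* runs from the first "<" to the last ">" on the line,
# so the words between the two notes are removed as well.
greedy = re.sub(r"<.*>", " ", line)

# Negated character class: each match stops at the first ">".
negated = re.sub(r"<[^>]*>", " ", line)

print(repr(greedy))
print(repr(negated))
```

One thing to keep in mind: unlike ., the class [^>] also matches newlines, so a stray < with no closing > on its line would consume text up to the next > further down.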

See the regex demo.


2 answers

Answer 1: score 4, 2016-06-15 18:07:49

Answer 2: score 1, accepted, 2016-06-15 19:15:19
