Python, eliminating lines within angle brackets with regex

I'm writing a Python script to assign grammatical categories to words in several text files. Each text file has file headers within angle brackets <>. Throughout the texts there are also additional lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:

<title      Titipuru Supay>
<speaker    name>
<sex        female>
<dialect    Pastaza>
<register   narrative>
<contributor    name>

chan; payguna serenkya man chiga; 
<ima?> 
payguna kirina man, chiga, mana 
shayachira; ninagunan shi tujsirani nira: 
illaparani nira shi illapay 
<173> 
pasasha, ima shi kasna nin, nisha,

Even though each file has the same number of headers, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple, like a re.sub statement that removes everything in between < and >, including the brackets.

with open(file, encoding='utf-8') as file_in:
        text = file_in.read()
        re.sub(r"<.*>", " ", text)

I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other solutions like: \\<.*\\>

Am I just not getting the regex right, or is there something deeper here?

Strings are immutable, meaning they cannot be modified in place; a name can only be rebound to a new string. The re.sub(...) call is working, but it returns a new string rather than changing text. Try this:

text = re.sub(r"<.*>", " ", text)

If this still doesn't work, please give us more information about your problem.
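To check the point about the return value, here is a minimal sketch on a short made-up string (the sample text and the names unchanged and cleaned are only for illustration):

```python
import re

text = "payguna kirina man\n<173>\npasasha"

# Calling re.sub without using its return value leaves `text` as it was.
re.sub(r"<.*>", " ", text)
unchanged = text

# Keeping the returned string is what actually captures the substitution.
cleaned = re.sub(r"<.*>", " ", text)

print(repr(unchanged))
print(repr(cleaned))
```

In the script from the question, the same idea means assigning the result back, as in the one-line fix above.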

From what I understand, you may have several <...> on the same line. In this case, you are much safer with a negated character class solution:

text = re.sub(r"<[^>]*>", " ", text)

The text variable, of course, must be reassigned because Python strings are immutable. The regex now matches <, then zero or more characters other than >, and then >.
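The difference between the two patterns shows up when one line carries more than one bracketed item. The line below is invented for the comparison (it is not from the original files), and greedy and negated are illustrative names:

```python
import re

# Hypothetical line with two bracketed items on it.
line = "chan; <ima?> payguna <173> chiga"

# Greedy .* runs from the first "<" to the LAST ">" on the line,
# so the word between the two tags is removed as well.
greedy = re.sub(r"<.*>", " ", line)

# [^>]* stops at the first ">", so each tag is matched separately.
negated = re.sub(r"<[^>]*>", " ", line)

print(repr(greedy))
print(repr(negated))
```

Note that [^>] also matches newline characters, so a tag that is opened on one line and closed on a later one would be removed as a single match.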

See the regex demo.
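Since the question asks to remove the lines themselves, replacing each tag with a space leaves a nearly empty line behind. One possible variant, offered as a sketch and not taken from either answer, anchors the pattern to whole lines with re.MULTILINE and also consumes the trailing newline. The helper name strip_tag_lines and the constant TAG_LINE are hypothetical, and the pattern assumes each tag sits alone on its line:

```python
import re

# Assumption: a tag line is optional spaces/tabs, one <...> item, optional
# spaces/tabs, then the end of the line. Hypothetical names throughout.
TAG_LINE = re.compile(r"^[ \t]*<[^>\n]*>[ \t]*\n?", flags=re.MULTILINE)


def strip_tag_lines(text):
    """Return text with every line that holds only a <...> item removed."""
    return TAG_LINE.sub("", text)


sample = (
    "<title      Titipuru Supay>\n"
    "<sex        female>\n"
    "\n"
    "chan; payguna serenkya man chiga; \n"
    "<ima?> \n"
    "payguna kirina man, chiga, mana \n"
    "<173> \n"
    "pasasha, ima shi kasna nin, nisha,\n"
)

cleaned = strip_tag_lines(sample)
print(cleaned)
```

The same function could be applied to the string returned by file_in.read() in the question's script. Bracketed items that share a line with other text are left alone by this pattern, so the negated character class substitution above would still be needed for those.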

