How to delete the words between two delimiters?
I have some noisy data, something like:
<@ """@$ FSDF >something something <more noise>
Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?
Use regular expressions:
>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '
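Note that the result above still carries a trailing space. If the goal is exactly "something something", the substitution can be followed by a strip. This is a minimal sketch; the helper name `remove_delimited` and the precompiled `TAG` pattern are illustrative assumptions, not part of the answer:

```python
import re

# Hypothetical helper; the names TAG and remove_delimited are assumptions.
TAG = re.compile(r'<[^>]+>')  # "<", then one or more non-">" characters, then ">"

def remove_delimited(text):
    """Delete every <...> span, then trim surrounding whitespace."""
    return TAG.sub('', text).strip()

s = '<@ """@$ FSDF >something something <more noise>'
print(remove_delimited(s))  # something something
```

Because the pattern uses `+`, an empty pair such as `<>` is left untouched; `<[^>]*>` would remove those as well.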
[Update]
If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.
>>> re.sub(r'<.+>', '', s)
''
Why!?! It happens because regular expressions are "greedy" by default. The .+ matches as much as it can, so the match runs from the first < to the last > in the string, swallowing every > in between - and this is not what we want. We want to match < and stop on the next >, so we use the [^x] pattern, which means "any character but x" (x being >).
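To see the difference directly, it can help to list what each pattern matches instead of substituting. A small sketch using re.findall on the string from the question (the variable names `greedy` and `negated` are only illustrative):

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: ".+" runs to the last ">" in the string, so there is a single match.
greedy = re.findall(r'<.+>', s)

# Negated class: "[^>]+" cannot cross a ">", so each <...> pair matches on its own.
negated = re.findall(r'<[^>]+>', s)

print(greedy)   # one element: the whole string
print(negated)  # two elements: the two delimited spans
```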
The ? operator, placed after the +, makes the match "non-greedy", so this has the same effect:
>>> re.sub(r'<.+?>', '', s)
'something something '
The previous pattern is more explicit, this one is less typing; be aware that on its own x? means zero or one occurrence of x.
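The two roles of ? can be checked side by side. A short sketch; the sample strings are made up for illustration:

```python
import re

# "?" after a plain token: the token is optional (zero or one occurrence).
optional = re.findall(r'colou?r', 'color colour')

# "?" after a quantifier: the quantifier becomes non-greedy.
lazy = re.findall(r'<.+?>', '<a>x<b>')

# Without the "?", the same quantifier is greedy.
greedy = re.findall(r'<.+>', '<a>x<b>')

print(optional)  # ['color', 'colour']
print(lazy)      # ['<a>', '<b>']
print(greedy)    # ['<a>x<b>']
```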
Of course, you can use regular expressions.
import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)
The above code should do it.
First, thank you Paulo Scardine; I used your regex to do something great. The idea was to have a tag-free LibreOffice po file for printing purposes. I made the following script, which cleans the help file into a smaller and simpler one.
import re
# Read the tagged file, replace each <...> tag with a space, write the result.
with open('a.csv') as f:
    text = f.read()
clean = re.sub('<[^>]+>', ' ', text)
with open('b.csv', 'w') as f:
    f.write(clean)
>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '
The re.sub function takes a regular expression and replaces all the matches in the string with the second parameter. In this case, we are searching for all characters between < and > ('<.*?>') and replacing them with nothing ('').
The ? is used in re for non-greedy searches.
More about the re module.
If that "noise" is actually HTML tags, I suggest you look into BeautifulSoup.
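For real HTML, a parser handles cases a regex gets wrong, such as a > inside an attribute value. The sketch below is not BeautifulSoup; it swaps in the standard library's html.parser so that it runs without extra installs. The names `TextOnly` and `strip_tags` are hypothetical:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text that appears between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; tags themselves never reach this method.
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```

This only helps when the input is actual markup; arbitrary noise such as the string in the question is better served by the regex answers.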
Just for interest, you could write some code such as:
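# Write a small two-line sample file to work on.
with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    # Drop the characters between "<" and ">" but keep the delimiters themselves.
    count=0
    ignore=False
    result=[]
    for c in line:
        if c==">" and count==1:
            # Closing delimiter: start copying again (the ">" is copied too).
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        if c=="<" and count==0:
            # Opening delimiter (already copied): skip until the next ">".
            ignore=True
            count=1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line,f.readlines())))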
>>>
<>one<>asfd<>
<>two<><>three<>
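As the output shows, this version keeps the < and > characters themselves and only drops what is between them. A variant that removes the delimiters too could look like the following sketch; the function name `strip_delimited` and its `start`/`end` parameters are assumptions for illustration:

```python
def strip_delimited(line, start='<', end='>'):
    """Drop everything from `start` through the next `end`, delimiters included."""
    result = []
    inside = False
    for c in line:
        if inside:
            if c == end:
                inside = False  # closing delimiter: resume copying after it
        elif c == start:
            inside = True       # opening delimiter: stop copying
        else:
            result.append(c)
    return ''.join(result)

print(strip_delimited('<sdgsa>one<as<>asfd<asdf>'))  # oneasfd
```

Like the regex answers, it treats a nested < as ordinary noise and stops at the first > that follows.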