How to delete the words between two delimiters?

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.

>>> re.sub(r'<.+>', '', s)
''

Why? Because regular expression quantifiers are "greedy" by default. The .+ matches as much as it can, so the match runs on to the last > in the string (here, the end of the string) instead of stopping at the first one, and that is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).
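To see the difference directly, here is a small sketch (the variable names greedy and bounded are only for illustration) that collects what each pattern actually matches in the sample string:

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST '>'.
greedy = re.findall(r'<.+>', s)

# Negated class: [^>]+ cannot cross a '>', so each match stops at the NEXT '>'.
bounded = re.findall(r'<[^>]+>', s)

print(greedy)   # a single match covering the whole string
print(bounded)  # two separate matches, one per <...> span
```

Because the sample string starts with < and ends with >, the greedy pattern swallows all of it, which is why the substitution returns an empty string.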

The ? operator, placed after a quantifier, turns the match "non-greedy", so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

The previous one is more explicit, this one is less typing; be aware that on its own, x? means zero or one occurrence of x.
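The two patterns agree on the sample string, but they are not identical in every case. The sketch below, using made-up inputs, shows two differences that are easy to miss: the dot does not match a newline unless re.DOTALL is passed, and + requires at least one character between the delimiters.

```python
import re

multiline = 'keep <noise\nacross lines> this'
empty = 'a<>b'

# '.' does not match a newline by default, so the non-greedy
# pattern leaves a span that crosses a line break untouched.
lazy_result = re.sub(r'<.+?>', '', multiline)
lazy_dotall = re.sub(r'<.+?>', '', multiline, flags=re.DOTALL)

# [^>] matches any character except '>', newlines included.
negated_result = re.sub(r'<[^>]+>', '', multiline)

# '+' needs at least one character, so '<>' survives; '*' removes it too.
plus_result = re.sub(r'<[^>]+>', '', empty)
star_result = re.sub(r'<[^>]*>', '', empty)

print(repr(lazy_result), repr(lazy_dotall), repr(negated_result))
print(repr(plus_result), repr(star_result))
```

If the noise can span several lines or the delimiters can be empty, the negated class with * is probably the safer choice.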

Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

First, thank you Paulo Scardine; I used your regex to great effect. The idea was to have a tag-free LibreOffice po file for printing purposes. I wrote the following script, which cleans the tags out of the help file so that it is smaller and easier to read.

import re

# Read the whole file; the with-block closes it automatically.
with open('a.csv') as f:
    text = f.read()

# Replace every <...> tag with a single space.
clean = re.sub('<[^>]+>', ' ', text)

with open('b.csv', 'w') as f:
    f.write(clean)

>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '

The re.sub function takes a regular expression and replaces all of its matches in the string with the second parameter. In this case, we search for everything between < and > ( '<.*?>' ) and replace it with nothing ( '' ).
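The same idea can be wrapped in a small helper that works for other delimiters too. This is only a sketch: strip_between and its parameter names are hypothetical, not part of the re module.

```python
import re

def strip_between(text, open_delim='<', close_delim='>'):
    """Remove every span from open_delim through the next close_delim."""
    # re.escape protects delimiters that are regex metacharacters, e.g. '['.
    pattern = re.escape(open_delim) + r'.*?' + re.escape(close_delim)
    # re.DOTALL lets a span cross line breaks.
    return re.sub(pattern, '', text, flags=re.DOTALL)

print(repr(strip_between('<@ """@$ FSDF >something something <more noise>')))
print(repr(strip_between('a [x] b [y] c', '[', ']')))
```

Escaping matters as soon as a delimiter is a character such as [ or (, which would otherwise be read as part of the pattern.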

The ? is used in re for non-greedy searches.

More about the re module.


If the "noise" is actually HTML tags, I suggest you look into BeautifulSoup.
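BeautifulSoup is a third-party package, so as a stand-in the sketch below uses the standard library's html.parser to illustrate the same idea of parsing the markup rather than pattern-matching it. The TextExtractor class and strip_tags function are hypothetical names chosen for this example, and the input is assumed to be reasonably well-formed HTML.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b></p>'))
```

A real parser copes with attributes containing > and with comments, which a plain <[^>]+> pattern does not.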

Just for interest, you could write some code such as:

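# Create a small sample file to filter.
with open('blah.txt', 'w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    # Drop the characters between '<' and '>'; the delimiters themselves are kept.
    count = 0
    ignore = False
    result = []
    for c in line:
        if c == ">" and count == 1:
            # Closing delimiter: resume copying, starting with this '>'.
            count = 0
            ignore = False
        if not ignore:
            result.append(c)
        if c == "<" and count == 0:
            # Opening delimiter (already copied): skip what follows.
            ignore = True
            count = 1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f.readlines())))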

>>> 
<>one<>asfd<>
<>two<><>three<>
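The output above still contains the delimiters themselves. If they should go as well, one possible variant (a sketch, with the hypothetical name strip_delimited) tracks a single inside flag and copies only the characters seen outside a delimited span:

```python
def strip_delimited(line, open_delim='<', close_delim='>'):
    """Drop everything from open_delim through the next close_delim."""
    result = []
    inside = False
    for c in line:
        if inside:
            # Skip characters until the closing delimiter, which is dropped too.
            if c == close_delim:
                inside = False
        elif c == open_delim:
            inside = True
        else:
            result.append(c)
    return ''.join(result)

print(strip_delimited('<sdgsa>one<as<>asfd<asdf>'))
print(repr(strip_delimited('<@ """@$ FSDF >something something <more noise>')))
```

Like the original function, this treats a second < inside an open span as ordinary noise rather than as nesting.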
