How to delete the words between two delimiters?

Question

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?

Answer 1

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.

>>> re.sub(r'<.+>', '', s)
''

Why? It happens because regular expressions are "greedy" by default. The expression matches as much as it can, so .+ runs past the first > and stops only at the last > in the string, and this is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).

The ? operator makes the match "non-greedy", so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

The first pattern is more explicit; this one is less typing. Be aware that ? has another meaning when it follows an ordinary item: x? means zero or one occurrence of x.
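The three patterns discussed above can be compared side by side. This is a minimal sketch; the multi-line sample string is an assumption added for illustration. It shows one case where the two working patterns differ: by default the dot does not match a newline, while [^>] does.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'
multiline = '<a\nb>keep<c>'  # hypothetical input with a newline inside the delimiters

# Greedy: .+ runs to the last '>' and swallows the whole string.
greedy = re.sub(r'<.+>', '', s)

# Negated character class: stops at the first '>' after each '<'.
negated = re.sub(r'<[^>]+>', '', s)

# Non-greedy: .+? takes as few characters as possible.
lazy = re.sub(r'<.+?>', '', s)

# On multi-line input the patterns diverge, because '.' skips newlines by default.
negated_ml = re.sub(r'<[^>]+>', '', multiline)
lazy_ml = re.sub(r'<.+?>', '', multiline)
lazy_dotall = re.sub(r'<.+?>', '', multiline, flags=re.DOTALL)

for name, value in [('greedy', greedy), ('negated', negated), ('lazy', lazy),
                    ('negated_ml', negated_ml), ('lazy_ml', lazy_ml),
                    ('lazy_dotall', lazy_dotall)]:
    print(name, repr(value))
```

If the delimited text can span several lines, either the negated class or the re.DOTALL flag is needed.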

Answer 2

Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

Answer 3

First, thank you Paulo Scardine; I used your regular expression to great effect. The idea was to have a tag-free LibreOffice po file for printing purposes, so I made the following script, which cleans the help file into a smaller and easier-to-read one.

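import re

# Read the whole input file into memory.
with open('a.csv') as f:
    text = f.read()

# Replace every <...> tag with a single space.
clean = re.sub('<[^>]+>', ' ', text)

# Write the cleaned text to a new file.
with open('b.csv', 'w') as f:
    f.write(clean)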

Answer 4

>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '

The re.sub function takes a regular expression and replaces all the matches in the string with the second parameter. In this case, we are searching for all characters between < and > ('<.*?>') and replacing them with nothing ('').

The ? is used in re for non-greedy searches.

More about the re module.

If the "noise" is actually HTML tags, I suggest you look into BeautifulSoup.
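The answers on this page use slightly different quantifiers ('<.*?>' here, '<[^>]+>' and '<.+?>' elsewhere), and they disagree when the delimiters enclose nothing. The following is a small sketch of that difference; the sample string is an assumption made up for illustration.

```python
import re

sample = 'a<>b<c>d'  # hypothetical input containing an empty <> pair

# '*' allows zero characters, so the empty pair '<>' is removed as well.
star_lazy = re.sub(r'<.*?>', '', sample)

# '+' needs at least one non-'>' character, so '<>' is left in place.
plus_negated = re.sub(r'<[^>]+>', '', sample)

# '.+?' also needs one character, and '.' may match '>', so the match
# starting at the empty pair stretches on to the next '>'.
plus_lazy = re.sub(r'<.+?>', '', sample)

print(repr(star_lazy))     # 'abd'
print(repr(plus_negated))  # 'a<>bd'
print(repr(plus_lazy))     # 'ad'
```

Which behaviour is right depends on whether an empty pair should count as noise in your data.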
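For noise that really is HTML, a parser is usually more robust than a regular expression. BeautifulSoup is a third-party package, so the sketch below swaps in the standard library's html.parser module to stay self-contained; it is not BeautifulSoup's API. The names TextExtractor and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for text found between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML fragment with the tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```

Note that a parser expects well-formed tags; the question's sample ('<@ """@$ FSDF >') is not HTML, so the regular expression remains the better fit there.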

Answer 5

Just for interest, you could write some code such as:

with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    count=0
    ignore=False
    result=[]
    for c in line:
        if c==">" and count==1:
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        if c=="<" and count==0:
            ignore=True
            count=1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f.readlines())))

>>> 
<>one<>asfd<>
<>two<><>three<>
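As the output shows, the hand-written filter above keeps the < and > characters themselves. A possible variant that drops the delimiters too is sketched below; the function name strip_delimited and its parameters are assumptions for illustration, not part of the original answer.

```python
def strip_delimited(text, open_ch="<", close_ch=">"):
    """Remove everything from open_ch up to the next close_ch, delimiters included."""
    result = []
    inside = False  # True while scanning between open_ch and close_ch
    for ch in text:
        if not inside and ch == open_ch:
            inside = True       # start skipping; the opener is dropped
        elif inside and ch == close_ch:
            inside = False      # stop skipping; the closer is dropped
        elif not inside:
            result.append(ch)   # ordinary character outside any delimiters
    return "".join(result)


print(strip_delimited('<sdgsa>one<as<>asfd<asdf>'))  # oneasfd
print(repr(strip_delimited('<@ """@$ FSDF >something something <more noise>')))
```

Like the original, this treats a second < inside an open pair as ordinary noise rather than as a nested level.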

5 answers

Answer 1
52 Accepted 2012-01-09 05:55:25

Answer 2
14 2012-01-09 05:56:05

Answer 3
5 2013-01-19 19:03:24

Answer 4
3 2012-01-09 05:57:33

Answer 5
1 2012-01-09 06:07:17
