
How to eliminate HTML tags?

I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. In some paragraphs there are links, and I want to remove the tags:

For instance, if the text is

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

to end up with this:

A hex triplet is a six-digit, three-byte ...

A regex like this does not work:

>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>

What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with an example of BeautifulSoup's extract() (extract() deletes the tag including its text, and must be run for each tag separately):

>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" title="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" title="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>> 

Update

For people with the same question: as mentioned by Brendan Long, the answer using HTMLParser works best.
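That approach can be sketched with Python 3's standard-library html.parser (a minimal sketch; the TagStripper class name is just for illustration):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, dropping every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between (and outside) tags.
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

stripper = TagStripper()
stripper.feed('A <b>hex triplet</b> is a six-digit, '
              'three-<a href="/wiki/Byte" title="Byte">byte</a> ...')
print(stripper.get_text())  # A hex triplet is a six-digit, three-byte ...
```

Because it is a real parser rather than a regex, it handles attributes spanning multiple lines without any special casing.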

Beautiful Soup is the answer to your problem! Try it out, it's pretty awesome!

HTML parsing becomes easy once you use it.

>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.findAll(text=True))
u'A hex triplet is a six-digit, three-byte ...'

If all the text you want to extract is enclosed in some outer tag like <body> ... </body> or some <div id="X"> ... </div>, then you can do the following (this illustration assumes that all the text you want is enclosed within the <body> tag). That way you can selectively extract text from only the desired tags.

(Look at the documentation and examples and you will find many ways of parsing the DOM.)

>>> text = """<body>A <b>hex triplet</b> is a six-digit,
... three-<a href="/wiki/Byte"
... title="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.body.findAll(text=True))
u'A hex triplet is a six-digit, three-byte'

The + quantifier is greedy, meaning it will find the longest possible match. Add a ? to force it to find the shortest possible match:

>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']

Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of . :

>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\ntitle="Byte">', '</a>']

An advantage of this approach is that it will also match newlines ( \n ). You can get the same behavior with . if you add the re.DOTALL flag:

>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\ntitle="Byte">', '</a>']

To strip out the tags, use re.sub :

>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
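One caveat worth adding: after the tags are gone, any HTML entities in the text are still encoded. In Python 3 the standard library's html.unescape can decode them (a small sketch; the sample string is invented for illustration):

```python
import re
import html

text = 'A <b>hex triplet</b> such as <code>#FF0000</code> &amp; friends'
stripped = re.sub(r'<[^>]+>', '', text)  # drop the tags
clean = html.unescape(stripped)          # decode &amp; -> & etc.
print(clean)  # A hex triplet such as #FF0000 & friends
```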

This is just the basic machinery for stripping tags. To fill in the missing pieces, the \w's below stand for qualified Unicode tag names with prefix and body, which need a join() statement to build up the subexpression. The virtue of parsing HTML/XML with a regex is that it won't fail on the first ill-formed instance, which makes it perfect for fixing broken markup! The vice is that it's slow as sh*t, especially with Unicode.

Unfortunately, stripping tags destroys content, since by definition markup formats content.

Try this on a big web page. It should be translatable into Python.

$rx_expanded = '
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
';

$html =~ s/$rx_expanded/[was]/xsg;
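A rough translation of that Perl snippet into Python, using re.VERBOSE for the /x flag and re.DOTALL for /s (a sketch; the html_src sample string is made up, and the pattern inherits all the usual caveats of regex-based HTML parsing):

```python
import re

rx_expanded = re.compile(r'''
    <
    (?:
        (?:                                 # <script>/<style> blocks, contents included
            (?:script|style) \s*
          | (?:script|style) \s+ (?:".*?"|'.*?'|[^>]*?)+ \s*
        ) > .*? </ (?:script|style) \s*
      |
        (?:                                 # ordinary tags, comments, DOCTYPE
            /?\w+\s*/?
          | \w+\s+ (?:".*?"|'.*?'|[^>]*?)+ \s*/?
          | !(?:DOCTYPE.*?|--.*?--)
        )
    )
    >
''', re.VERBOSE | re.DOTALL)

html_src = '<p>Hello <script type="text/javascript">var x = 1;</script>world</p>'
print(rx_expanded.sub('[was]', html_src))  # [was]Hello [was]world[was]
```

Note how the script element is swallowed whole, contents and all, while ordinary tags are replaced individually.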
