
How to eliminate HTML tags?

I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. In some paragraphs there are links, and I want to remove the tags:

For instance, if the text is

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

to end up with this:

A hex triplet is a six-digit, three-byte ...

A regex like this does not work:

>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>

What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with an example of BeautifulSoup's extract() (extract() deletes the tag including its text, and must be run for each tag separately):

>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" title="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" title="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>> 

Update

For people with the same question: as mentioned by Brendan Long, the answer using HTMLParser works best.
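That approach can be sketched with Python 3's standard-library html.parser (a minimal sketch; the TagStripper class name is just for illustration):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, dropping every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between (and outside) tags.
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

stripper = TagStripper()
stripper.feed('A <b>hex triplet</b> is a six-digit, '
              'three-<a href="/wiki/Byte" title="Byte">byte</a> ...')
print(stripper.get_text())  # A hex triplet is a six-digit, three-byte ...
```

Because it is a real parser rather than a regex, it handles attributes spanning multiple lines without any special casing.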

Beautiful Soup is the answer to your problem! Try it out, it's pretty awesome!

HTML parsing becomes easy once you use it.

>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.findAll(text=True))
u'A hex triplet is a six-digit, three-byte ...'

If all the text you want to extract is enclosed in some outer tag like <body> ... </body> or some <div id="X"> ... </div>, then you can do the following (this illustration assumes that all the text you want is enclosed within the <body> tag). That way you can selectively extract text from only the desired tags.

(Look at the documentation and examples and you will find many ways of parsing the DOM.)

>>> text = """<body>A <b>hex triplet</b> is a six-digit,
... three-<a href="/wiki/Byte"
... title="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.body.findAll(text=True))
u'A hex triplet is a six-digit, three-byte'

The + quantifier is greedy, meaning it will find the longest possible match. Add a ? to force it to find the shortest possible match:

>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']

Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of . :

>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\ntitle="Byte">', '</a>']

An advantage of this approach is that it will also match newlines ( \n ). You can get the same behavior with . if you add the re.DOTALL flag:

>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\ntitle="Byte">', '</a>']

To strip out the tags, use re.sub :

>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
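One caveat worth adding: after the tags are gone, any HTML entities in the text are still encoded. In Python 3 the standard library's html.unescape can decode them (a small sketch; the sample string is invented for illustration):

```python
import re
import html

text = 'A <b>hex triplet</b> such as <code>#FF0000</code> &amp; friends'
stripped = re.sub(r'<[^>]+>', '', text)  # drop the tags
clean = html.unescape(stripped)          # decode &amp; -> & etc.
print(clean)  # A hex triplet such as #FF0000 & friends
```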

This is just the basic machinery for stripping tags. To fill in the missing pieces, the \w's below stand for qualified Unicode tag names with prefix and body, which need a join() statement to build up the subexpression. The virtue of parsing HTML/XML with a regex is that it won't fail on the first ill-formed instance, which makes it perfect for fixing broken markup! The vice is that it's slow as sh*t, especially with Unicode.

Unfortunately, stripping tags destroys content, since by definition markup formats content.

Try this on a big web page. It should be translatable into Python.

$rx_expanded = '
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
';

$html =~ s/$rx_expanded/[was]/xsg;
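A rough translation of that Perl snippet into Python, using re.VERBOSE for the /x flag and re.DOTALL for /s (a sketch; the html_src sample string is made up, and the pattern inherits all the usual caveats of regex-based HTML parsing):

```python
import re

rx_expanded = re.compile(r'''
    <
    (?:
        (?:                                 # <script>/<style> blocks, contents included
            (?:script|style) \s*
          | (?:script|style) \s+ (?:".*?"|'.*?'|[^>]*?)+ \s*
        ) > .*? </ (?:script|style) \s*
      |
        (?:                                 # ordinary tags, comments, DOCTYPE
            /?\w+\s*/?
          | \w+\s+ (?:".*?"|'.*?'|[^>]*?)+ \s*/?
          | !(?:DOCTYPE.*?|--.*?--)
        )
    )
    >
''', re.VERBOSE | re.DOTALL)

html_src = '<p>Hello <script type="text/javascript">var x = 1;</script>world</p>'
print(rx_expanded.sub('[was]', html_src))  # [was]Hello [was]world[was]
```

Note how the script element is swallowed whole, contents and all, while ordinary tags are replaced individually.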
