
Python/HTML: strip out only the hyperlinked text

So I have been trying to remove the HTML from

<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#

so that it reads like this after parsing:

dubstep - the music that is created from transformers having S$#

I want to extract the text "dubstep" from this HTML hyperlink.

How would I go about doing this?

I read the solution over here, "How to remove tags from a string in python using regular expressions? (NOT in HTML)",

but I get

<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)

Well,

 NameError("name 're' is not defined",),

means you forgot to import re at the beginning of your script, but this is only a guess.

Also, since you only need the word between the <a></a> tags, you need a regexp similar to this:

 .*<a .*>([^<]*)</a>.*
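For illustration, here is a minimal sketch of how that pattern might be used, assuming the HTML fits on one line and contains a single link. The variable names (html, pattern, link_text) are made up for this example and are not from the question.

```python
import re  # without this import, any use of re raises NameError: name 're' is not defined

# The sample HTML from the question, as a Python string.
html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a>'
        ' the music that is created from transformers having s$#')

# Same pattern as above; the parentheses capture the text between <a ...> and </a>.
pattern = re.compile(r'.*<a .*>([^<]*)</a>.*')

match = pattern.match(html)
# group(1) is the captured link text; fall back to None if the pattern does not match.
link_text = match.group(1) if match else None
print(link_text)
```

With more than one link, or with nested tags inside the link, a pattern like this is likely to give surprising results, which is one reason an HTML parser tends to be the safer choice.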

Why not use BeautifulSoup?

In [44]: from bs4 import  BeautifulSoup

In [45]: soup = BeautifulSoup ('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''')

In [46]: soup.find('a').text
Out[46]: u'dubstep'

EDIT:

Or, if you just want all of the text:

In [48]: soup.text 
Out[48]: u'dubstep the music that is created from transformers having s$#'
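If installing bs4 is not an option, something close to soup.find('a').text can be sketched with the standard library's html.parser module instead. This is a swapped-in technique, not BeautifulSoup itself, and the class name LinkTextParser and its attributes are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside <a>...</a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_link = False   # True while the parser is inside an <a> element
        self.link_texts = []   # one entry per <a> element encountered

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self.in_link:
            self.link_texts[-1] += data


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a>'
        ' the music that is created from transformers having s$#')

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.link_texts)
```

Unlike the regular expression approach, this collects the text of every link in the input, one list entry per <a> element.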

Use this:

from bs4 import BeautifulSoup  # the class name has a capital "S"

# The HTML has to be a quoted string.
html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
