Python: strip HTML, keeping only the hyperlinked text
So I have been trying to remove the HTML from
<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#
so that it reads like this once parsed:
dubstep - the music that is created from transformers having S$#
I want to extract the text dubstep from this HTML hyperlink.
How would I go about doing this?
I read the solution over here: How to remove tags from a string in python using regular expressions? (NOT in HTML)
but I get
<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)
Well, NameError("name 're' is not defined",) means you forgot to import re at the beginning, but this is a guess.
Also, since you only need the word between the <a></a> tags, you need a regexp similar to this:
.*<a .*>([^<]*)</a>.*
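A minimal sketch of how that pattern might be applied, assuming the snippet sits on a single line and contains one link. The helper name link_text is hypothetical and only used for illustration; note the import re at the top, which is what the NameError was complaining about.

```python
import re

# Pattern from the answer above: capture the text between <a ...> and </a>.
LINK_TEXT = re.compile(r'.*<a .*>([^<]*)</a>.*')


def link_text(html):
    """Return the link text of a single-line snippet, or None if no link matches."""
    match = LINK_TEXT.match(html)
    return match.group(1) if match else None


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')
print(link_text(html))  # dubstep
```

Regular expressions are fragile on real-world HTML (nested tags, several links per line, newlines inside the markup), so this is only reasonable for small, predictable snippets like the one in the question.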
Why not use BeautifulSoup?
In [44]: from bs4 import BeautifulSoup
In [45]: soup = BeautifulSoup ('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''')
In [46]: soup.find('a').text
Out[46]: u'dubstep'
EDIT:
Or if you just want the text:
In [48]: soup.text
Out[48]: u'dubstep the music that is created from transformers having s$#'
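Use this:
from bs4 import BeautifulSoup

# The HTML snippet has to be a string literal
html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# get_text() returns all of the text with the tags removed
print(soup.get_text())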
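If installing bs4 is not an option, the same idea (keep only the text inside the <a> tags) can be sketched with the standard library's html.parser module. This is an alternative technique, not something proposed in the answers above, and the names LinkTextParser and extract_link_texts are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside an <a> element
        self._parts = []       # text fragments of the link being read
        self.link_texts = []   # one entry per completed link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The outermost </a> closes the link: store its text.
                self.link_texts.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        # Text outside of any link is ignored.
        if self._depth:
            self._parts.append(data)


def extract_link_texts(html):
    """Return a list with the text of every link in the given HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.link_texts


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')
print(extract_link_texts(html))  # ['dubstep']
```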