Python HTML: text of the hyperlink only

Question

I have been trying to remove the HTML from this:

<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#

so that it reads like this after parsing:

dubstep - the music that is created from transformers having S$#

I want to extract the text "dubstep" from this HTML hyperlink.

How would I go about doing this?

I read the solution in "How to remove tags from a string in python using regular expressions? (NOT in HTML)",

but I get:

<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)

Answer 1

Well,

 NameError("name 're' is not defined",),

means you forgot to import re at the beginning, but this is a guess.

Also, since you only need the word between the <a></a> tags, you need a regexp similar to this:

 .*<a .*>([^<]*)</a>.*
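
Putting both points of this answer together, a minimal sketch could look like the following. It assumes the input is the single-link string from the question; the names html, LINK_TEXT and link_text are only illustrative. The pattern is a slightly tighter variant of the one above ([^>]* instead of .*), so the match cannot run past the end of the opening tag. Regular expressions remain fragile on arbitrary HTML, so treat this as suitable for simple, known input only.

```python
import re  # without this import, any use of re raises NameError

html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')

# Match one opening <a ...> tag, capture the text up to the next '<',
# then require the closing </a>.
LINK_TEXT = re.compile(r'<a\s[^>]*>([^<]*)</a>')

match = LINK_TEXT.search(html)
link_text = match.group(1) if match else None  # None when no link is found
print(link_text)
```

Using search rather than match means the link does not have to sit at the very start of the string.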

Answer 2

Why not use BeautifulSoup?

In [44]: from bs4 import  BeautifulSoup

In [45]: soup = BeautifulSoup ('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''')

In [46]: soup.find('a').text
Out[46]: u'dubstep'

EDIT:

Or if you just want the text:

In [48]: soup.text 
Out[48]: u'dubstep the music that is created from transformers having s$#'

Answer 3

Use this:

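from bs4 import BeautifulSoup

html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.get_text())  # all text with the tags stripped, not only the link text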
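
If installing bs4 is not an option, the same idea (let a parser find the <a> tag instead of a regular expression) can be sketched with the standard library's html.parser module. This is not from any of the answers above; the class name LinkTextParser and its link_texts attribute are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.link_texts = []  # one entry per <a> element seen

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a link.
        if self._in_link:
            self.link_texts[-1] += data


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.link_texts)
```

Unlike get_text(), this keeps only the hyperlink text and ignores the definition that follows the link.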

3 answers

Answer 1
0 2014-05-24 20:15:56

Answer 2
0 accepted 2014-05-24 20:23:38

Answer 3
0 2014-05-24 20:35:27
