Getting data between two tags in Python

Question

<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

Using Python, I want to get the text inside the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set

I tried using lxml:

from io import StringIO
from lxml import etree

# html holds the markup shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

and got the following output:

['\r\n\t\t', '\r\n\t\t\t\t\t\t\t\t\tgranular computing based', 'data', 'mining', 'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

Answer 1

You could use the text_content method:

import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())

yields

Granular computing based
data
mining
in the views of rough set and fuzzy set

or, to collapse the extra whitespace, you could use

print(' '.join(elt.text_content().split()))

to obtain

Granular computing based data mining in the views of rough set and fuzzy set
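
The one-liner works because str.split() with no arguments splits on any run of whitespace (spaces, tabs, \r, \n) and discards empty pieces, so joining the pieces with a single space normalizes the text. A minimal sketch of that step on its own; the helper name collapse_ws and the sample string are made up for illustration:

```python
def collapse_ws(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    # split() with no separator treats consecutive whitespace as one delimiter
    return ' '.join(text.split())


# Hypothetical sample resembling the raw text nodes shown in the question
raw = '\r\n\t\tGranular computing based\r\n\t\tdata\r\n\t\tmining\r\n'
print(collapse_ws(raw))
```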

Here is another option which you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

yields

Granular computing based data  mining in the views of rough set and fuzzy set

(Note, however, that it leaves an extra space between data and mining: the whitespace-only text node between the two span tags strips to an empty string, which still gets joined.)

'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It selects text nodes at every depth below the a element: children, grandchildren, and so on.
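
As a rough check of that claim, the sketch below runs both XPath expressions against the HTML from the question and compares the text nodes they select. It also drops whitespace-only nodes before joining, which is one possible way to avoid the extra space noted above. The variable names (union, general, pieces, title) are illustrative only.

```python
import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)

# Explicit union of two levels versus the general descendant axis
union = root.xpath('//a/child::text() | //a/span/child::text()')
general = root.xpath('//a/descendant-or-self::text()')
print(sorted(union) == sorted(general))

# Strip each text node and skip the ones that are only whitespace
pieces = [t.strip() for t in general if t.strip()]
title = ' '.join(pieces)
print(title)
```

For this particular document the two expressions select the same text nodes; the descendant form would keep working if the text were nested more deeply.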

Answer 2

With BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> # html is the HTML string from the question
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set

Explanation:

BeautifulSoup parses the HTML, making it easily accessible. soup.h3 accesses the h3 tag in the HTML.

.text returns all the text inside the h3 tag, including the text of nested tags such as the spans, with the tag markup itself left out.

I use split() here to get rid of the excess whitespace and newlines, then " ".join() because split() returns a list.
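
BeautifulSoup can also do the stripping and joining itself. The sketch below is an alternative to the split/join approach, not part of the original answer: it uses get_text() with a separator and strip=True, and assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and skips the empty ones;
# the first argument is the separator placed between fragments
title = soup.h3.get_text(" ", strip=True)
print(title)
```

Note that this only trims the ends of each fragment, so whitespace runs inside a single text node would survive; the split/join version handles those as well.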

Answer 1
3, accepted, 2013-05-26 11:27:11

Answer 2
1, 2013-05-26 11:16:26
