Get data between two tags in Python
<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>
Using Python, I want to get the values from the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set
I tried using lxml:
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)
and am getting the following output:
['\r\n\t\t', '\r\n\t\t\t\t\t\t\t\t\tgranular computing based', 'data', 'mining', 'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']
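As a side note (a sketch, not from the original post): the XPath already returns the right text nodes, so stripping each one and dropping the whitespace-only entries yields the expected sentence.

```python
# Text nodes returned by the question's XPath, as printed above.
raw_response = [
    '\r\n\t\t',
    '\r\n\t\t\t\t\t\t\t\t\tgranular computing based',
    'data',
    'mining',
    'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t',
]

# Strip surrounding whitespace and discard nodes that were whitespace only.
joined = ' '.join(t.strip() for t in raw_response if t.strip())
print(joined)
# granular computing based data mining in the view of roughset and fuzzyset
```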
You could use the text_content method:
import lxml.html as LH
html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''
root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())
yields
Granular computing based
data
mining
in the views of rough set and fuzzy set
or, to remove whitespace, you could use
print(' '.join(elt.text_content().split()))
to obtain
Granular computing based data mining in the views of rough set and fuzzy set
Here is another option which you might find useful:
print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))
yields
Granular computing based data mining in the views of rough set and fuzzy set
(Note, however, that it leaves an extra space between data and mining.)
'//a/descendant-or-self::text()' is a more generalized version of "//a/child::text() | //a/span/child::text()". It will iterate through all children, grandchildren, and so on.
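For completeness, the same traversal can be sketched with the standard library's html.parser, with no third-party dependency (this is an addition, not part of the original answer): walk the document and collect every non-whitespace text node that appears inside the <a> element.

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collects non-whitespace text found inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        # Keep only text nodes inside <a> that are not pure whitespace.
        if self.in_a and data.strip():
            self.parts.append(data.strip())

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

parser = AnchorText()
parser.feed(html)
result = ' '.join(parser.parts)
print(result)
```

Because whitespace-only nodes are skipped before joining, this variant also avoids the extra space between data and mining noted above.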
With BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> html = (the html you posted above)
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set
Explanation:
BeautifulSoup parses the HTML, making it easily accessible. soup.h3 accesses the h3 tag in the HTML.
.text, simply, gets everything from the h3 tag, excluding all the other tags such as the spans.
I use split() here to get rid of the excess whitespace and newlines, then " ".join(), as the split function returns a list.