How to extract a number more gracefully in Python using XPath and regular expressions
I have a small HTML snippet from which I want to extract just a number, actually a grade. I am using Python with scrapy and re.
My code works, but it is far from elegant.
Here is the HTML snippet, from which I just want to get the 2:
<div id="left">
<div class="0"><b>Certificate:</b></div>
<div class="1">
<div></div>
<div>
<a class="link" href="new.html">Maths</a> (First) Grade 2<br>
</div>
</div>
<div class="2"></div>
</div>
And here is how I solved it so far:
! note = sel.xpath('//*[@id="left"]/div[2]/div[2]/text()[2]').extract()
! print note
> [u'\xa0(First)\xa0\xa0\xa0Grade 2']
! note_string = ''.join(note)
! note_only = re.search(r'\d+', note_string).group()
> 2
It's certainly not best practice to convert a list to a string just to extract such a tiny piece of information.
How can I do better?
You can use the following XPath expression to get the 2:
substring-after(//*[@id="left"]/div[2]/div[2]/text()[2], "Grade ")