Extracting text/numbers from HTML list using Python requests and lxml

I am trying to extract the 'Seller rank' from items on Amazon using Python requests and lxml. So:

<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b> 

957,875 in Books (<a href="http://www.amazon.co.uk/gp/bestsellers/books/ref=pd_dp_ts_b_1">See Top 100 in Books</a>)

From this example, 957875 is the number I want to extract.

(Please note, the actual HTML has about 100 blank lines between 'Amazon Bestsellers Rank:' and '957,875'. I am unsure whether this is affecting my result.)

My current Python code is set up like so:

import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
salesrank = tree.xpath('//li[@id="SalesRank"]/text()')
print('Sales Rank:', salesrank)

and the printed output is Sales Rank: []

I was expecting to receive the full contents of the list item, including all the blank lines, which I would parse later. Am I correct in assuming that /text() is not the right thing to use in this instance and that I need something else? Any help is greatly appreciated.
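
For reference, here is a minimal check of what /text() returns for markup like the snippet above. It is only a sketch: the snippet is pasted into a string instead of being downloaded, and the closing </li> and the surrounding <ul> are assumptions, since the excerpt in the question is cut off before them.

```python
from lxml import html

# Markup copied from the question; the <ul> wrapper and the closing </li>
# are assumptions added so that the fragment is complete.
SNIPPET = """
<ul>
<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b>


957,875 in Books (<a href="http://www.amazon.co.uk/gp/bestsellers/books/ref=pd_dp_ts_b_1">See Top 100 in Books</a>)
</li>
</ul>
"""

tree = html.fromstring(SNIPPET)

# /text() selects only the text nodes that are direct children of the <li>,
# so the text inside <b> and <a> is not included.
text_nodes = tree.xpath('//li[@id="SalesRank"]/text()')

# Drop whitespace-only nodes and trim the remaining ones.
cleaned = [t.strip() for t in text_nodes if t.strip()]
print(cleaned)
```

When the element is present in the parsed HTML, the expression returns its text nodes, blank lines included. An empty list therefore suggests that no li with id="SalesRank" was found in the HTML that was parsed.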

You are getting an empty list because a single call to the URL is not giving you the complete data of the web page. Instead, stream the response, read the data in small chunks, and look for the required element in each non-empty chunk. The code for this is:

import re
import requests as rq
from bs4 import BeautifulSoup as bs

# stream=True defers downloading the body so it can be read chunk by chunk
r = rq.get('http://www.amazon.in/gp/product/0007950306/ref=s9_al_bw_g14_i1?pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-3&pf_rd_r=1XBKB22RGT2HBKH4K2NP&pf_rd_t=101&pf_rd_p=798805127&pf_rd_i=4143742031', stream=True)

for chunk in r.iter_content(chunk_size=1024):
    if chunk:
        # Parse each chunk on its own and look for the SalesRank list item
        soup = bs(chunk, 'html.parser')
        elem = soup.find_all('li', attrs={'id': 'SalesRank'})
        if elem:
            # Match "#<digits and commas> in", then keep the part before " in"
            s = re.findall(r'#[\d,]+\sin', str(elem[0]))
            if s:
                print(s[0].split()[0])
                break
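
Once the HTML that contains the SalesRank item is available, the number itself can be pulled out as in the following minimal sketch. It assumes the markup looks like the snippet in the question; the helper name extract_sales_rank and the sample string are hypothetical and used only for illustration. It parses the whole document with lxml rather than individual chunks, which is a different technique from the chunked loop above.

```python
import re
from lxml import html


def extract_sales_rank(page_source):
    """Return the rank as an int, or None if it cannot be found."""
    tree = html.fromstring(page_source)
    items = tree.xpath('//li[@id="SalesRank"]')
    if not items:
        return None
    # text_content() joins the text of the <li> and all of its children,
    # blank lines included.
    text = items[0].text_content()
    # First run of digits and commas followed by " in", e.g. "957,875 in Books";
    # a leading "#" is allowed but not required.
    match = re.search(r'#?([\d,]*\d)\s+in\b', text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


# Hypothetical sample modelled on the question, with 100 blank lines.
sample = (
    '<ul><li id="SalesRank"><b>Amazon Bestsellers Rank:</b>'
    + '\n' * 100
    + '957,875 in Books (<a href="http://www.amazon.co.uk/gp/bestsellers/books/'
    'ref=pd_dp_ts_b_1">See Top 100 in Books</a>)</li></ul>'
)
print(extract_sales_rank(sample))  # prints 957875
```

Returning None when the item is missing makes it easy to tell "the element is not in this page" apart from a parsing problem, which is the situation the empty list in the question points to.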
