
Scraping Website Needing Interaction

I'm working on a scraping project that looks at what recycling companies in the UK offer for different products.

I've run into a problem with this website:

http://www.musicmagpie.co.uk/entertainment/

I have a list of barcodes whose buy prices I want to look up (enter the barcode into the search box and hit the 'Add' button). I've managed to get a Selenium WebDriver working, but it's very slow, and I can't run through many barcodes before the website fails on me and kills my process at some point.

I'm aiming for about 1 search per second; at the moment it's taking 5+ seconds on average. This is the code I'm running:

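import time
from selenium import webdriver

# EANs (the barcode list), ProductName and BuyPrice (result lists) and start
# (a timer value) are assumed to be defined earlier in the script.
driver = webdriver.Chrome(r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe")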
driver.get('http://www.musicmagpie.co.uk/start-selling/basket-media')

countx = 0
count = 0
for EAN in EANs:
    countx += 1
    count += 1

    if count % 200 == 0:
        driver.close()
        driver = webdriver.Chrome(r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe")
        driver.get('http://www.musicmagpie.co.uk/start-selling/basket-media')
        count = 1

    driver.find_element_by_xpath("""//*[@id="txtBarcode"]""").send_keys(str(EAN))

    # If a popup window is open, the first click fails; the handler then closes the popup.
    try:
        driver.find_element_by_xpath("""//*[@id="getValSmall"]""").click()
    except Exception:
        driver.find_element_by_xpath("""//*[@id="gform_close"]""").click()

    prodnames = driver.find_elements_by_xpath("""//div[@class='col_Title']""")
    if len(prodnames) == count:
        ProductName.append(prodnames[0].text)
        BuyPrice.append(driver.find_elements_by_xpath("""//div[@class='col_Price']""")[0].text)
    else:
        ProductName.append('nan')
        BuyPrice.append('nan')
        count = len(prodnames)

    elapsed = time.clock()    
    print('MusicMagpieScraper:', EAN, '--', countx, '/', len(EANs), '--', (elapsed - start), 's')

driver.close()

I've got some experience using urllib and parsing with BeautifulSoup, and would prefer to switch over to that. However, I don't know how to extract the data without the webdriver doing the clicks.

Any advice/tips would be very much appreciated!

Added:

The add button link is:

__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall','')
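
For context, `__doPostBack(target, argument)` is the standard ASP.NET Web Forms helper: it copies its two arguments into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` form fields and then submits the page's form. Below is a minimal sketch of pulling those two values out of a link like the one above. The helper name `parse_dopostback` and the regular expression are illustrative assumptions, not part of the site's code.

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_dopostback(link):
    """Return the (__EVENTTARGET, __EVENTARGUMENT) pair from a __doPostBack link.

    Hypothetical helper for illustration; raises ValueError if no call is found.
    """
    match = _POSTBACK_RE.search(link)
    if match is None:
        raise ValueError("no __doPostBack call found")
    return match.group(1), match.group(2)


link = ("__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault"
        "$mainContent$tabbedMediaVal_10$getValSmall','')")
target, argument = parse_dopostback(link)
print(target)
```

The first value is what a plain HTTP client would send as `__EVENTTARGET` to imitate the button click.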

This is the form data I found being posted:

{name: "__EVENTTARGET", value: ""}
{name: "__EVENTARGUMENT", value: ""}
{name: "__VIEWSTATE", value: "/wEPDwUENTM4MQ9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWBGYPZB…uZSAhaW1wb3J0YW50O2RkQweS+jvDtjK8er7dCKBBRwOWWuE="}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$signIn_8$hdn_BasketValue", value: "2"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$wtmBarcode_ClientState", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$txtSearch", value: "Enter item (e.g. iPhone 5)"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$wmSearch_ClientState", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$LegoVal_12$ddlLego", value: "-999"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher_sm", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher", value: ""}
{name: "__SCROLLPOSITIONX", value: "0"}
{name: "__SCROLLPOSITIONY", value: "0"}
{name: "hiddenInputToUpdateATBuffer_CommonToolkitScripts", value: "1"}

The txtBarcode entry is where the barcode is input:

{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"}
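
One way to reproduce this form submission without a browser is to copy the page's hidden inputs (such as `__VIEWSTATE`) and then fill in the event target and the barcode. The sketch below does that with the standard library's `html.parser` so it runs without extra installs. `HiddenInputCollector` and `build_payload` are hypothetical names, and whether this site actually requires the hidden fields to be echoed back is an assumption not confirmed in this post.

```python
from html.parser import HTMLParser

# Shared prefix of the control names shown in the form data above.
PREFIX = ("ctl00$ctl00$ctl00$ContentPlaceHolderDefault"
          "$mainContent$tabbedMediaVal_10$")


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_payload(page_html, barcode):
    """Build the POST data for one barcode lookup (illustrative helper)."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)               # e.g. __VIEWSTATE
    payload["__EVENTTARGET"] = PREFIX + "getValSmall"
    payload["__EVENTARGUMENT"] = ""
    payload[PREFIX + "txtBarcode"] = str(barcode)
    return payload


# Tiny stand-in for the real page, used only to demonstrate the helper.
sample = ('<form><input type="hidden" name="__VIEWSTATE" value="abc123">'
          '<input type="text" name="q"></form>')
payload = build_payload(sample, 5051275026429)
```

The resulting dictionary can be passed as form-encoded data to any HTTP client.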

Hopefully this is useful info. I don't know where to go from here, and Google hasn't helped much.

I managed to find a solution to this using requests:

    import requests
    from bs4 import BeautifulSoup

    # ean is the barcode currently being looked up.
    get_response = requests.get(url='http://www.musicmagpie.co.uk/start-selling/')
    post_data = {
        '__EVENTTARGET': 'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall',
        '__EVENTARGUMENT': '',
        'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode': ean,
    }
    # POST the form-encoded data, imitating a click on the 'Add' button.
    post_response = requests.post(url='http://www.musicmagpie.co.uk/start-selling/', data=post_data)

    soup = BeautifulSoup(post_response.text, "lxml")

    # Pull the first price and title out of the returned page.
    BuyPrice = soup.find('div', class_='col_Price').text.rstrip()
    ProductName = soup.find('div', class_='col_Title').text.rstrip()

This code sends a dictionary of form field names and values (may not be the correct terminology!), and the site fires back an easy-to-parse response from which I pulled the data I wanted!
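
To run this over a whole list of barcodes, a loop along the following lines could work. It is a sketch under stated assumptions: `scrape_all` and its `lookup` parameter are hypothetical names, where `lookup` stands for a function wrapping the request-and-parse code above and returning `(product_name, buy_price)`. It keeps the `'nan'` placeholder from the original Selenium loop for barcodes that yield no result (for instance when `soup.find(...)` returns `None` and `.text` raises `AttributeError`) and pauses between requests to stay near the 1 search per second target.

```python
import time


def scrape_all(eans, lookup, delay=1.0, sleep=time.sleep):
    """Look up every barcode, recording 'nan' when a lookup fails.

    `lookup` is a hypothetical callable taking a barcode and returning
    (product_name, buy_price); it is expected to raise on a missing result.
    """
    results = {}
    for index, ean in enumerate(eans):
        if index:          # pause between requests, not before the first one
            sleep(delay)
        try:
            results[ean] = lookup(ean)
        except Exception:  # e.g. AttributeError when soup.find(...) returns None
            results[ean] = ("nan", "nan")
    return results


# Stand-in lookup used only to demonstrate the loop without network access.
def fake_lookup(ean):
    if ean == "0000000000000":
        raise AttributeError("no col_Price div in response")
    return ("Example title", "0.50")


demo = scrape_all(["5051275026429", "0000000000000"], fake_lookup, delay=0)
```

Passing the lookup in as a parameter keeps the loop testable without hitting the site; the `delay` value is a guess at a polite rate, not something the site documents.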
