
Scraping a website with infinite scrolling using Selenium and BeautifulSoup returns repeated elements

I have a script that uses Selenium and BeautifulSoup to scrape this website: 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'

But my script keeps printing the first 8 elements of the page and ignores the content that appears when scrolling. This is the script:

# -*- coding: utf-8 -*-
from urllib import urlopen
from bs4 import BeautifulSoup as BS
import unicodecsv as ucsv
import re 
from selenium import webdriver
import time 

with open('list1.csv','wb') as f:
    w = ucsv.writer(f, encoding='utf-8-sig')

driver = webdriver.Chrome(r'C:\Users\V\Desktop\PY\web_scrape\chromedriver.exe')
base_url = 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
driver.get(base_url)
pageSource = driver.page_source
lst = []
for n in range(10): 
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
        lst.append(title_)
    print lst
    time.sleep(5)

The output for each scroll is:

[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']

After the first scroll the list has 8 elements; after the second scroll it has 16, and the extra 8 are repeats of the first 8. The same thing happens on every later scroll. So the script only returns the same 8 elements even though I use Selenium to scroll the site, but I want it to print all the elements that load while scrolling. I would really appreciate any advice.

The problem is in this part:

items = container.findAll('div', {'class' : 'item-inner'})
for item in items:
    title = item.find('div', {'class' : 'item-price'}).text
    title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
    lst.append(title_)

Each time you "scroll", the items object becomes one block bigger, because the content loaded earlier does not go away when you scroll. To avoid duplicates, you need to drop the first n-1 blocks from items and keep only the newly loaded ones.
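One way to drop the already-processed blocks is to remember how many items have been handled and slice them off on each pass. The following is only a sketch in Python 3: the helper name take_new and the sample price lists are made up for illustration, and it assumes the page only ever appends new items at the end without reordering the earlier ones.

```python
def take_new(items, seen_count):
    """Return the items not handled yet and the updated number of handled items."""
    fresh = items[seen_count:]
    return fresh, seen_count + len(fresh)


# Hypothetical snapshots of the parsed items after three scrolls.
snapshots = [
    ["0.21", "0.45"],
    ["0.21", "0.45", "1.10", "3.60"],
    ["0.21", "0.45", "1.10", "3.60", "0.21"],
]

lst = []
seen = 0
for items in snapshots:
    fresh, seen = take_new(items, seen)
    lst.extend(fresh)  # only the newly loaded entries are appended

print(lst)
```

Because the comparison is by position rather than by value, two different products that happen to share a price (the second "0.21" here) are both kept.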

There are two possibilities:

  1. let the infinite scroll finish and then get the data;
  2. after every content reload, compare the new data with the data you already have and add only what is missing to the list.
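The first possibility can be sketched as a loop that keeps scrolling until the page height stops growing, after which the page is parsed once. This is a rough Python 3 sketch, not a definitive implementation: the function name scroll_to_end and the pause and max_rounds parameters are assumptions, and it relies only on the driver's execute_script method, which the script in the question already uses.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=50):
    """Scroll down until document.body.scrollHeight stops changing.

    Returns the number of scroll rounds performed. max_rounds is a safety
    limit for pages that never stop loading.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next block of content to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was added, so the end was reached
        last_height = new_height
    return max_rounds
```

After scroll_to_end(driver) returns, driver.page_source can be read once and passed to BeautifulSoup, so every item is parsed exactly one time. A fixed pause is a simplification; if the site loads slowly, the height may appear unchanged before the next block arrives.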

I have found an answer to the problem: read pageSource inside the loop. Also, the Chrome window has to stay open instead of being minimized to the taskbar; alternatively, you could use PhantomJS instead of the Chrome driver.

for n in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the newly requested content time to load
    pageSource = driver.page_source  # re-read the page after every scroll
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        # keep ASCII characters only and drop newlines
        title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
        lst.append(title_)
    print len(lst)

Now the output changes. Instead of

8
8
8
8

it will print

16
20
28
...
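The reason moving page_source into the loop matters is that it returns a snapshot string of the page at the moment it is read; a copy taken before the loop never sees content loaded later. The small Python 3 simulation below illustrates this with a stand-in object. FakePage is hypothetical and only mimics the two driver members used in the scripts above (execute_script and page_source); it is not Selenium itself.

```python
class FakePage(object):
    """Hypothetical stand-in for a driver: each scroll adds one block of content."""

    def __init__(self):
        self._blocks = 1

    def execute_script(self, script):
        self._blocks += 1  # pretend the scroll triggered another block to load

    @property
    def page_source(self):
        return "<div>item</div>" * self._blocks


driver = FakePage()
stale = driver.page_source  # captured once, before the loop (as in the question)

stale_counts, fresh_counts = [], []
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    stale_counts.append(stale.count("item"))                # never changes
    fresh_counts.append(driver.page_source.count("item"))   # grows with each scroll

print(stale_counts)
print(fresh_counts)
```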

