Scraping a website with infinite scrolling using Selenium and BeautifulSoup returns repeated elements
So I have a script that uses Selenium and BeautifulSoup to scrape this website: 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
But my script keeps printing the first 8 elements of the page and ignores the content that appears when scrolling. This is the script:
# -*- coding: utf-8 -*-
from urllib import urlopen
from bs4 import BeautifulSoup as BS
import unicodecsv as ucsv
import re
from selenium import webdriver
import time

with open('list1.csv','wb') as f:
    w = ucsv.writer(f, encoding='utf-8-sig')
    driver = webdriver.Chrome('C:\Users\V\Desktop\PY\web_scrape\chromedriver.exe')
    base_url = 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
    driver.get(base_url)
    pageSource = driver.page_source
    lst = []
    for n in range(10):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        soup = BS(pageSource, 'lxml')
        container = soup.find('div', {'class' : 'container'})
        items = container.findAll('div', {'class' : 'item-inner'})
        for item in items:
            title = item.find('div', {'class' : 'item-price'}).text
            title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
            lst.append(title_)
        print lst
        time.sleep(5)
The output for each scroll is:
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
On the first scroll the list has 8 elements; on the second it has 16, and the extra 8 are repeats of the first 8. The same thing happens on the remaining scrolls.
So the script only returns the same 8 elements even though I use Selenium to scroll the site, but I want it to print all the elements as it scrolls. I would really appreciate it if you could give me some advice.
The problem is in this part:
items = container.findAll('div', {'class' : 'item-inner'})
for item in items:
    title = item.find('div', {'class' : 'item-price'}).text
    title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
    lst.append(title_)
Each time you "scroll", the items object grows by one block, because the upper content doesn't go away when you scroll. To avoid duplication, you need to drop the first n-1 blocks of item entries from items, i.e. everything an earlier pass has already handled.
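To make that idea concrete, here is a minimal sketch (not the asker's exact code) that remembers how many entries were already handled and only appends the new tail. The helper name `take_new`, the `seen_count` counter and the `snapshots` list are made up for illustration; `snapshots` stands in for what `items` would contain after each scroll, and the sketch assumes the page only ever appends new entries at the end.

```python
def take_new(items, seen_count):
    """Return only the entries that were not present on earlier passes."""
    return items[seen_count:]


# Hypothetical stand-in for the `items` list after each scroll:
# every snapshot still contains everything from the previous one.
snapshots = [
    ['0.21', '0.45'],
    ['0.21', '0.45', '1.10'],
    ['0.21', '0.45', '1.10', '3.60', '2.20'],
]

lst = []
seen_count = 0
for items in snapshots:
    fresh = take_new(items, seen_count)  # skip what was already collected
    lst.extend(fresh)
    seen_count = len(items)              # remember how far we have read

print(lst)
```

Slicing by position rather than filtering through a set keeps legitimate repeats, since two different products can have the same price.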
There are two possibilities:
I found an answer to the problem: move the pageSource assignment into the loop, so the page source is read again after every scroll. Also, instead of leaving the Chrome window hidden in the taskbar, you have to keep it open, or you could use PhantomJS instead of the Chrome driver.
for n in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the newly loaded items time to appear
    pageSource = driver.page_source  # re-read the page after each scroll
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
        lst.append(title_)
    print len(lst)
Now the output will change. Instead of
8
8
8
8
it will print
16
20
28
...
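If you don't need output after every scroll, another option is to scroll first and parse only once at the end. This is a rough sketch, assuming the page stops growing once no more items load; `scroll_to_bottom`, `pause` and `max_scrolls` are hypothetical names, not part of Selenium. It only relies on `driver.execute_script` and `driver.page_source`, which the code above already uses.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_scrolls=10):
    """Scroll until the page height stops changing, then return the page source.

    `driver` is expected to be a Selenium WebDriver (or anything exposing
    execute_script and page_source). The stop condition is an assumption
    about how the page behaves, not something guaranteed for every site.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch of items to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return driver.page_source
```

After it returns, you can pass the result to BS(pageSource, 'lxml') a single time, so every item is parsed once and nothing gets appended twice.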