
Instagram web scraping with Selenium Python problem

I have a problem scraping all the pictures from an Instagram profile. I scroll the page to the bottom and then find all "a" tags, but I always end up with only the last 30 picture links. I think the driver doesn't see the full content of the page.

#scroll
scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
match=False
while(match==False):
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    if last_count==scrolldown:
        match=True

#posts
posts = []
time.sleep(2)
links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)

It looks like you first scroll to the bottom of the page and only then collect the links, instead of collecting and processing them inside the scrolling loop.
So, if you want to get all the links, you should run

links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)

inside the scrolling loop, and also once before the first scroll.
Something like this:

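import time
from selenium.webdriver.common.by import By

# Scroll to the bottom and return the page height at that moment.
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"

def get_links():
    # Collect the post links that are currently present in the DOM.
    time.sleep(2)
    # find_elements_by_tag_name was removed in Selenium 4.3; use By.TAG_NAME instead.
    links = driver.find_elements(By.TAG_NAME, 'a')
    for link in links:
        post = link.get_attribute('href')
        # href can be None for anchors without a target, so check it first.
        if post and '/p/' in post:
            posts.add(post)

posts = set()  # a set, so links seen on several passes are stored once
get_links()
scrolldown = driver.execute_script(SCROLL_JS)
while True:
    get_links()
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script(SCROLL_JS)
    # Stop once scrolling no longer increases the page height.
    if last_count == scrolldown:
        break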
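
The same idea can be written so that the collection loop does not depend on a live browser, which makes it easier to check. This is only a sketch under assumptions: the names `collect_post_links`, `collect_from_driver`, `get_hrefs`, `scroll_to_bottom` and `max_rounds` are made up for illustration, and the explanation it relies on (that the page keeps only the posts near the viewport in the DOM, so a single pass at the end sees just the last batch) is a likely cause rather than something confirmed above.

```python
import time
from typing import Callable, Iterable, Optional, Set


def collect_post_links(
    get_hrefs: Callable[[], Iterable[Optional[str]]],
    scroll_to_bottom: Callable[[], int],
    pause: float = 2.0,
    max_rounds: int = 500,
) -> Set[str]:
    """Harvest '/p/' links before and after every scroll step.

    get_hrefs        -- returns the href values currently in the DOM
    scroll_to_bottom -- scrolls down and returns the page height
    max_rounds       -- hypothetical safety cap so the loop always ends
    """
    posts: Set[str] = set()

    def harvest() -> None:
        for href in get_hrefs():
            # Skip anchors without an href and links that are not posts.
            if href and "/p/" in href:
                posts.add(href)

    harvest()  # links visible before the first scroll
    last_height = -1
    for _ in range(max_rounds):
        height = scroll_to_bottom()
        time.sleep(pause)  # give the page time to load the next batch
        harvest()
        if height == last_height:
            break  # the height stopped growing, so nothing new was loaded
        last_height = height
    return posts


def collect_from_driver(driver, pause: float = 2.0) -> Set[str]:
    """Wire the loop above to an existing Selenium WebDriver instance."""
    # Imported here so collect_post_links itself has no Selenium dependency.
    from selenium.webdriver.common.by import By

    def get_hrefs():
        return [a.get_attribute("href")
                for a in driver.find_elements(By.TAG_NAME, "a")]

    def scroll_to_bottom():
        return driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);"
            "return document.body.scrollHeight;"
        )

    return collect_post_links(get_hrefs, scroll_to_bottom, pause)
```

With a driver that already has the profile page open, `posts = collect_from_driver(driver)` would replace the loop above. Because the result is a set, links that stay visible across several scroll steps are stored only once.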
