Selenium Can't Get All Elements on Instagram

I'm writing a Python Selenium script to scrape a user's Instagram posts. If a user has 62 posts, I want to get all 62.

I scroll down until all posts have loaded and then select the post elements with XPath. That works, but I only get 29 elements, not all 62.

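    import time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # driver and scroll() are defined elsewhere in my script (not shown)
    driver.get("https://instagram.com/celmirashop/")

    # scroll until all posts are loaded
    scroll()
    wait = WebDriverWait(driver, 15)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.eLAPa")))

    time.sleep(30)

    # get the list of post cards
    list_cards = driver.find_elements(By.XPATH, "//*[@class='v1Nh3 kIKUG  _bz0w']")
    print(len(list_cards))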

If a user has 62 posts, I want to get the elements for all 62 posts.

As you scroll, Instagram loads 12 new images and removes 12 images that have already scrolled past. I solved this by saving the images currently on the page after every scroll, so they are stored in a variable before Instagram removes them.

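    import time
    from selenium.webdriver.common.by import By

    driver.get("https://instagram.com/celmirashop/")

    semua_url_lengkap = []   # full post URLs
    semua_url_post = []      # post IDs taken from the URLs
    for nomor in range(1, 51):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        print(nomor)
        time.sleep(2)  # give the next batch of posts time to load

        # get the post cards that are currently in the DOM
        article = driver.find_element(By.TAG_NAME, "article")
        list_cards = article.find_elements(By.TAG_NAME, "a")

        for item in list_cards:
            url_lengkap = item.get_attribute("href")
            # the same card is seen on several scrolls, so skip duplicates
            if not url_lengkap or url_lengkap in semua_url_lengkap:
                continue
            semua_url_lengkap.append(url_lengkap)

            # e.g. ".../<id>/" splits into [..., "<id>", ""]
            segmen = url_lengkap.rsplit('/', 2)
            semua_url_post.append(segmen[1])

    print(len(semua_url_post))
    print(semua_url_post)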
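
The `rsplit('/', 2)` step takes the second-to-last path segment of every link inside the article, whether or not the link is actually a post. A slightly stricter variant might look like the sketch below. It assumes post links have the form `https://www.instagram.com/p/<shortcode>/`, which is not stated in the question, and the helper names `post_shortcode` and `unique_shortcodes` are made up for illustration.

```python
from urllib.parse import urlparse


def post_shortcode(href):
    """Return the shortcode of a post link, or None if href is not a post link.

    Assumption: post links look like https://www.instagram.com/p/<shortcode>/.
    """
    if not href:
        return None
    # "/p/ABC123/" -> ["p", "ABC123"]; empty segments are dropped
    parts = [p for p in urlparse(href).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "p":
        return parts[1]
    return None


def unique_shortcodes(hrefs):
    """Return each shortcode once, in the order it was first seen."""
    seen = {}  # dicts keep insertion order, so this also deduplicates
    for href in hrefs:
        code = post_shortcode(href)
        if code is not None:
            seen.setdefault(code, href)
    return list(seen)
```

Feeding it every `href` collected across all scrolls, for example `unique_shortcodes(semua_url_lengkap)`, would give one entry per post even though the same card is seen on several scrolls.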

Instagram is designed in a way that makes it hard to scrape. Elements are lazy-loaded, so as you scroll, some elements may also disappear.

I'd use a generic XPath that is unlikely to change, such as //a//img, because the class names are randomized and will change again.

Also, since you already have a method to scroll, start at the top of the page. Record all the elements you can see, scroll a bit further, record again, and repeat. Keep looping until you reach an end-of-page element such as //footer.
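
A rough sketch of that record-then-scroll loop is shown here. The function name `collect_post_links` and its parameters are hypothetical, and the XPath `//a[.//img]` (anchors that contain an image) is a variant of `//a//img` chosen so that the `href` can be read directly. Instead of looking for `//footer`, this version stops after a few rounds that yield no new links, which is a substitute stop condition rather than the one described above. The locator is passed as the plain string `"xpath"`, which is the value of Selenium's `By.XPATH`.

```python
import time

XPATH = "xpath"  # same string value as selenium's By.XPATH


def collect_post_links(driver, max_rounds=50, pause=2.0, idle_limit=3):
    """Scroll step by step and record links before the page recycles them.

    driver is expected to be a Selenium WebDriver. All parameter names
    and defaults here are illustrative assumptions.
    """
    links = {}       # insertion-ordered, so duplicates are dropped for free
    idle_rounds = 0  # consecutive rounds that found nothing new
    for _ in range(max_rounds):
        before = len(links)
        # Record whatever is in the DOM right now.
        for anchor in driver.find_elements(XPATH, "//a[.//img]"):
            href = anchor.get_attribute("href")
            if href:
                links.setdefault(href, None)
        idle_rounds = idle_rounds + 1 if len(links) == before else 0
        if idle_rounds >= idle_limit:
            break    # nothing new for a while: probably at the end
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the next batch to load
    return list(links)
```

On a real page an element can be removed between `find_elements` and `get_attribute`, in which case Selenium raises `StaleElementReferenceException`; a production version would likely need to catch that and retry the round.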
