Twitter scraper: I can't fetch all the tweets on the page (Selenium)

I'm trying to build a bot that deletes tweets from a specific date, so I have to scroll the page repeatedly to load more tweets. Scrolling itself is not the problem. The problem is that when I read the tweet info, only the last five tweets are fetched. When I scroll the page with this line of code:

driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

it leads to a gap between the tweet dates.

Output example (not real data):

1- tweet1 date = 22 jan 23
....
5- tweet5 date = 20 jan 23
6- tweet6 date = 13 dec 22

There is a gap between outputs 5 and 6.

How can I load all the tweets on the page before scrolling it?

import time

from selenium.webdriver.common.by import By

# `driver` is assumed to be an existing, logged-in Selenium WebDriver

# Go to profile page
driver.get("https://twitter.com/MYACC")

# load tweets
tweets = driver.find_elements(By.XPATH, "//article[@data-testid='tweet']")
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    for tweet in tweets:
        print(tweet.text)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

    # give the next batch time to load before measuring the page again
    time.sleep(5)
    new_height = driver.execute_script("return document.body.scrollHeight")

    tweets = driver.find_elements(By.XPATH, "//article[@data-testid='tweet']")

    # stop once scrolling no longer makes the page taller
    if last_height == new_height:
        break
    last_height = new_height

driver.quit()

This is the code.

I tried to scroll the page by 5000 pixels instead, but it stops at a certain point and then starts running again from the beginning.

As we will see in a moment, there is no way to load all the tweets on the page at once. The following GIF shows the HTML of the elements (tweets and other things) contained in the timeline as I scroll down the page. Each element can be identified by its transform style attribute: the first tweet has translateY(0px), the second tweet has translateY(735.6px), and so on. As I scroll down, you can see that the first tweets disappear from the HTML and new ones appear at the bottom.

[GIF: HTML of the timeline elements while scrolling down the page]
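To observe the same behavior from a script, one option is to read each timeline cell's style attribute and extract its translateY offset. The following is a minimal sketch, assuming the style string looks like `transform: translateY(735.6px); ...` as described above; the helper name `translate_y` is hypothetical and not part of Selenium.

```python
import re
from typing import Optional

# Matches e.g. "translateY(735.6px)" and captures the numeric offset.
_TRANSLATE_Y = re.compile(r"translateY\(\s*(-?\d+(?:\.\d+)?)px\s*\)")


def translate_y(style: Optional[str]) -> Optional[float]:
    """Return the translateY offset in pixels found in a style string, or None."""
    match = _TRANSLATE_Y.search(style or "")
    return float(match.group(1)) if match else None
```

With Selenium, the style string could come from `cell.get_attribute("style")` for each `div[data-testid=cellInnerDiv]` element. Printing the offsets before and after a scroll should show the smallest values dropping out of the list while larger ones appear.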

From a scraping point of view, scrolling can be messy. A better and cleaner strategy is to remove each tweet's element from the HTML once you are done with it. New tweets then get loaded automatically, as if you were scrolling.

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get(...)

# wait for tweets to appear in the page
first_tweet = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-testid=cellInnerDiv]")))
# scroll to first tweet
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', first_tweet)
time.sleep(1)

date_for_deletion = '2022-05-21'
stop_condition = 50  # loop index at which to stop

for i in range(999):
    # re-query on every iteration: the first cell changes after each removal
    tweets = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid=cellInnerDiv]')
    is_a_tweet = tweets[0].find_elements(By.XPATH, './/article[@data-testid="tweet"]')
    
    # check if list is not empty (i.e. check if element is a tweet or a "who to follow" element)
    if is_a_tweet:
        # the datetime attribute is an ISO timestamp; keep only the date part
        date = tweets[0].find_element(By.XPATH, './/time').get_attribute('datetime').split('T')[0]
        print(date, is_a_tweet[0].text[:90].replace('\n',''))
        if date == date_for_deletion:
            # delete the tweet
            ...
        time.sleep(1)

    # delete element from HTML
    driver.execute_script('var element = arguments[0]; element.remove();', tweets[0])
    if i == stop_condition:
        break

You will get output such as:

2022-05-30 Text of tweet 1
2022-05-28 Text of tweet 2
2022-05-27 Text of tweet 3
...
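The date check can also be done on parsed dates rather than on split strings, which may help if the comparison later needs to be something other than exact equality. This is a minimal sketch, assuming the `datetime` attribute is an ISO 8601 UTC timestamp ending in `Z` (the sample value in the usage note is made up); `tweet_date` and `should_delete` are hypothetical helper names.

```python
from datetime import date, datetime


def tweet_date(datetime_attr: str) -> date:
    """Parse an ISO 8601 timestamp string and return its (UTC) calendar date."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(datetime_attr.replace("Z", "+00:00")).date()


def should_delete(datetime_attr: str, target: date) -> bool:
    """Return True if the tweet was posted on the target date."""
    return tweet_date(datetime_attr) == target
```

For example, `should_delete("2022-05-21T09:15:00.000Z", date(2022, 5, 21))` evaluates to `True`. Note that the result is a UTC date, just like the string-splitting approach, so it can differ from the date shown in the browser for tweets posted near midnight.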
