Print out selenium text variable

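I have a function that extracts data from a Twitter page, but when the script finishes I don't get any output. The function is meant to output various pieces of information from a tweet. I just want to print out the second tweet on the page.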

Card definition

Function

from selenium.common.exceptions import NoSuchElementException


def get_tweet_data(card):
    """Extract data from a single tweet card element."""
    username = card.find_element_by_xpath(".//span").text
    handle = card.find_element_by_xpath('.//span[contains(text(), "@")]').text

    try:
        postdate = card.find_element_by_xpath('.//time').get_attribute('datetime')
    except NoSuchElementException:
        # No <time> element in this card: nothing to return
        return

    comment = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text
    responding = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text

    text = comment + responding  # join both text fields

    reply_cnt = card.find_element_by_xpath('.//div[@data-testid="reply"]').text
    retweet_cnt = card.find_element_by_xpath('.//div[@data-testid="retweet"]').text
    like_cnt = card.find_element_by_xpath('.//div[@data-testid="like"]').text

    tweet = (username, handle, postdate, text, reply_cnt, retweet_cnt, like_cnt)
    return tweet

Command-line arguments

python twitter.py get_tweet_data(1)
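
A side note on the command above, offered as an assumption because the rest of twitter.py is not shown: Python passes `get_tweet_data(1)` to the script as a plain string in `sys.argv` and never evaluates it as a function call, and a value returned by a function is only visible if the script prints it. A minimal sketch of an entry point that reads a number from the command line and prints the result; `get_tweet_summary` and `main` are hypothetical stand-ins, not part of the original script:

```python
import sys


def get_tweet_summary(index: int) -> tuple:
    """Hypothetical stand-in for the scraping function; returns made-up data."""
    return ("example user", "@example", f"tweet #{index}")


def main(argv: list) -> tuple:
    # argv[1] arrives as a plain string (e.g. "2"); Python never evaluates it as code.
    try:
        index = int(argv[1])
    except (IndexError, ValueError):
        index = 1  # fall back to the first tweet when no usable number is given
    result = get_tweet_summary(index)
    print(result)  # a returned value is only visible if it is printed
    return result


if __name__ == "__main__":
    main(sys.argv)
```

With an entry point like this, the script would be run as `python twitter.py 2`.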

So, this one took a while, but I was able to get the information for you. Going through Twitter's HTML, I needed 6 different xpath calls:

# Count of number of Tweets
(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])

# First Card
(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])[1]

# Twitter Card Likes, Retweets, Replies
(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])[1]//div[contains(@aria-label, 'likes')]

# Twitter's Text Content 
(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])[1]//div[@lang]

# Twitter's DateTime
(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])[1]//time[@datetime]

# Twitter href is the Twitter Account Poster
((//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']//div[@data-testId='tweet'])[1]//a[@role='link'])[1]
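
All of these expressions share the same base that selects the tweet cards, so they could be assembled from a single constant. A minimal sketch, assuming the hypothetical names `TWEET_CARDS` and `card_xpath` (neither is part of Selenium nor of the program further down):

```python
# Shared base XPath that matches every tweet card (copied from the expressions above).
TWEET_CARDS = (
    "(//main[@role='main']//div[@data-testid='primaryColumn']"
    "//section[@aria-labelledby='accessible-list-0']"
    "//div[contains(@aria-label, 'Timeline:')]"
    "//div[contains(@style, 'position: absolute; width: 100%;')]"
    "//article[@role='article']//div[@data-testId='tweet'])"
)


def card_xpath(card_number: int, suffix: str = "") -> str:
    """Return the XPath for the given 1-based card, optionally narrowed by a suffix."""
    if card_number < 1:
        raise ValueError("XPath positions are 1-based")
    return f"{TWEET_CARDS}[{card_number}]{suffix}"


print(card_xpath(1))                       # the first card
print(card_xpath(2, "//time[@datetime]"))  # the date element of the second card
```

The poster link needs one extra pair of parentheses, which could be written as `"(" + card_xpath(1, "//a[@role='link']") + ")[1]"`.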

Once I had identified the correct xpath calls, I created a class to store my data

class Twitter_Info:
    """This class contains the information regarding to the Twitter Card"""
    CardNumber : int
    Likes : int
    Retweets : int
    Replies : int
    ContentInfo : str
    PostDate : str
    PosterAccount : str
    
    def print_info(self):
        print(f'Card Number: {self.CardNumber}')
        print(f'Poster Account: {self.PosterAccount}')
        print(f'Tweet Date: {self.PostDate}')
        print(f'Likes: {self.Likes}')
        print(f'Replies: {self.Replies}')
        print(f'Retweets: {self.Retweets}')
        print(f'Tweet Content: {self.ContentInfo}')
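
Because the class above only declares annotations, its attributes exist only after they have been assigned. As a possible alternative (not what this answer uses), `dataclasses.dataclass` would generate a constructor from the same kind of annotations. A minimal sketch with a hypothetical class name `TweetCard` and illustrative values:

```python
from dataclasses import dataclass


@dataclass
class TweetCard:
    """Hypothetical dataclass variant of Twitter_Info."""
    card_number: int
    likes: int
    retweets: int
    replies: int
    content: str
    post_date: str
    poster_account: str

    def summary(self) -> str:
        """Return a one-line description of the engagement counts."""
        return (f"Card {self.card_number}: {self.likes} likes, "
                f"{self.retweets} retweets, {self.replies} replies")


# Illustrative values only
card = TweetCard(1, 100, 36, 10, "example text",
                 "2020-11-16T17:01:00.000Z", "https://twitter.com/BBC")
print(card.summary())
```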

Once that was done, I added different methods to help with the task at hand

  • wait_for_tweets_to_load
  • number_of_tweets_displayed
  • scroll_to_card
  • get_card_likes_retweets_replies
  • get_card_text_content
  • get_card_datetime
  • get_card_poster_info

Once these were defined, I could scroll to each card and scrape the data
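
The waiting step, wait_for_tweets_to_load, boils down to a polling loop: check a condition, sleep, and give up after a fixed number of attempts. A minimal sketch of that idea without Selenium, assuming a hypothetical helper `wait_until` and a stand-in condition in place of a real browser check:

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll `condition` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if condition():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no need to sleep after the final attempt
    return False


# Stand-in for "the loading indicator is gone": becomes true on the third check.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), attempts=5, delay=0.01))
```

Returning a boolean leaves it to the caller to decide whether a timeout should raise an exception.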

Main program - for reference

from selenium import webdriver
from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as DriverWait
from selenium.webdriver.support import expected_conditions as DriverConditions
from selenium.common.exceptions import WebDriverException
import time


class Twitter_Info:
    """This class contains the information regarding to the Twitter Card"""
    CardNumber : int
    Likes : int
    Retweets : int
    Replies : int
    ContentInfo : str
    PostDate : str
    PosterAccount : str
    
    def print_info(self):
        print(f'Card Number: {self.CardNumber}')
        print(f'Poster Account: {self.PosterAccount}')
        print(f'Tweet Date: {self.PostDate}')
        print(f'Likes: {self.Likes}')
        print(f'Replies: {self.Replies}')
        print(f'Retweets: {self.Retweets}')
        print(f'Tweet Content: {self.ContentInfo}')


def get_chrome_driver():
    """This sets up our Chrome Driver and returns it as an object"""
    path_to_chrome = r"F:\Selenium_Drivers\Windows_Chrome85_Driver\chromedriver.exe"
    chrome_options = webdriver.ChromeOptions() 
    
    # Browser is displayed in a custom window size
    chrome_options.add_argument("window-size=1500,1000")
    
    return webdriver.Chrome(executable_path = path_to_chrome,
                            options = chrome_options)

    
def wait_displayed(driver : ChromeDriver, xpath: str, timeout = 5):
    """Wait until the element is present; raise if it does not appear in time"""
    try:
        DriverWait(driver, timeout).until(
            DriverConditions.presence_of_element_located((By.XPATH, xpath))
        )
    except Exception as error:
        raise WebDriverException(f'Timeout: Failed to find {xpath}') from error
    

def is_displayed(driver : ChromeDriver, xpath: str, timeout = 5):
    """Return True if the element is present within the timeout, else False"""
    try:
        webElement = DriverWait(driver, timeout).until(
            DriverConditions.presence_of_element_located((By.XPATH, xpath))
        )
        return webElement is not None
    except Exception:
        return False
    

def scroll_to_element(driver : ChromeDriver, xpath: str, timeout = 5):
    """Wait for the element, then scroll it into view"""
    try:
        webElement = DriverWait(driver, timeout).until(
            DriverConditions.presence_of_element_located((By.XPATH, xpath))
        )
        driver.execute_script("arguments[0].scrollIntoView();", webElement)
    except Exception as error:
        raise WebDriverException(f'Timeout: Failed to find {xpath}\nResult: Failed to Scroll') from error


def wait_for_tweets_to_load(driver : ChromeDriver):
    """Wait until the 'Loading Tweets' indicator disappears (up to 10 checks)"""
    loading_xpath = "//main[@role='main']//div[@data-testid='primaryColumn']//div[contains(@aria-label, 'Loading Tweets')]"
    for counter in range(10):
        if not is_displayed(driver, loading_xpath):
            return
        if counter == 9:
            raise Exception("Page Failed To Load Tweets")
        time.sleep(3)
        

def number_of_tweets_displayed(driver : ChromeDriver):
    """Note: This number will change dynamically when we scroll down on the page ( new Tweets will start loading )"""
    xpath = "{0}{1}{2}".format("(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]",
                               "//article[@role='article']//div[@data-testId='tweet'])")
    return len(driver.find_elements(By.XPATH, xpath))


def scroll_to_card(driver : ChromeDriver, card_number : int):
    xpath = "{0}{1}{2}".format("(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]",
                               "//article[@role='article']//div[@data-testId='tweet'])")
    scroll_to_element(driver, xpath = f'{xpath}[{card_number}]')
    
    
def get_card_likes_retweets_replies(driver : ChromeDriver, card_number : int):
    xpath = "{0}{1}{2}".format("(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]",
                               "//article[@role='article']//div[@data-testId='tweet'])")
    xpath = f'{xpath}[{card_number}]//div[contains(@aria-label, "likes")]'
    return driver.find_element(By.XPATH, xpath).get_attribute('aria-label').split(',')


def get_card_text_content(driver : ChromeDriver, card_number : int):
    xpath = "{0}{1}{2}".format("(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]",
                               "//article[@role='article']//div[@data-testId='tweet'])")
    xpath = f'{xpath}[{card_number}]//div[@lang]'
    return driver.find_element(By.XPATH, xpath).text


def get_card_datetime(driver : ChromeDriver, card_number : int):
    xpath = "{0}{1}{2}".format("(//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]",
                               "//article[@role='article']//div[@data-testId='tweet'])")
    xpath = f'{xpath}[{card_number}]//time[@datetime]'
    return driver.find_element(By.XPATH, xpath).get_attribute('datetime')


def get_card_poster_info(driver : ChromeDriver, card_number : int):
    xpath = "{0}{1}{2}".format("((//main[@role='main']//div[@data-testid='primaryColumn']//section[@aria-labelledby='accessible-list-0']",
                               "//div[contains(@aria-label, 'Timeline:')]//div[contains(@style, 'position: absolute; width: 100%;')]//article[@role='article']",
                               "//div[@data-testId='tweet'])")
    xpath = f'{xpath}[{card_number}]//a[@role="link"])[1]'
    return driver.find_element(By.XPATH, xpath).get_attribute('href')



# Gets our chrome driver and opens our site
chrome_driver = get_chrome_driver()
chrome_driver.get("https://twitter.com/bbc")
wait_displayed(chrome_driver, "//div[@data-testid='placementTracking']//div[@role='button']//span[text()='Follow']")
wait_displayed(chrome_driver, "//section[@aria-label='Sign up']")
wait_displayed(chrome_driver, "//aside[@aria-label='Who to follow']")
wait_for_tweets_to_load(chrome_driver)

# Get number of Tweets that are displayed
numberOfTweetsDisplayed = number_of_tweets_displayed(chrome_driver)
twitter_cards = []

# Scrape Card Information
for cards in range(numberOfTweetsDisplayed):
    scroll_to_card(chrome_driver, (cards + 1))
    twitter_card = Twitter_Info()
    twitter_card.CardNumber = cards + 1
    
    # Get the Like | Retweet | Replies Info
    raw_info = get_card_likes_retweets_replies(chrome_driver, (cards + 1))
    twitter_card.Replies = raw_info[0].strip().split(' ')[0]
    twitter_card.Retweets = raw_info[1].strip().split(' ')[0]
    twitter_card.Likes = raw_info[2].strip().split(' ')[0]
    
    # Get rest of our data
    twitter_card.ContentInfo = get_card_text_content(chrome_driver, (cards + 1))
    twitter_card.PostDate = get_card_datetime(chrome_driver, (cards + 1))
    twitter_card.PosterAccount = get_card_poster_info(chrome_driver, (cards + 1))
    
    # Display our information and add it to our list
    twitter_card.print_info()
    twitter_cards.append(twitter_card)
    print(f'Added Card Number {(cards + 1)} successfully')
    print('========================================================\n')

# Print how many twitter cards were scraped
print(f'Twitter Cards Added: {len(twitter_cards)}')
# quit() closes the browser and also stops the chromedriver service
chrome_driver.quit()
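
The main loop stores the like, retweet and reply counts as strings taken from the aria-label. If integers are wanted, the parsing could be moved into a helper. A minimal sketch, assuming a hypothetical function `parse_engagement` and a label in the order the loop above relies on (replies, retweets, likes); the sample label is illustrative, not captured from Twitter:

```python
def parse_engagement(aria_label: str) -> dict:
    """Split a label like '10 replies, 36 Retweets, 100 likes' into integer counts."""
    parts = [part.strip() for part in aria_label.split(",")]
    if len(parts) < 3:
        raise ValueError(f"Unexpected label format: {aria_label!r}")
    # The leading token of each part is the count; order is replies, retweets, likes.
    replies, retweets, likes = (int(part.split(" ")[0]) for part in parts[:3])
    return {"replies": replies, "retweets": retweets, "likes": likes}


print(parse_engagement("10 replies, 36 Retweets, 100 likes"))
```

A label with a different layout raises `ValueError` instead of silently producing wrong numbers.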

Sample output

Card Number: 1
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-06-22T11:22:53.000Z
Likes: 1106
Replies: 2827
Retweets: 841
Tweet Content: We’ve always been here to celebrate diversity. But we need to do more, and we will. 

This is our commitment to long-term change. #RightTheScript
 
Read more about our £100m commitment here: https://bbc.in/37OPMLv
Added Card Number 1 successfully
========================================================

Card Number: 2
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-11-16T17:01:00.000Z
Likes: 100
Replies: 10
Retweets: 36
Tweet Content: More than 100 intact sarcophagi, dating back 2,500 years, have been unearthed near Cairo.
Added Card Number 2 successfully
========================================================

Card Number: 3
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-11-15T16:01:00.000Z
Likes: 68
Replies: 5
Retweets: 16
Tweet Content: With Cornish wildlife facing so many threats from humans, these residents do whatever they can to help
#Cornwall with
@simon_reeve
 | 8:10pm |
@bbctwo
 &
@bbciplayer
.
Added Card Number 3 successfully
========================================================

Card Number: 4
Poster Account: https://twitter.com/bbcasiannetwork
Tweet Date: 2020-11-14T09:44:41.000Z
Likes: 133
Replies: 7
Retweets: 33
Tweet Content: Happy Diwali and Bandi Chhor Divas!
Added Card Number 4 successfully
========================================================

Card Number: 5
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-11-13T22:18:26.000Z
Likes: 443
Replies: 13
Retweets: 86
Tweet Content: It's the clash of the tennis titans
@Andy_Murray
 and... er,
@petercrouch
?
 #ChildrenInNeed
Added Card Number 5 successfully
========================================================

Card Number: 6
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-11-13T20:57:23.000Z
Likes: 426
Replies: 25
Retweets: 109
Tweet Content: The official video for this year's star-studded
@bbccin
 single, 'Stop Crying Your Heart Out' is here!
Watch now and don't forget to download the song to support #ChildrenInNeed
  https://bbc.in/32I60EZ
Added Card Number 6 successfully
========================================================

Card Number: 7
Poster Account: https://twitter.com/BBC
Tweet Date: 2020-11-13T15:37:06.000Z
Likes: 18
Replies: 7
Retweets: 7
Tweet Content: It's time for #ChildrenInNeed
 2020!

Starting RIGHT NOW on
@BBCOne
 &
@BBCiPlayer


http://bbc.in/3kuv1cG
Added Card Number 7 successfully
========================================================

Twitter Cards Added: 7
