
Retrieve all elements in Google Trends data using Selenium in Python

I am trying to write a Python program to gather data from Google Trends (GT): specifically, I want to automatically open URLs and read the values displayed in each result's title. I have written the code and I am able to scrape data successfully, but when I compare the data returned by the code with what is shown on the page, the results are only partially returned. For example, in the image below, the code returns only the first part of the title, "Manchester United FC • Tottenham Hotspur FC", but the actual website shows more entries: "Manchester United FC • Tottenham Hotspur FC • International Champions Cup • Manchester".


We have tried all the possible ways of locating elements on the page, but we are still unable to find a fix for this. We did not want to use Scrapy or BeautifulSoup for this.

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    links = ["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]

    for link in links:
        Title_temp = []
        Title = ''
        seleniumDriver = r"C:/Users/Downloads/chromedriver_win32/chromedriver.exe"
        chrome_options = Options()
        brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
        try:
            brow.get(link)  # open the URL
            try:
                content = brow.find_elements_by_class_name("details-top")
                for element in content:
                    Title_temp.append(element.text)
                Title = ' '.join(Title_temp)
            except Exception:
                Title = ''
            brow.quit()

        except Exception as error:
            print(error)
            break

    Final_df = pd.DataFrame({'Title': Title_temp})

Here is the code that printed all the information. The key difference is that it explicitly waits for the result cards to be present before reading them:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
url = "https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"
driver.get(url)
# wait until the result cards are present before reading them
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME, 'details-top')))
Title_temp = []
try:
    content = driver.find_elements_by_class_name("details-top")
    for element in content:
        Title_temp.append(element.text)
    Title = ' '.join(Title_temp)
except Exception:
    Title = ''
print(Title_temp)
driver.close()

Here is the output.

['Hertha BSC • Fenerbahçe SK • Bundesliga • Ante Čović • Berlin', 'Eintracht Frankfurt • UEFA Europa League • Tallinn • Estonia • Frankfurt', 'FC Augsburg • Galatasaray SK • Martin Schmidt • Bundesliga • Stefan Reuter', 'Austria national football team • FIFA • Austria • FIFA World Rankings', 'Lechia Gdańsk • Brøndby IF • 2019–20 UEFA Europa League • Gdańsk', 'Alexander Zverev • Hamburg', 'Julian Lenz • Association of Tennis Professionals • Alexander Zverev', 'UEFA Europa League • Diego • Nairo Quintana • Tour de France']


We were able to find a fix for this. We had to scrape the inner HTML and then do a bit of data cleaning to get the required records:

import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from bs4.element import Comment
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# html parser
def parse_html(content):
    soup = BeautifulSoup(content, 'html.parser')
    text_elements = soup.findAll(text=True)
    tag_blacklist = ['style', 'script', 'head', 'title', 'meta', '[document]', 'img']
    clean_text = []
    for element in text_elements:
        if element.parent.name in tag_blacklist or isinstance(element, Comment):
            continue
        clean_text.append(element.strip())
    result_text = " ".join(clean_text)
    result_text = re.sub(r'[\r\n]', '', result_text)   # strip newlines
    result_text = re.sub(r'<[^>]+>', '', result_text)  # strip any leftover tags
    result_text = re.sub(r'\\', '', result_text)       # strip backslashes
    return result_text

seleniumDriver = r"./chromedriver.exe"
chrome_options = Options()
brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
links = ["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]
title_temp = []
for link in links:
    try:
        brow.get(link)
        try:
            elements = brow.find_elements_by_class_name('details-top')
            for element in elements:
                html_text = parse_html(element.get_attribute("innerHTML"))
                title_temp.append(html_text.replace('share', '').strip())
        except Exception as error:
            print(error)
        time.sleep(1)
    except Exception as error:
        print(error)
        break
brow.quit()  # quit after processing all links, not inside the loop

Final_df = pd.DataFrame({'Title': title_temp})

print(Final_df)
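As part of the data cleaning mentioned above, the scraped titles join related entities with a "•" separator, so splitting on it yields the individual entity names. The `split_title` helper below is a hypothetical sketch of that step, not part of the original answer:

```python
def split_title(title):
    """Split a combined trends title on the '•' separator into entity names."""
    # hypothetical helper: strip whitespace and drop empty fragments
    return [part.strip() for part in title.split('•') if part.strip()]

combined = 'Manchester United FC • Tottenham Hotspur FC • International Champions Cup'
print(split_title(combined))
# → ['Manchester United FC', 'Tottenham Hotspur FC', 'International Champions Cup']
```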

From what I can see, the data is retrieved from an API endpoint you can call directly. I show how to call it and then extract only the title (note that the API returns more than just titles). You can explore the breadth of what is returned (which includes article snippets, URLs, image links, etc.) by inspecting the full JSON response.

import json
import requests

r = requests.get('https://trends.google.com/trends/api/realtimetrends?hl=en-GB&tz=-60&cat=s&fi=0&fs=0&geo=DE&ri=300&rs=20&sort=0')
# the response starts with an anti-JSON-hijacking prefix ()]}' plus a newline)
# that must be stripped before parsing
data = json.loads(r.text[5:])
titles = [story['title'] for story in data['storySummaries']['trendingStories']]
print(titles)
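The slice `r.text[5:]` assumes the security prefix is always exactly five characters. A slightly more defensive sketch (using a hypothetical `parse_trending` helper) cuts at the first `{` instead, so it still works if the prefix ever changes length:

```python
import json

def parse_trending(raw):
    """Parse a Google Trends realtime payload, ignoring the )]}' prefix."""
    # hypothetical helper: start parsing at the first '{' rather than
    # assuming a fixed-length prefix
    payload = raw[raw.index('{'):]
    data = json.loads(payload)
    return [story['title'] for story in data['storySummaries']['trendingStories']]

# simulate a response body with the anti-hijacking prefix attached
sample = ")]}'\n" + json.dumps(
    {'storySummaries': {'trendingStories': [{'title': 'Alexander Zverev • Hamburg'}]}})
print(parse_trending(sample))
# → ['Alexander Zverev • Hamburg']
```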

The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact yoyou2525@163.com.
