
Python - searching in Google and collecting title and description

I need to search Google for a few words taken from a CSV file, and to collect the URL, title, and Google description from each search result.

I managed to write a script that searches the words, but it collects only the URLs and stores them in a CSV file. I can't figure out how to collect the title and the Google description. Also, I need to return "missing results" when a searched word is not found.

from bs4 import BeautifulSoup
from googlesearch import search
import pandas as pd

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)

#print(keywords['keyword'])

rows = []

for i in keywords['keyword']:
    print('Search results for keyword: ', i)
    count = 0
    for j in search(i, tld="co.in", num=10, stop=3, pause=2, lang='en'):
        count += 1
        print('URL number ', count, ': ', j)
        # collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'keyword': i, 'url': j})

df = pd.DataFrame(rows, columns=['keyword', 'url'])
df.to_csv('final_dataset.csv', index=False)
  • Specify a user-agent so Google doesn't block the request ("What is my user-agent?").
  • Try the SelectorGadget Chrome extension to grab CSS selectors visually by clicking on the desired element.

Code and full example in the online IDE (note: the following code does not append to the .csv file; it overwrites the existing file):

import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)

# collected data
data = []

for query in keywords['keyword']:  # column name matches keywords.csv from the question
  html = requests.get(f'https://www.google.com/search?q={query}', headers=headers)
  soup = BeautifulSoup(html.text, 'lxml')

  for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # select_one returns None when no snippet is present, so guard instead of a bare except
    snippet_el = result.select_one('.lyLwlc')
    snippet = snippet_el.text if snippet_el else None

    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')

    # appending all data to array as dict()
    data.append({
      'title': title,
      'link': link,
      'displayed link': displayed_link,
      'snippet': snippet
    })

# create dataframe once, after all queries, and save it as .csv
df = pd.DataFrame(data)
df.to_csv('bs4_final.csv', index=False)
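The question also asks to return "missing results" when a searched word finds nothing, which the code above does not handle. A minimal sketch of one way to do it: the helper name `rows_for_query`, the fallback row shape, and the stub HTML below are my own assumptions, not part of the original answer; it reuses the same CSS selectors and runs offline against stub markup.

```python
from bs4 import BeautifulSoup

def rows_for_query(query, html):
    """Parse Google result HTML into one dict per result;
    emit a single 'missing results' row when nothing matched."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for result in soup.select('.tF2Cxc'):
        title_el = result.select_one('.DKV0Md')
        link_el = result.select_one('.yuRUbf a')
        snippet_el = result.select_one('.lyLwlc')
        rows.append({
            'keyword': query,
            'title': title_el.text if title_el else None,
            'link': link_el['href'] if link_el else None,
            'snippet': snippet_el.text if snippet_el else None,
        })
    if not rows:  # no organic results at all
        rows.append({'keyword': query, 'title': 'missing results',
                     'link': None, 'snippet': None})
    return rows

# offline demo with stub HTML mimicking the selectors used above
stub = ('<div class="tF2Cxc"><div class="yuRUbf">'
        '<a href="https://example.com"><h3 class="DKV0Md">Example</h3></a></div>'
        '<div class="lyLwlc">An example snippet.</div></div>')
print(rows_for_query('python', stub))
print(rows_for_query('some-unfindable-word', ''))
```

In the real script you would call `rows_for_query(query, html.text)` per keyword and feed the accumulated rows to `pd.DataFrame`.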

Alternatively, you can use the Google Organic Results API from SerpApi, as αԋɱҽԃ αмєяιcαη mentioned in the comments. It's a paid API with a free plan.

The main difference is that you only need to iterate over the JSON response, without any extra parsing.

Code to integrate:

from serpapi import GoogleSearch
import os
import pandas as pd

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)

# collect data across all queries (defined outside the loop
# so earlier queries aren't lost when the CSV is written)
data = []

for query in keywords['keyword']:  # column name matches keywords.csv from the question
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": query,
      "hl": "en",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        title = result['title']
        link = result['link']
        displayed_link = result['displayed_link']
        snippet = result['snippet']
                
        print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")

        data.append({
          'title': title,
          'link': link,
          'displayed link': displayed_link,
          'snippet': snippet
        })

# create dataframe once, after all queries, and save it as .csv
df = pd.DataFrame(data)
df.to_csv('serpapi_final.csv', index=False)
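The "missing results" requirement applies here too: when a query finds nothing, the response JSON may lack `organic_results`, and the loop above would raise a `KeyError`. A sketch of a guard (the helper name `organic_or_missing` and the fallback row shape are my own choices), demonstrated offline with stub responses:

```python
def organic_or_missing(results, query):
    """Map a SerpApi-style response dict to result rows;
    fall back to a 'missing results' row when there are no organic results."""
    organic = results.get('organic_results') or []
    if not organic:
        return [{'keyword': query, 'title': 'missing results',
                 'link': None, 'displayed link': None, 'snippet': None}]
    return [{'keyword': query,
             'title': r.get('title'),
             'link': r.get('link'),
             'displayed link': r.get('displayed_link'),
             'snippet': r.get('snippet')} for r in organic]

# offline demo with stub API responses
print(organic_or_missing({}, 'some-unfindable-word'))
print(organic_or_missing({'organic_results': [
    {'title': 'Example', 'link': 'https://example.com',
     'displayed_link': 'example.com', 'snippet': 'demo'}]}, 'python'))
```

In the script above you would replace the inner `for result in results['organic_results']:` loop with `data.extend(organic_or_missing(results, query))`.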

PS - I wrote a blog post with visuals (gifs) about how to scrape Google Organic Search results.

Disclaimer: I work for SerpApi.
