Python - searching in Google and collecting title and description
I need to search Google for a few words from a CSV file and collect, for each search result, the URL, title, and description. I managed to write a script that searches for the words, collects only the URLs, and stores the results in a CSV file. I can't figure out how to collect the title and description. Also, I need to return "missing results" if the word I search for is not found.
from bs4 import BeautifulSoup
from googlesearch import search
import pandas as pd

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)
# print(keywords['keyword'])

df = pd.DataFrame(columns=['keyword', 'url'])
for i in keywords['keyword']:
    print('Search results for keyword: ', i)
    count = 0
    for j in search(i, tld="co.in", num=10, stop=3, pause=2, lang='en'):
        count = count + 1
        print('URL number ', count, ': ', j)
        df = df.append({'keyword': i, 'url': j}, ignore_index=True)

df.to_csv(r'final_dataset.csv', index=False)
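Since googlesearch only yields URLs, one way to get the missing title and description is to fetch each collected URL and parse its head section. A minimal sketch, assuming the pages can be fetched directly (`parse_meta` and `fetch_meta` are hypothetical helper names, and some sites may still block requests without a browser-like user-agent):

```python
import requests
from bs4 import BeautifulSoup

def parse_meta(html: str) -> dict:
    """Return the page title and meta description from raw HTML,
    falling back to 'missing results' when either is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else 'missing results'
    tag = soup.find('meta', attrs={'name': 'description'})
    description = tag['content'] if tag and tag.has_attr('content') else 'missing results'
    return {'title': title, 'description': description}

def fetch_meta(url: str) -> dict:
    """Fetch one collected URL and extract its title/description."""
    resp = requests.get(url, headers={'User-agent': 'Mozilla/5.0'}, timeout=10)
    return parse_meta(resp.text)
```

Each row of the existing CSV could then be enriched by calling `fetch_meta(j)` inside the loop, at the cost of one extra request per result.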
Make sure you're passing a user-agent header in order not to be blocked by Google while making a request (search "what is my user-agent" to find yours). You can grab CSS selectors by clicking on the desired element in your browser's dev tools. Code and full example in the online IDE (note: the following code does not append to the .csv file; instead it overwrites the existing file):
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)

# collected data
data = []

for query in keywords['keywords']:
    html = requests.get(f'https://www.google.com/search?q={query}', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select('.tF2Cxc'):
        title = result.select_one('.DKV0Md').text
        link = result.select_one('.yuRUbf a')['href']
        displayed_link = result.select_one('.TbwUpd.NJjxre').text
        try:
            snippet = result.select_one('#rso .lyLwlc').text
        except AttributeError:
            snippet = None

        print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')

        # appending all data to array as dict()
        data.append({
            'title': title,
            'link': link,
            'displayed link': displayed_link,
            'snippet': snippet
        })

# create dataframe and save it as .csv
df = pd.DataFrame(data)
df.to_csv('bs4_final.csv', index=False)
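The question's "missing results" requirement isn't covered above: when Google finds nothing for a keyword, `soup.select('.tF2Cxc')` simply yields an empty list and no row is written. A small guard can emit a placeholder row instead. A sketch (`rows_for_query` is a hypothetical helper; the dict keys mirror the columns used in the loop):

```python
def rows_for_query(query: str, parsed_rows: list) -> list:
    """Return the parsed rows unchanged, or a single 'missing results'
    placeholder row when the search produced nothing for this keyword."""
    if not parsed_rows:
        return [{'keyword': query, 'title': 'missing results',
                 'link': None, 'displayed link': None, 'snippet': None}]
    return parsed_rows
```

In the loop above, collect each query's parsed dicts into a per-query list first, then extend `data` with `rows_for_query(query, rows)` so empty searches still produce a row in the final CSV.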
Alternatively, you can do it by using the Google Organic Results API from SerpApi, as αԋɱҽԃ αмєяιcαη mentioned in the comments. It's a paid API with a free plan. The main difference is that you only need to iterate over a JSON string without doing anything extra. Code to integrate:
from serpapi import GoogleSearch
import os
import pandas as pd

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)

# collected data (initialized once, outside the loop,
# so results from every query are kept)
data = []

for query in keywords['keywords']:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": query,
        "hl": "en",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        title = result['title']
        link = result['link']
        displayed_link = result['displayed_link']
        snippet = result['snippet']

        print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")

        data.append({
            'title': title,
            'link': link,
            'displayed link': displayed_link,
            'snippet': snippet
        })

df = pd.DataFrame(data)
df.to_csv('serpapi_final.csv', index=False)
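Not every organic result carries every field (snippets in particular can be absent), so indexing `result['snippet']` directly can raise a KeyError. Using `dict.get` with a default sidesteps that and doubles as the "missing results" marker. A sketch on a plain dict (`extract_row` is a hypothetical helper; the field names follow the response keys used above):

```python
def extract_row(result: dict) -> dict:
    """Pull the fields used above from one organic result,
    defaulting to 'missing results' when a key is absent."""
    return {
        'title': result.get('title', 'missing results'),
        'link': result.get('link'),
        'displayed link': result.get('displayed_link'),
        'snippet': result.get('snippet', 'missing results'),
    }
```

Inside the loop, `data.append(extract_row(result))` then replaces the four separate lookups.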
P.S. - I wrote a blog post with visuals (gifs) about how to scrape Google Organic Search results.
Disclaimer, I work for SerpApi.