How to scrape related searches on Google?
Given a list of keywords, I am trying to scrape related searches on Google and then output those related searches to a CSV file. My problem is getting Beautiful Soup to identify the related-searches HTML tag.
Here is an example of the HTML tag from the page source:
<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>
Here is my webdriver setup:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
user_agent = 'Chrome/100.0.4896.60'
webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))
capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True
Here is my code:
queries = ["iphone"]

driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)

df2 = []

driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()

# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
The p = soup.find_all is incorrect; I'm just not sure how to get BS to recognize these specific HTML tags. Any help would be great :)
@jakecohensol, as you pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector is .y6Uyqe .AB4Wff.
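The selector can be checked against a minimal HTML fragment. The class names .y6Uyqe and .AB4Wff come from Google's markup at the time of writing and may change as Google updates its pages; the fragment below is a made-up stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for Google's related-searches markup; the
# .y6Uyqe / .AB4Wff class names are assumptions that may go stale.
html = """
<div class="y6Uyqe">
  <div class="AB4Wff">iphone xr</div>
  <div class="AB4Wff">iphone 13</div>
</div>
<div class="other"><div class="AB4Wff">not related</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".y6Uyqe .AB4Wff" is a descendant selector: only .AB4Wff elements
# nested under a .y6Uyqe container are matched.
related = [el.text for el in soup.select(".y6Uyqe .AB4Wff")]
print(related)  # ['iphone xr', 'iphone 13']
```

Note that soup.find_all('div data-ved') fails because find_all expects a tag name (or keyword arguments for attributes), not a tag-plus-attribute string.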
The Chrome/100.0.4896.60 User-Agent header is also incorrect. Google blocks requests with agent strings like that. With a full user-agent string, Google returns a proper HTML response.
Google related searches can be scraped without a browser. It will be faster and more reliable.
Here is your fixed code snippet (link to the full code in an online IDE):
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]

df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    p = soup.select(".y6Uyqe .AB4Wff")
    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
Sample output:
,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Check what your user agent is, or collect several user agents for phones, tablets, PCs, or different operating systems so you can rotate them, which slightly reduces the chance of being blocked.
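User-agent rotation can be sketched with a small helper that picks a different string per request. The strings below are examples, not a maintained list; in practice you would keep them current:

```python
import random

# Example user-agent strings for different devices/OSes.
# These are illustrative; real pools should be kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Mobile/15E148 Safari/604.1",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Pass the result to requests.get(..., headers=random_headers())
print(random_headers())
```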
The ideal setup combines rotating user agents with rotating proxies (ideally residential), plus a CAPTCHA solver for the Google CAPTCHAs that will eventually appear.
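Proxy rotation follows the same pattern. A minimal sketch, assuming a list of proxy URLs you control (the endpoints below are placeholders, not real proxies):

```python
import random

# Placeholder proxy endpoints -- substitute your own
# (ideally residential) proxies and credentials.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def pick_proxy():
    """Build the mapping that requests.get(..., proxies=...) expects,
    choosing a random proxy endpoint for this request."""
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

# Usage: requests.get(url, headers=headers, proxies=pick_proxy())
print(pick_proxy())
```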
As an alternative, if you don't want to figure out how to build and maintain a parser from scratch, or how to bypass blocking by Google (or other search engines), you can use a Google Search Engine Results API to scrape Google search results.
Example code to integrate:
import os
import pandas as pd
from serpapi import GoogleSearch

queries = [
    'banana',
    'minecraft',
    'apple stock',
    'how to create a apple pie'
]

def serpapi_scrape_related_queries():
    related_searches = []

    for query in queries:
        print(f'extracting related queries from query: {query}')

        params = {
            'api_key': os.getenv('API_KEY'),  # your serpapi api key
            'device': 'desktop',              # device to retrieve results from
            'engine': 'google',               # serpapi parsing engine
            'q': query,                       # search query
            'gl': 'us',                       # country of the search
            'hl': 'en'                        # language of the search
        }

        search = GoogleSearch(params)  # where data extraction happens on the backend
        results = search.get_dict()    # JSON -> dict

        for result in results['related_searches']:
            query = result['query']
            link = result['link']

            related_searches.append({
                'query': query,
                'link': link
            })

    pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)

serpapi_scrape_related_queries()
Partial dataframe output:
query link
0 banana benefits https://www.google.com/search?gl=us&hl=en&q=Ba...
1 banana republic https://www.google.com/search?gl=us&hl=en&q=Ba...
2 banana tree https://www.google.com/search?gl=us&hl=en&q=Ba...
3 banana meaning https://www.google.com/search?gl=us&hl=en&q=Ba...
4 banana plant https://www.google.com/search?gl=us&hl=en&q=Ba...