How to scrape a span within nested div's from google search results through beautiful soup python
I have a CSV with the following columns: celebrity name, url, raw_html.
raw_html is the HTML associated with a Google search when you search for celebrity_name + age. For example, the raw HTML associated with the following search,
'https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=' + 'Jennifer' + '+' + 'Lopez' + '+' + 'age'
leads to this Google page: https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=Jennifer+Lopez+age
I want to store the age associated with every celebrity in my CSV. The problem I run into is when a celebrity has no Google answer box associated with them: I get the error list index out of range. In that case, I want to store the value as NaN instead.
My code:
age = {}
for iid, html in celeb_df[['celeb_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['HwtpBd gsrt PZPZlf']}) != None:
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d", "Z0LcW XcVN5d AZCkJd"]})[0]:
            print(iid, year)
            age[iid] = year
What I want is a dictionary where the keys are celebrity ids and the values are the ages associated with those celebrities. If a celebrity has no Google answer box associated with them, then I want to store the value as 'NaN'.
What is the best way to do this?
To return the age value as NaN, you can surround it with a try/except block, like so:
try:
    age = soup.find('div', class_='class_name').text
except:
    age = None
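As a side note, a bare except will also hide unrelated errors. A minimal alternative sketch, assuming the same soup object and class name, checks the lookup result for None explicitly instead:

age_tag = soup.find('div', class_='class_name')
# .find() returns None when nothing matches, so guard before reading .text
age = age_tag.text if age_tag is not None else None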
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, csv

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# celebrity list
celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('awesome_file.csv', mode='w') as csv_file:
    # defining column names
    fieldnames = ['Celebrity name', 'Age']

    # defining .csv writer
    # https://docs.python.org/3/library/csv.html#csv.DictWriter
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    # writing (creating) columns
    writer.writeheader()

    # collecting scraped data
    celeb_data = []

    # iterating over the celebrity names list()
    for celeb in celeb_list:
        html = requests.get(f'https://www.google.com/search?q={celeb} age', headers=headers).text
        soup = BeautifulSoup(html, 'lxml')

        name = celeb

        try:
            age = soup.select_one('.XcVN5d').text
        except:
            age = None

        # because we have a csv.DictWriter() we're converting to the required format:
        # dict() keys should be exactly the same as fieldnames, otherwise it will throw an error
        celeb_data.append({
            'Celebrity name': name,
            'Age': age
        })

    # iterating over the celebrity data list() of dict()s and writing it to the .csv
    for data in celeb_data:
        writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''
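Keep in mind that Google's CSS class names (such as .XcVN5d above) are auto-generated and change frequently, so the selector can silently stop matching. A minimal sketch, assuming the class names mentioned in the question, that tries several candidate selectors before falling back to None:

# candidate selectors, ordered from most to least specific (class names assumed from the question)
candidate_selectors = ['.Z0LcW.XcVN5d.AZCkJd', '.Z0LcW.XcVN5d', '.XcVN5d']

age = None
for selector in candidate_selectors:
    tag = soup.select_one(selector)
    if tag is not None:
        age = tag.text
        break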
Alternatively, you can achieve the same thing with the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main differences in this particular case are that you get JSON output and a faster response (check the online IDE to test it). Code to integrate:
from serpapi import GoogleSearch
import os, csv

celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('serpapi_solution.csv', mode='w') as csv_file:
    fieldnames = ['Celebrity name', 'Age']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    celeb_data = []

    for celeb in celeb_list:
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google",
            "q": f"{celeb} age",
            "google_domain": "google.com",
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        name = celeb

        try:
            age = results['answer_box']['answer']
        except:
            age = None

        print(age)

        celeb_data.append({
            'Celebrity name': name,
            'Age': age
        })

    for data in celeb_data:
        writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''
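Since the code reads the key with os.getenv("API_KEY"), the API_KEY environment variable has to be set before running it. The try/except around the answer-box lookup can also be replaced with chained dict .get() calls; a minimal sketch, assuming the response layout used above:

# .get() returns None instead of raising KeyError when the answer box is missing
age = results.get('answer_box', {}).get('answer')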
Disclaimer: I work for SerpApi.
Alternatively, you can also use something like the following:
import bs4
import requests

result = bs4.BeautifulSoup(requests.get('https://www.nhc.noaa.gov/gis/').content, features='html.parser')
for link in result.find('table').find_all('a'):
    print(link.attrs['href'])
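Note that link.attrs['href'] raises a KeyError for anchor tags without an href attribute. A minimal defensive variant of the same loop, filtering on the attribute directly:

for link in result.find('table').find_all('a', href=True):
    print(link['href'])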
age = {}
for iid, html in celeb_df[['celeb_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['Z0LcW XcVN5d AZCkJd']}):
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d AZCkJd"]})[0]:
            print(iid, year)
            age[iid] = year
    else:
        print(iid, 'NaN')
        age[iid] = 'NaN'
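If you then want the ages back in your CSV, a minimal sketch, assuming celeb_df is a pandas DataFrame as in the question (the new column and output file names here are illustrative), maps the dictionary onto a new column:

import pandas as pd

# map each celeb_id to its scraped age; ids absent from the dict become NaN
celeb_df['age'] = celeb_df['celeb_id'].map(age)
celeb_df.to_csv('celebs_with_age.csv', index=False)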