Pythonic way to scrape Google search answer-box results with Beautiful Soup

Question

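I have a CSV with the following columns: celebrity_name, url, and raw_html.

raw_html is the HTML of the Google results page you get when searching for celebrity_name + age. For example, the raw HTML corresponds to a search like this:

'https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=' + 'Jennifer' + '+' + 'Lopez' + '+' + 'age'

which leads to this Google page: https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=Jennifer+Lopez+age

I want to store the age associated with each celebrity in my CSV. The problem arises when a celebrity has no Google answer box associated with them.

I get the error: list index out of range

In that case, I would like to store the age as a NaN value.

My code: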

age = {}
for iid, html in celeb_df[['celeb_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['HwtpBd gsrt PZPZlf']}) != None:
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d", "Z0LcW XcVN5d AZCkJd"]})[0]: 
            print(iid, year)
            age[iid] = year

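What I want is a dictionary whose keys are the celebrity ids and whose values are the ages associated with those celebrities. If a celebrity has no Google answer box associated with them, I want to store the value as "NaN".

What is the best way to do this?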

Answer 1

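To fall back to a missing value when no age is found (None here; swap in float('nan') if you need an actual NaN), you can wrap the lookup in a try/except block, like so:

try:
  age = soup.find('div', class_='class_name').text
except AttributeError:  # find() returned None, so there is no answer box
  age = None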
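
For context on why the original code fails, here is a minimal sketch. The HTML snippet is made up for illustration, and `XcVN5d` is simply the class name used in this thread, which may change whenever Google changes its markup. `find_all()` returns an empty list rather than None, so a `!= None` guard always passes and indexing with `[0]` raises the error from the question. `select_one()` returns None instead, which can be checked without try/except:

```python
from bs4 import BeautifulSoup

# Hypothetical page without an answer box
soup = BeautifulSoup('<div class="other">no answer box</div>', "html.parser")

matches = soup.find_all("div", class_="XcVN5d")
print(matches)           # [] : an empty list, never None
print(matches != None)   # True, so a "!= None" guard always passes

try:
    matches[0]
except IndexError as err:
    print(err)           # list index out of range

# select_one() returns None when nothing matches, so this falls back cleanly
node = soup.select_one(".XcVN5d")
age = node.text if node is not None else None
print(age)               # None
```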

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, csv

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# celebrity list
celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('awesome_file.csv', mode='w', newline='') as csv_file:
  # defining column names
  fieldnames = ['Celebrity name', 'Age']
  # defining .csv writer
  # https://docs.python.org/3/library/csv.html#csv.DictWriter
  writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
  # writing (creating) columns
  writer.writeheader()

  # collecting scraped data
  celeb_data = []
  
  # iterating over celebrity names list()
  for celeb in celeb_list:
    html = requests.get(f'https://www.google.com/search?q={celeb} age', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    name = celeb
    try:
      age = soup.select_one('.XcVN5d').text
    except AttributeError:
      # select_one() returned None: no answer box for this celebrity
      age = None
    # csv.DictWriter() expects a dict for each row
    # its keys must match fieldnames exactly, otherwise writerow() raises an error
    celeb_data.append({
      'Celebrity name': name,
      'Age': age
    })
  # iterating over celebrity data list() that became dict() and writing it to the .csv
  for data in celeb_data:
    writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''
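
The scraped value is a string such as "51 years". If a numeric column is preferable, a small sketch like the following could convert it. The `parse_age` helper is hypothetical (it is not part of any library used here) and assumes the age is the first run of digits in the text:

```python
import math
import re

def parse_age(text):
    """Return the first integer found in text as a float, or NaN if there is none."""
    if not text:
        return math.nan
    match = re.search(r"\d+", text)
    return float(match.group()) if match else math.nan

print(parse_age("51 years"))  # 51.0
print(parse_age(None))        # nan
```

Returning a float keeps the column numeric even when some rows are NaN.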

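Alternatively, you can do the same thing with the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Essentially, the main differences in this particular case are:

You don't have to figure out how to grab certain elements from the page; that is already done for the end user.
With JSON output you get a faster response (see the online IDE to test this).
You don't have to maintain the parser if something in the HTML page (CSS selectors, elements, or anything else) changes.

Code to integrate: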

from serpapi import GoogleSearch
import os, csv

celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('serpapi_solution.csv', mode='w', newline='') as csv_file:
  fieldnames = ['Celebrity name', 'Age']
  writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
  writer.writeheader()

  celeb_data = []

  for celeb in celeb_list:
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": f"{celeb} age",
      "google_domain": "google.com",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    name = celeb
    try:
      age = results['answer_box']['answer']
    except KeyError:
      # no answer box (or no 'answer' field) in the response
      age = None
    print(age)

    celeb_data.append({
      'Celebrity name': name,
      'Age': age
    })

  for data in celeb_data:
    writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''

Disclaimer: I work for SerpApi.

Answer 2

Alternatively, you could also use something like the following:

import bs4
import requests

result = bs4.BeautifulSoup(requests.get('https://www.nhc.noaa.gov/gis/').content, features='html.parser')
for link in result.find('table').find_all('a'):
    print(link.attrs['href'])

Answer 3

age = {}
for iid, html in celeb_df[['institution_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['Z0LcW XcVN5d AZCkJd']}):
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d AZCkJd"]})[0]: 
            print(iid, year)
            age[iid] = year
    else:
        print(iid, 'NaN')
        age[iid] = 'NaN'
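
The same idea can be sketched as a small helper that avoids indexing into `find_all()` altogether. This is only an illustration under stated assumptions: `build_age_dict` is a hypothetical name, the rows are assumed to hold raw HTML strings (if they already hold BeautifulSoup objects, skip the parsing step), and the `XcVN5d` class is the one used in this thread and may change on Google's side. It stores a real `math.nan` rather than the string 'NaN':

```python
import math
from bs4 import BeautifulSoup

def build_age_dict(rows, css_class="XcVN5d"):
    """Map each id to its answer-box text, or NaN when there is no answer box."""
    ages = {}
    for iid, raw_html in rows:
        soup = BeautifulSoup(raw_html, "html.parser")
        # find() returns None instead of raising when nothing matches
        box = soup.find("div", class_=css_class)
        ages[iid] = box.get_text(strip=True) if box is not None else math.nan
    return ages

# Made-up rows standing in for (id, raw_html) pairs from the CSV
rows = [
    (1, '<div class="Z0LcW XcVN5d AZCkJd">51 years</div>'),
    (2, '<p>No answer box on this page</p>'),
]
ages = build_age_dict(rows)
print(ages)  # {1: '51 years', 2: nan}
```

With a DataFrame, the rows could come from something like `zip(celeb_df['celeb_id'], celeb_df['raw_html'])`, assuming those are the column names.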

3 answers

Answer 1: score 2, 2021-06-12 16:41:19
Answer 2: score 1
Answer 3: score 0, accepted, 2020-07-08 03:58:20
