
Pythonic way to scrape Google search answer box results using Beautiful Soup

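I have a CSV with the following columns: celebrity name, url, raw_html.

raw_html is the HTML of the Google results page returned when you search for the celebrity's name + age. For example, the raw HTML corresponds to the following search: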

'https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=' + 'Jennifer' + '+' + 'Lopez' + '+' + 'age'

which leads to this Google page: https://www.google.com/search?source=hp&ei=DHMDX4XDLIqstQb1grfAAQ&q=Jennifer+Lopez+age
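The string concatenation above can also be done with the standard library, which handles spaces and special characters in names. This is a minimal sketch: `build_search_url` is a hypothetical helper name, and it assumes only the `q` parameter matters (the `source` and `ei` parameters from the URL above are left out).

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def build_search_url(celebrity_name: str) -> str:
    """Return the Google search URL for '<celebrity_name> age'."""
    # urlencode() uses quote_plus by default, so spaces become '+'
    query = urlencode({"q": f"{celebrity_name} age"})
    return f"{GOOGLE_SEARCH}?{query}"


print(build_search_url("Jennifer Lopez"))
# https://www.google.com/search?q=Jennifer+Lopez+age
```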

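I want to store the age associated with each celebrity in my CSV. The problem I run into is when a celebrity has no Google answer box associated with them.

I get the error: list index out of range

In that case, I want to store the age as NaN.

My code: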

age = {}
for iid, html in celeb_df[['celeb_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['HwtpBd gsrt PZPZlf']}) != None:
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d", "Z0LcW XcVN5d AZCkJd"]})[0]: 
            print(iid, year)
            age[iid] = year

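What I want is a dictionary whose keys are the celebrity ids and whose values are the ages associated with those celebrities. If a celebrity has no Google answer box associated with them, I want to store the value as "NaN".

What is the best way to do this?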

To return the age as a missing value (None here; substitute NaN if you prefer), you can wrap the lookup in a try/except block, like this:

try:
  age = soup.find('div', class_='class_name').text
except AttributeError:
  # find() returned None because no matching element exists
  age = None
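The same fallback can be written with an explicit None check instead of try/except. The sketch below is only an illustration: `extract_age` is a hypothetical helper, the HTML snippets stand in for the raw_html column, and the `.XcVN5d` class is the one used in this thread, which Google may change at any time.

```python
import math

from bs4 import BeautifulSoup


def extract_age(raw_html: str):
    """Return the answer-box text, or NaN when the page has no answer box."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # select_one() returns None when nothing matches, so check before reading the text
    box = soup.select_one(".XcVN5d")
    return box.get_text(strip=True) if box is not None else math.nan


# Hypothetical stand-ins for (celeb_id, raw_html) pairs
pages = {
    1: '<div class="Z0LcW XcVN5d">51 years</div>',
    2: "<p>no answer box here</p>",
}
age = {celeb_id: extract_age(html) for celeb_id, html in pages.items()}
print(age)
```

Because the missing case is handled inside the helper, the dictionary is always filled with one entry per celebrity id.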

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, csv

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# celebrity list
celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('awesome_file.csv', mode='w', newline='') as csv_file:
  # defining column names
  fieldnames = ['Celebrity name', 'Age']
  # defining .csv writer
  # https://docs.python.org/3/library/csv.html#csv.DictWriter
  writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
  # writing (creating) columns
  writer.writeheader()

  # collecting scraped data
  celeb_data = []
  
  # iterating over celebrity names list()
  for celeb in celeb_list:
    # passing the query via params lets requests URL-encode the space in the name
    html = requests.get('https://www.google.com/search', params={'q': f'{celeb} age'}, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    name = celeb
    try:
      age = soup.select_one('.XcVN5d').text
    except AttributeError:
      # select_one() returns None when there is no answer box
      age = None
    # csv.DictWriter() expects one dict per row;
    # any key not listed in fieldnames makes writerow() raise ValueError
    celeb_data.append({
      'Celebrity name': name,
      'Age': age
    })
  # writing each collected row (a dict) to the .csv
  for data in celeb_data:
    writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''
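The output above stores the age as text such as "51 years". If a numeric column is preferred, the text can be converted after scraping. This is a small sketch with a hypothetical helper, `parse_age`, that assumes the first run of digits in the text is the age.

```python
import math
import re


def parse_age(text):
    """Convert text such as '51 years' to an int; return NaN if no number is found."""
    if not text:
        # covers None (no answer box) and empty strings
        return math.nan
    match = re.search(r"\d+", text)
    return int(match.group()) if match else math.nan


print(parse_age("51 years"))  # 51
print(parse_age(None))        # nan
```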

Alternatively, you can do the same thing with the Google Direct Answer Box API from SerpApi. It is a paid API with a free trial of 5,000 searches.

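Essentially, the main differences in this particular case are:

  • You don't have to figure out how to extract particular elements from the page; that is already done for the end user.
  • With JSON output you get a faster response (see the online IDE to test it).
  • You don't have to maintain the parser if something on the HTML page (CSS selectors, elements, or anything else) changes.

Code to integrate: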

from serpapi import GoogleSearch
import os, csv

celeb_list = ['Jennifer Lopez', 'Chuck Norris', 'Jason Statham', 'Jackie Chan']

with open('serpapi_solution.csv', mode='w', newline='') as csv_file:
  fieldnames = ['Celebrity name', 'Age']
  writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
  writer.writeheader()

  celeb_data = []

  for celeb in celeb_list:
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": f"{celeb} age",
      "google_domain": "google.com",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    name = celeb
    try:
      age = results['answer_box']['answer']
    except KeyError:
      # the response has no answer box (or no 'answer' field) for this query
      age = None
    print(age)

    celeb_data.append({
      'Celebrity name': name,
      'Age': age
    })

  for data in celeb_data:
    writer.writerow(data)

# CSV output from created file:
'''
Celebrity name,Age
Jennifer Lopez,51 years
Chuck Norris,81 years
Jason Statham,53 years
Jackie Chan,67 years
'''
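The nested lookup in the code above can also be written without try/except by chaining `dict.get`. This is a sketch only: `answer_from_results` is a hypothetical helper, and the sample dictionaries are made-up shapes reduced to the `answer_box` and `answer` keys used above, not real API responses.

```python
def answer_from_results(results: dict):
    """Return results['answer_box']['answer'] if both keys exist, else None."""
    # "or {}" also covers the case where 'answer_box' is present but None
    return (results.get("answer_box") or {}).get("answer")


# Hypothetical response shapes for illustration
with_box = {"answer_box": {"answer": "51 years"}}
without_box = {"organic_results": []}

print(answer_from_results(with_box))     # 51 years
print(answer_from_results(without_box))  # None
```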

Disclaimer: I work for SerpApi.

You can also use the following:

import bs4
import requests

result = bs4.BeautifulSoup(requests.get('https://www.nhc.noaa.gov/gis/').content, features='html.parser')
for link in result.find('table').find_all('a'):
    print(link.attrs['href'])
age = {}
for iid, html in celeb_df[['celeb_id', 'raw_html']].values:
    if html.find_all('div', {'class' : ['Z0LcW XcVN5d AZCkJd']}):
        for year in html.find_all('div', {'class' : ["Z0LcW XcVN5d AZCkJd"]})[0]: 
            print(iid, year)
            age[iid] = year
    else:
        print(iid, 'NaN')
        age[iid] = 'NaN'
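Once the age dictionary from the loop above is filled, it can be written back to a CSV. The sketch below uses only the standard library; `write_ages` and the column names `celeb_id` and `age` are assumptions for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_ages(age: dict, csv_file) -> None:
    """Write one row per celebrity id to an already opened text file."""
    writer = csv.DictWriter(csv_file, fieldnames=["celeb_id", "age"])
    writer.writeheader()
    for celeb_id, value in age.items():
        writer.writerow({"celeb_id": celeb_id, "age": value})


# Hypothetical result of the loop above; 'NaN' marks a missing answer box
buffer = io.StringIO()
write_ages({1: "51 years", 2: "NaN"}, buffer)
print(buffer.getvalue())
```

With a real file, open it with `newline=''` as the csv module documentation recommends, then pass the file object to `write_ages`.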

