Extracting data from Airtasker using Beautiful Soup

Question

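I am trying to extract data from this site: https://www.airtasker.com/users/brad-n-11346775/

So far I have managed to extract everything except the license number. The problem is odd, because the license number appears on the page as plain text. I can extract everything else, such as the name and address. For example, to extract the name I just did this: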

name.append(pro.find('div', class_= 'name').text)

It worked fine.

This is what I tried for the license number, but the output I get is None:

license_number.append(pro.find('div', class_= 'sub-text'))

When I do:

license_number.append(pro.find('div', class_= 'sub-text').text)

it gives me the following error:

AttributeError: 'NoneType' object has no attribute 'text'

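This seems to mean that it does not recognize the license number as text, even though it is text.

Can someone give me a workable solution and tell me what I am doing wrong? Regards,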

Answer 1

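The badge with the license number is added to the HTML dynamically, from bootstrap JSON that sits in one of the <script> tags. It is not in the static HTML that requests downloads, which is why find() returns None.

You can find the tag with bs4, dig the data out with a regex, and parse it with json.

Here's how: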
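To illustrate why the question's `find` call comes back empty, here is a minimal sketch using a made-up HTML snippet (not the real Airtasker markup; the class names are taken from the question, the rest is assumed). It shows `find` returning `None` for an element that is missing from the static HTML, and one way to guard against the `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: the name is present, the licence badge is not,
# because in this sketch it would only be rendered later by JavaScript.
SAMPLE_HTML = """
<div class="pro">
  <div class="name">Example Name</div>
  <script>/* badge data lives here and is rendered client-side */</script>
</div>
"""

pro = BeautifulSoup(SAMPLE_HTML, "html.parser")

name_tag = pro.find("div", class_="name")
badge_tag = pro.find("div", class_="sub-text")  # absent from the static HTML

print(name_tag.text)
print(badge_tag)  # find() returns None when nothing matches

# Check for None before reading .text to avoid the AttributeError.
license_text = badge_tag.text if badge_tag is not None else None
```

The guard only avoids the crash; getting the actual value still requires reading the data out of the script tag.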

import ast
import json
import re

import requests
from bs4 import BeautifulSoup

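page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
# The bootstrap data was in the fourth-from-last <script> tag when this was written;
# the index may change if the page layout changes.
scripts = BeautifulSoup(page, "lxml").find_all("script")[-4]
# Capture the string literal passed to parse(...), unescape it, then parse it as JSON.
bootstrap_JSON = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", scripts.string).group(1))
)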
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])

Output:

Licence No. 28661
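The positional index `[-4]` depends on how many script tags the page happens to contain. A possibly more robust variant is to scan every script tag for the `JSON.parse(...)` call instead. The sketch below runs against a made-up page: the variable name `window.__BOOTSTRAP__`, the sample markup, and the helper `extract_bootstrap_json` are assumptions for illustration, and the real page may differ.

```python
import ast
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page; the real Airtasker markup may differ.
SAMPLE_HTML = r'''
<html><body>
<script>var unrelated = 1;</script>
<script>window.__BOOTSTRAP__ = JSON.parse("{\"profile\": {\"badges\": {\"electrical_vic\": {\"reference_code\": \"Licence No. 00000\"}}}}");</script>
<script>var alsoUnrelated = 2;</script>
</body></html>
'''

# Captures the quoted string literal passed to JSON.parse(...).
PARSE_CALL = re.compile(r'JSON\.parse\((".*")\)', re.DOTALL)


def extract_bootstrap_json(html):
    """Return the first JSON.parse(...) payload found in a <script> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # script.string is None for tags without inline text (e.g. src=...).
        match = PARSE_CALL.search(script.string or "")
        if match:
            # group(1) is a quoted string literal: unescape it, then parse the JSON.
            return json.loads(ast.literal_eval(match.group(1)))
    return None


data = extract_bootstrap_json(SAMPLE_HTML)
badge = data["profile"]["badges"]["electrical_vic"]
print(badge["reference_code"])
```

Returning `None` when no script matches lets the caller decide how to handle a page without the bootstrap data, rather than failing on an index or attribute error.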

Solution 1
2 · Accepted · 2021-05-16 13:31:05
