Extracting data from Airtasker using Beautiful Soup

I am trying to extract data from this website: https://www.airtasker.com/users/brad-n-11346775/

So far, I have managed to extract everything except the license number. The problem is bizarre, because the license number is plain text on the page. I was able to extract everything else, such as the name and address. For example, to extract the name, I just did this:

name.append(pro.find('div', class_= 'name').text)

And it works just fine.

This is what I tried for the license number, but the output is None:

license_number.append(pro.find('div', class_= 'sub-text'))

When I do:

license_number.append(pro.find('div', class_= 'sub-text').text) 

It gives me the following error:

AttributeError: 'NoneType' object has no attribute 'text'

I take this to mean that it does not recognise the license number as text, even though it is text.

Can someone please give me a workable solution and tell me what I am doing wrong?

The badge with the license number is added to the HTML dynamically from a bootstrap JSON that sits in one of the <script> tags. The div is therefore not present in the HTML you download, which is why find returns None and calling .text on it raises the AttributeError.
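A minimal sketch of that effect, using made-up static HTML rather than the real Airtasker page. The markup, the "Example Name" value, and the text_or_none helper are assumptions for illustration only; the point is that find returns None for an element that is missing from the static HTML, and that a small guard avoids the AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for the downloaded page: the name is
# present, but there is no <div class="sub-text">, because in a browser that
# element would only be added later by JavaScript.
STATIC_HTML = """
<div class="profile">
  <div class="name">Example Name</div>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


pro = BeautifulSoup(STATIC_HTML, "html.parser")
name = text_or_none(pro, "div", "name")
license_number = text_or_none(pro, "div", "sub-text")

print(name)            # Example Name
print(license_number)  # None, because the div is not in the static HTML
```

The guard only keeps the scraper from crashing; it does not recover the license number, which has to be read from the script data instead.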

You can find the tag with bs4, pull out the data with a regex, and parse it with json.

Here's how:

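import ast
import json
import re

import requests
from bs4 import BeautifulSoup

# Download the raw HTML; no JavaScript is executed.
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
# The bootstrap data sat in the fourth-from-last <script> tag when this was written.
script = BeautifulSoup(page, "lxml").find_all("script")[-4]
# The regex captures the quoted string passed to parse(...); ast.literal_eval
# unquotes it, and json.loads turns the resulting JSON text into a dict.
bootstrap_json = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", script.string).group(1))
)
print(bootstrap_json["profile"]["badges"]["electrical_vic"]["reference_code"])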

Output:

Licence No. 28661
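Picking the script by position ([-4]) may stop working if the page adds or removes a <script> tag. A possible variation, sketched here on made-up HTML, is to select the script by its content instead. The sample markup, the window.__BOOTSTRAP__ name, and the extract_bootstrap_json helper are hypothetical, and the sketch assumes the payload is passed to JSON.parse(...) as a quoted string literal on a single line.

```python
import ast
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page: the payload is JSON text wrapped in a quoted JavaScript
# string literal and passed to JSON.parse(...).
SAMPLE_HTML = r"""
<html><body>
<script>var analytics = {};</script>
<script>window.__BOOTSTRAP__ = JSON.parse("{\"profile\": {\"badges\": {\"electrical_vic\": {\"reference_code\": \"Licence No. 28661\"}}}}");</script>
<script>console.log("done");</script>
</body></html>
"""

# Greedy match up to the last ")" on the same line as JSON.parse(.
PARSE_CALL = re.compile(r"JSON\.parse\((.*)\)")


def extract_bootstrap_json(html):
    """Return the dict from the first <script> that calls JSON.parse, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # script.string is None for empty or src-only script tags.
        match = PARSE_CALL.search(script.string or "")
        if match is None:
            continue
        # Step 1: unquote the string literal, giving plain JSON text.
        json_text = ast.literal_eval(match.group(1))
        # Step 2: parse the JSON text into Python objects.
        return json.loads(json_text)
    return None


data = extract_bootstrap_json(SAMPLE_HTML)
reference_code = data["profile"]["badges"]["electrical_vic"]["reference_code"]
print(reference_code)  # Licence No. 28661
```

With the real page you would pass the downloaded HTML to the helper and check for None before indexing, since the key names and the JSON.parse wrapper could change at any time.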
