
How to get information from nested tags with Python bs4?

I want to parse the GitHub trending page. Here is my code:

import requests
from bs4 import BeautifulSoup

url_github = "https://github.com/trending"


def request_github_trending(url):
    request = requests.get(url)
    return request


def extract(page):
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup.find_all('article', class_="Box-row")


def transform(html_repos):
    for repo in html_repos:
        stars = repo.find('a', class_="Link--muted d-inline-block mr-3")
        print(stars)
        break


print(transform(extract(request_github_trending(url_github))))

I want to parse the number of stars, and I got this result:

<a class="Link--muted d-inline-block mr-3" data-view-component="true" href="/rocketseat-education/nlw6-discover/stargazers">
<svg aria-label="star" class="octicon octicon-star" data-view-component="true" height="16" role="img" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z" fill-rule="evenodd"></path>
</svg>
        128
</a>
None

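How can I get only the number? I also tried to parse the repository name and the developer names, but I messed that up. I could not get the developer names at all, and for the repository name I could only get the part before the slash. I would appreciate any help!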

You are close. Let's say:

output = <a class="Link--muted d-inline-block mr-3" data-view-component="true" href="/rocketseat-education/nlw6-discover/stargazers">
<svg aria-label="star" class="octicon octicon-star" data-view-component="true" height="16" role="img" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z" fill-rule="evenodd"></path>
</svg>
        128
</a>

Just do this: output.text.strip() and you will get 128.
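As a minimal offline sketch of that suggestion: the snippet below is a simplified stand-in for the `<a>` tag shown above (the SVG path is left out), and `parse_count` is a hypothetical helper, not part of bs4, for turning a display string such as "1,531" into an integer.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the <a> tag from the question (SVG path omitted).
SNIPPET = """
<a class="Link--muted d-inline-block mr-3" href="/rocketseat-education/nlw6-discover/stargazers">
<svg aria-label="star" class="octicon octicon-star"></svg>
        128
</a>
"""


def parse_count(text):
    """Hypothetical helper: turn a display string such as '1,531' into an int."""
    return int(text.strip().replace(",", ""))


soup = BeautifulSoup(SNIPPET, "html.parser")
output = soup.find("a", class_="Link--muted d-inline-block mr-3")

# The <svg> element contributes no text, so only the number and whitespace remain.
stars_text = output.text.strip()
stars = parse_count(stars_text)
print(stars_text, stars)
```

Keeping the string and the integer separate is a design choice: the string matches what the page displays, while the integer is what you would sort or sum.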

Avoid this kind of nested function call: transform(extract(request_github_trending(url_github)))
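One way to read this advice is to give each intermediate result a name. It also explains the trailing None in the question's output: transform() only prints and has no return statement, so print(transform(...)) prints None. The sketch below reuses the question's function names but runs on `SAMPLE_PAGE`, a hypothetical stand-in for the downloaded HTML, so it works without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_PAGE = """
<article class="Box-row"><a class="Link--muted d-inline-block mr-3" href="/six-ddc/plow/stargazers"> 1,531 </a></article>
<article class="Box-row"><a class="Link--muted d-inline-block mr-3" href="/n8n-io/n8n/stargazers"> 15,781 </a></article>
"""


def extract(html):
    """Return one <article> tag per trending repository."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("article", class_="Box-row")


def transform(html_repos):
    """Return the star counts instead of printing them inside the loop."""
    stars = []
    for repo in html_repos:
        link = repo.find("a", class_="Link--muted d-inline-block mr-3")
        if link is not None:  # skip rows without a stargazers link
            stars.append(link.get_text(strip=True))
    return stars


# One named step per line: each intermediate value can be inspected on its own.
repos = extract(SAMPLE_PAGE)
stars = transform(repos)
print(stars)
```

Because transform() now returns a list, printing its result shows the data rather than None.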

To get the stars, you can use the .get_text() method. To get the repository, you can use the next_sibling attribute.

In this example, I show how to get all the information, including the repository, the stars, and the developer names ("Built by").

import requests
from bs4 import BeautifulSoup


url_github = "https://github.com/trending"


def request_github_trending(url):
    request = requests.get(url)
    return request


def extract(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup.find_all("article", class_="Box-row")


def print_info(html):
    fmt_string = "{:<60} {:<30} {}"
    print(fmt_string.format("Repo", "Stars", "Built by"))
    print("-" * 150)
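    for tag in html:
        # The owner ("six-ddc /") is in a tag with class "text-normal";
        # the repository name is the text node right after it.
        repository_info = tag.find(class_="text-normal")
        repository = repository_info.text.strip() + repository_info.next_sibling.strip()

        # get_text(strip=True) drops the whitespace around the star count.
        stars = tag.find(class_="Link--muted d-inline-block mr-3").get_text(strip=True)

        # Each "Built by" avatar is an <img> whose alt attribute holds the username.
        usernames = [user["alt"] for user in tag.find_all("img")]
        print(fmt_string.format(repository, stars, usernames))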


print_info(extract(request_github_trending(url_github)))

Output:

Repo                                                         Stars                          Built by
------------------------------------------------------------------------------------------------------------------------------------------------------
rocketseat-education /nlw6-discover                          129                            ['@jakeliny']
six-ddc /plow                                                1,531                          ['@six-ddc', '@chenrui333', '@dependabot', '@musinit']
flutter /flutter                                             123,023                        ['@engine-flutter-autoroll', '@abarth', '@Hixie', '@jonahwilliams', '@HansMuller']
n8n-io /n8n                                                  15,781                         ['@janober', '@RicardoE105', '@ivov', '@Rupenieks', '@krynble']
PaddlePaddle /PaddleClas                                     1,521                          ['@littletomatodonkey', '@weisy11', '@dyning', '@Intsigstephon', '@cuicheng01']
...
...
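The same lookups can be tried offline on a single row. In the sketch below, `ROW` is an assumed, heavily simplified version of one trending entry, so the real page may differ. The isinstance check is an added guard, since next_sibling can be None or another tag. Reading the name from the link's href is an alternative that the answer above does not use.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed, simplified markup for one trending row; the live page may differ.
ROW = """
<article class="Box-row">
  <h1><a href="/six-ddc/plow"><span class="text-normal">six-ddc /</span>
      plow</a></h1>
  <a class="Link--muted d-inline-block mr-3" href="/six-ddc/plow/stargazers"> 1,531 </a>
  <img alt="@six-ddc"><img alt="@chenrui333">
</article>
"""

soup = BeautifulSoup(ROW, "html.parser")
tag = soup.find("article", class_="Box-row")

# The owner is inside the <span>; the name is the text node that follows it.
owner_span = tag.find(class_="text-normal")
owner = owner_span.get_text(strip=True)
sibling = owner_span.next_sibling
name = sibling.strip() if isinstance(sibling, NavigableString) else ""
repository = owner + name

stars = tag.find(class_="Link--muted d-inline-block mr-3").get_text(strip=True)

# alt=True keeps only <img> tags that actually have an alt attribute.
usernames = [img["alt"] for img in tag.find_all("img", alt=True)]

# Alternative: take "owner/name" from the href of the enclosing link.
slug = owner_span.parent["href"].strip("/")

print(repository, stars, usernames, slug)
```

The href variant does not depend on how the text nodes are laid out, which may make it less sensitive to whitespace changes in the markup.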

