How do I scrape a table with images and export it to Excel in Python?

Question

I am trying to scrape a table from a URL.

I can scrape the table data with the Scrapestorm tool, but I am new to Python and cannot get the data from this URL myself.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://pantheon.world/explore/rankings?show=people&years=-3501,2020')
soup = BeautifulSoup(page.text, 'html.parser')

Desired output in Excel:

[image]

Is it possible to scrape the table data together with the images from the web page?

Answer 1

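Yes, it is possible. However, because this particular page's DOM is populated asynchronously with JavaScript, BeautifulSoup will not be able to see the data you are trying to scrape. Normally this is where most people would suggest a headless browser or web driver, such as Selenium or Playwright, to simulate a browsing session, but you are in luck. For this particular page you do not need a headless browser, Scrapestorm, or BeautifulSoup; you only need the third-party requests module. When you visit this page, it happens to make an HTTP GET request to a REST API that serves JSON. The JSON response contains all the information in the table. If you log your browser's network traffic, you can see the request to the API:

This is what the response JSON looks like, a list of dictionaries:

From there, you can copy the API URL and the relevant query string parameters to formulate your own request to that API: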

def main():

    import requests

    url = "https://api.pantheon.world/person_ranks"

    params = {
        "select": "name,l,l_,age,non_en_page_views,coefficient_of_variation,hpi,hpi_prev,id,slug,gender,birthyear,deathyear,bplace_country(id,country,continent,slug),bplace_geonameid(id,place,country,slug,lat,lon),dplace_country(id,country,slug),dplace_geonameid(id,place,country,slug),occupation_id:occupation,occupation(id,occupation,occupation_slug,industry,domain),rank,rank_prev,rank_delta",
        # A dict keeps only the last of two identical keys, so both bounds go
        # in one list; requests sends them as two birthyear parameters.
        "birthyear": ["gte.-3501", "lte.2020"],
        "hpi": "gte.0",
        "order": "hpi.desc.nullslast",
        "limit": "50",
        "offset": "0"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    for person in response.json():
        print(f"{person['name']} was a {person['occupation']['occupation']}")
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Muhammad was a RELIGIOUS FIGURE
Genghis Khan was a MILITARY PERSONNEL
Leonardo da Vinci was a INVENTOR
Isaac Newton was a PHYSICIST
Ludwig van Beethoven was a COMPOSER
Alexander the Great was a MILITARY PERSONNEL
Aristotle was a PHILOSOPHER
...

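From here, writing this information to a CSV or Excel file is straightforward. To retrieve information for different people, adjust the "limit": "50" and "offset": "0" key-value pairs in the params query string dictionary.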
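As a rough sketch of that last step, the snippet below pages through the API with limit and offset and writes a few columns to a CSV file using the standard csv module. The helper names (page_params, flatten_person, write_people_csv, fetch_people), the choice of columns, and the handling of a missing occupation are assumptions for illustration; only the field names come from the select string above.

```python
import csv


def page_params(page, page_size=50):
    """Return the limit/offset pair for a zero-based page number."""
    return {"limit": str(page_size), "offset": str(page * page_size)}


def flatten_person(person):
    """Pick a few flat columns out of one API record (column choice is arbitrary)."""
    occupation = person.get("occupation") or {}  # assumed: may be null for some rows
    return {
        "rank": person.get("rank"),
        "name": person.get("name"),
        "occupation": occupation.get("occupation"),
        "birthyear": person.get("birthyear"),
        "hpi": person.get("hpi"),
    }


def write_people_csv(people, path):
    """Write an iterable of API records to a CSV file and return the row count."""
    rows = [flatten_person(p) for p in people]
    fieldnames = ["rank", "name", "occupation", "birthyear", "hpi"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def fetch_people(url, base_params, pages=2, page_size=50):
    """Request several pages in turn; needs the third-party requests module."""
    import requests  # imported here so the helpers above work without it

    people = []
    for page in range(pages):
        # Override limit/offset for this page, keep every other parameter.
        params = {**base_params, **page_params(page, page_size)}
        response = requests.get(url, params=params)
        response.raise_for_status()
        people.extend(response.json())
    return people
```

With the url and params from the script above, write_people_csv(fetch_people(url, params), "people.csv") would write the first two pages to people.csv (the filename is just an example).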

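Edit: to get the thumbnail for each person, you need to construct a URL of the following form:

https://pantheon.world/images/profile/people/{PERSON_ID}.jpg

where {PERSON_ID} is the value associated with the given person's id key: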

...

for person in response.json():
    image_url = f"https://pantheon.world/images/profile/people/{person['id']}.jpg"
    print(f"{person['name']}'s image URL: {image_url}")

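If you use openpyxl for the Excel file, there is a useful answer that shows how to insert an image into a cell given an image URL. However, I recommend using requests rather than urllib3 to make the request for the image.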
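A minimal sketch of that approach, assuming openpyxl and Pillow are installed and that every thumbnail is served at the URL pattern above. The function names, the column layout, the row-height adjustment, and the output path are hypothetical, and the answer referred to above may do this differently.

```python
from io import BytesIO

IMAGE_URL = "https://pantheon.world/images/profile/people/{person_id}.jpg"


def image_url_for(person):
    """Build the thumbnail URL from a record's id key."""
    return IMAGE_URL.format(person_id=person["id"])


def write_people_xlsx(people, path):
    """Write thumbnail, name and occupation for each record to an .xlsx file."""
    # Third-party imports: requests and openpyxl (which needs Pillow for images).
    import requests
    from openpyxl import Workbook
    from openpyxl.drawing.image import Image

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["Image", "Name", "Occupation"])

    for row, person in enumerate(people, start=2):
        occupation = person.get("occupation") or {}  # assumed: may be null
        sheet.cell(row=row, column=2, value=person["name"])
        sheet.cell(row=row, column=3, value=occupation.get("occupation"))

        response = requests.get(image_url_for(person))
        if not response.ok:
            continue  # leave the image column empty if the thumbnail is missing
        # The BytesIO buffer must stay open until the workbook is saved.
        image = Image(BytesIO(response.content))
        sheet.add_image(image, f"A{row}")  # anchor the picture at column A
        # Row heights are in points; 0.75 pt per pixel is an approximation.
        sheet.row_dimensions[row].height = image.height * 0.75

    workbook.save(path)
```

Note that openpyxl anchors a picture to a cell rather than storing it inside the cell, so the row height is enlarged here to keep the thumbnails from overlapping.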

Answer 1 (3, accepted, 2020-12-23 10:47:33)
