
Is there a specific way to read only the URLs from a JSON file with Python 3?

I am trying to collect the links from my Google history and write their page content to .txt files. All of the code works (when I create a JSON that contains only the URLs), but with the links in the source example below I get the error mentioned below. I suspect this is because of the " characters in the source data, but how can I get it to read only the URL part?

Source data:

{
    "Browser History": [
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google Datenexport",
            "url": "https://takeout.google.com/",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693084782187
        },
        {
            "favicon_url": "https://support.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
            "url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693036534748
        },
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google \u2013 Meine Aktivitäten",
            "url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693013403569
        },
        {
            "favicon_url": "https://www.com-magazin.de/favicon.ico",
            "page_transition": "LINK",
            "title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
            "url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607692994577620
        }
    ]
}

The code I am currently using:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import json


def getHtml(url):
    global response
    ua = UserAgent()
    {'user-agent': ua.random}
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(e)
    return response.content


with open('urls.json', 'r') as history:
    json_data = history.read()
    data = json.loads(json_data)

    for block in data:
        print("scraping " + block["url"] + "...")
        html = getHtml(json_data)
        soup = BeautifulSoup(markup, "html5lib")
        text = soup.find_all(text=True)

        output = ''

        blacklist = [
            "style",
            "url",
            "404",
            "ngnix",
            "url"

        ]

        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)

        with open("{}.txt".format(i), "w") as out_fd:
            out_fd.write(output)


If your source data looks like this, a flat list of entries,

[{
            "page_transition": "LINK",
            "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
            "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
            "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
            "time_usec": 1607593733981438
},{
            "page_transition": "LINK",
            "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
            "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
            "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
            "time_usec": 1607593733981438
}, {
            "page_transition": "LINK",
            "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
            "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
            "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
            "time_usec": 1607593733981438
}]
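
then the top-level value is already a list, and the loop from your question can read each entry's 'url' key directly. A minimal sketch, assuming that list is saved as history.json:

import json

with open('history.json', 'r') as history:
    data = json.load(history)   # the top-level value is a list of entries

for block in data:              # each block is one history entry (a dict)
    print(block["url"])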

If I am reading your question correctly, though, it is best to parse your source data directly as JSON and pull the URLs out with the 'url' key. Your actual Takeout export wraps that list in a "Browser History" dictionary, so the loop becomes:

with open ('history.json','r') as history:
    json_data = history.read()
    data = json.loads(json_data)

    for k, v in data.items():  # because now your source data is a dictionary
        for block in v:        # because v is the list of text blocks
            print("scraping " + block["url"] + "...")
