
Is there a specific way to read only the URLs from a JSON file with Python 3?

I'm trying to collect the links from my Google history and write their content to .txt files. All the code works when I create a JSON file that contains only URLs, but when the links appear as in the Source example below, I get the error mentioned later. I suspect it's because of the escaped characters in the source data, but how can I get it to read just the URL part?

Source Data:

{
    "Browser History": [
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google Datenexport",
            "url": "https://takeout.google.com/",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693084782187
},
        {
            "favicon_url": "https://support.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
            "url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693036534748
},
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google \u2013 Meine Aktivitäten",
            "url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693013403569
},
        {
            "favicon_url": "https://www.com-magazin.de/favicon.ico",
            "page_transition": "LINK",
            "title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
            "url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607692994577620
        }
    ]
}
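Note that the `\u003d` and `\u0026` sequences in the URLs above are ordinary JSON unicode escapes for `=` and `&`; Python's `json` module decodes them automatically, so they should not need any manual cleanup. A minimal sketch with a hypothetical one-entry document:

```python
import json

# hypothetical example; \u003d and \u0026 are JSON escapes for "=" and "&"
raw = '{"url": "https://takeout.google.com/?hl\\u003dde\\u0026rd\\u003d1"}'

# json.loads decodes the escapes, yielding a clean URL
url = json.loads(raw)["url"]
print(url)  # -> https://takeout.google.com/?hl=de&rd=1
```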
 
  

The code I use at the moment:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import json


def getHtml(url):
    ua = UserAgent()
    headers = {'user-agent': ua.random}  # send a random user agent with the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(e)
        return b''
    return response.content


with open('urls.json', 'r') as history:
    json_data = history.read()
    data = json.loads(json_data)

    for i, block in enumerate(data):
        print("scraping " + block["url"] + "...")
        html = getHtml(block["url"])
        soup = BeautifulSoup(html, "html5lib")
        text = soup.find_all(text=True)

        output = ''

        blacklist = [
            "style",
            "url",
            "404",
            "nginx"
        ]

        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)

        with open("{}.txt".format(i), "w") as out_fd:
            out_fd.write(output)
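For reference, iterating a Python dict directly yields its top-level keys, not the entries inside it, which is the likely source of the error here. A sketch with a cut-down version of the question's structure:

```python
# the question's JSON is a dict, so iterating it yields the top-level keys
data = {"Browser History": [{"url": "https://takeout.google.com/"}]}

keys = [block for block in data]
print(keys)  # -> ['Browser History']

# block["url"] on a key string raises TypeError: string indices must be integers;
# index the dict first to reach the list of entries
entries = data["Browser History"]
print(entries[0]["url"])  # -> https://takeout.google.com/
```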


If your Source Data looks something like this,

{
    "Browser History": [{
            "page_transition": "LINK",
            "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
            "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
            "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
            "time_usec": 1607593733981438
    }, {
            "page_transition": "LINK",
            "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
            "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
            "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
            "time_usec": 1607593733981438
    }]
}

and if I'm reading your question correctly, you're better off parsing your Source Data as JSON directly and grabbing the URL with the 'url' key.

import json

with open('history.json', 'r') as history:
    data = json.load(history)

    for k, v in data.items():  # because your Source Data is a dictionary
        for block in v:        # because v is the list of history entries
            print("scraping " + block["url"] + "...")
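If you only need the URLs themselves, a list comprehension collects them in one pass. A sketch with an inline sample shaped like the export above (the titles and URLs are stand-ins):

```python
import json

# inline sample shaped like the Takeout export (values are hypothetical)
raw = '''{"Browser History": [
    {"title": "Google Datenexport", "url": "https://takeout.google.com/"},
    {"title": "Hilfe", "url": "https://support.google.com/accounts/answer/3024190?hl\\u003dde"}
]}'''

data = json.loads(raw)

# pull just the "url" field out of every history entry
urls = [entry["url"] for entry in data["Browser History"]]
print(urls)
```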
