Is there a specific way to read only the URLs from a JSON file with Python 3?
I'm trying to collect the links from my Google history and dump each page's content into a .txt file. The code works when I create a JSON file containing only the URLs, but with the source data shown below I get the error mentioned. I suspect this is because of the quotation marks in the source data, but how can I make it read only the URL part?
Source data:
{
"Browser History": [
{
"favicon_url": "https://www.google.com/favicon.ico",
"page_transition": "LINK",
"title": "Google Datenexport",
"url": "https://takeout.google.com/",
"client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
"time_usec": 1607693084782187
},
{
"favicon_url": "https://support.google.com/favicon.ico",
"page_transition": "LINK",
"title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
"url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
"client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
"time_usec": 1607693036534748
},
{
"favicon_url": "https://www.google.com/favicon.ico",
"page_transition": "LINK",
"title": "Google \u2013 Meine Aktivitäten",
"url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
"client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
"time_usec": 1607693013403569
},
{
"favicon_url": "https://www.com-magazin.de/favicon.ico",
"page_transition": "LINK",
"title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
"url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
"client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
"time_usec": 1607692994577620
    }
  ]
}
The code I'm currently using:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import json

def getHtml(url):
    global response
    ua = UserAgent()
    {'user-agent': ua.random}
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(e)
    return response.content

with open('urls.json', 'r') as history:
    json_data = history.read()
    data = json.loads(json_data)
    for block in data:
        print("scraping " + block["url"] + "...")
        html = getHtml(json_data)
        soup = BeautifulSoup(markup, "html5lib")
        text = soup.find_all(text=True)
        output = ''
        blacklist = [
            "style",
            "url",
            "404",
            "ngnix",
            "url"
        ]
        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)
        with open("{}.txt".format(i), "w") as out_fd:
            out_fd.write(output)
If your source data looked like this,
[{
"page_transition": "LINK",
"title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
"url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
"client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
"time_usec": 1607593733981438
},{
"page_transition": "LINK",
"title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
"url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
"client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
"time_usec": 1607593733981438
}, {
"page_transition": "LINK",
"title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
"url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
"client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
"time_usec": 1607593733981438
}]
you could iterate over it directly. But if I'm reading your question correctly, it's best to parse your source data directly as JSON and get the URLs via the 'url' key.
with open('history.json', 'r') as history:
    json_data = history.read()
data = json.loads(json_data)
for k, v in data.items():  # because now your source data is a dictionary
    for block in v:  # because v is the list of text blocks
        print("scraping " + block["url"] + "...")
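To illustrate the idea, here is a minimal self-contained sketch. The inline sample mirrors the shape of the Takeout export from the question (a top-level dict whose "Browser History" key holds a list of records); the field values are abbreviated for illustration:

```python
import json

# Hypothetical sample shaped like the Google Takeout "Browser History"
# export from the question; values are illustrative, not real history.
sample = """
{
  "Browser History": [
    {"title": "Google Datenexport", "url": "https://takeout.google.com/"},
    {"title": "Hilfe", "url": "https://support.google.com/"}
  ]
}
"""

data = json.loads(sample)

# The top level is a dict; "Browser History" maps to the list of visit
# records, and each record exposes the link under its "url" key.
urls = [entry["url"] for entry in data["Browser History"]]
print(urls)
```

Because `json.loads` handles all quoting and escaping (including sequences like `\u003d`) for you, there is no need to worry about quotation marks inside the data; you only index into the parsed structure.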