Exporting continuously scraped data to a JSON file
I wrote a script for web scraping, and it scrapes the data successfully. The only problem is exporting the data to a JSON file.
import json

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url
    # JSON export mechanism
    json_job = json.dumps(job_dict)
    with open('data.json', 'r+') as f:
        f.write("[")
        f.seek(0)
        f.write(json_job)
        txt = f.readline()
        if txt.endswith("}"):
            f.write(",")

def crawl_web(url):
    while True:
        post_url = get_post_url(url)
        for urls in post_url:
            urls = urls
            scrape_post_info(urls)

# Execute the main function 'crawl_web'
if __name__ == '__main__':
    crawl_web('www.examp....com')
The data is exported to the file, but the result is not valid JSON. I expect the data to look like this:
[
    {
        "title": "this is title",
        "Description": " Fendi is an Italian luxury labelarin. ",
        "url": "https:/~"
    },
    {
        "title": " - Furrocious Elegant Style",
        "Description": " the Italian luxare vast. ",
        "url": "https://www.s"
    },
    {
        "title": "Rome, Fountains and Fendi Sunglasses",
        "Description": " Fendi started off as a store. ",
        "url": "https://www.~"
    },
    {
        "title": "Tipsnglasses",
        "Description": "Whether irregular orn season.",
        "url": "https://www.sooic"
    }
]
How can I achieve this?
How about this:
import json

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    return {"title": title, "Description": description, "url": post_url}

def crawl_web(url):
    while True:
        jobs = []
        post_urls = get_post_url(url)
        # Use a distinct loop variable so the 'url' argument is not overwritten.
        for post_url in post_urls:
            jobs.append(scrape_post_info(post_url))
            # Rewrite the whole file as one JSON array after each new record.
            with open("data.json", "w") as f:
                json.dump(jobs, f)

# Execute the main function 'crawl_web'
if __name__ == "__main__":
    crawl_web("www.examp....com")
Note that this rewrites the entire file on every iteration over "post_urls", so with large files and slow I/O it can become very slow.
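Depending on how long your job runs and how much memory you have, you may want to move the file write out of the for loop and write the file only once.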
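A minimal sketch of that write-once variant, assuming all records fit in memory. Here scrape_post_info is a hypothetical stand-in that returns placeholder data, and export_jobs is a name made up for this illustration; neither is the original scraper.

```python
import json
import os
import tempfile


def scrape_post_info(url):
    # Hypothetical stand-in for the real scraper: returns placeholder data.
    return {"title": f"title for {url}", "Description": "placeholder", "url": url}


def export_jobs(post_urls, path):
    # Collect every record in memory first ...
    jobs = [scrape_post_info(post_url) for post_url in post_urls]
    # ... then serialize the whole list with a single write.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2)
    return jobs


if __name__ == "__main__":
    out_path = os.path.join(tempfile.gettempdir(), "data.json")
    export_jobs(["https://example.com/a", "https://example.com/b"], out_path)
    with open(out_path, encoding="utf-8") as f:
        print(json.load(f))
```

Because json.dump serializes the complete list, the output is always a well-formed JSON array with no trailing comma. The trade-off is that nothing reaches disk until all URLs have been processed.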
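Note: if you really want to write JSON as a stream, you may want to look at this package: https://pypi.org/project/jsonstreams/. However, I would suggest choosing another format, such as CSV, which is much better suited to streaming writes.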
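As a rough illustration of the CSV suggestion, the sketch below appends one row per scraped record using the standard library csv module. The append_job helper and the FIELDNAMES constant are names invented for this example, and the column names simply mirror the keys used in the question.

```python
import csv
import os
import tempfile

# Column names mirror the dictionary keys used in the question.
FIELDNAMES = ["title", "Description", "url"]


def append_job(path, job):
    # Write the header only when the file is new or still empty.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if need_header:
            writer.writeheader()
        writer.writerow(job)


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "data.csv")
    append_job(out_path, {"title": "first", "Description": "a, b", "url": "u1"})
    append_job(out_path, {"title": "second", "Description": "c", "url": "u2"})
    with open(out_path, newline="", encoding="utf-8") as f:
        print(f.read())
```

Each call only appends to the file, so the cost per record stays constant however large the file grows, and the file is readable after every write.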