Exporting continuous data to a JSON file

Question

I wrote a script for web scraping, and it scrapes the data successfully. The only problem is exporting the data to a JSON file.

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url

    # JSON export mechanism
    json_job = json.dumps(job_dict)
    with open('data.json', 'r+') as f:
        f.write("[")
        f.seek(0)
        f.write(json_job)
        txt = f.readline()
        if txt.endswith("}"):
            f.write(",")

def crawl_web(url):
    while True:
        post_url = get_post_url(url)
        for urls in post_url:
            urls = urls
            scrape_post_info(urls)

# Execute the main function 'crawl_web'
if __name__ == '__main__':
    crawl_web('www.examp....com')

The data is exported to the file, but it is not valid JSON. I expect the data to look like this:

[
{
    "title": "this is title",
    "Description": " Fendi is an Italian luxury labelarin. ",
    "url": "https:/~"
},

{
    "title": " - Furrocious Elegant Style", 
    "Description": " the Italian luxare vast. ", 
    "url": "https://www.s"
},

{
    "title": "Rome, Fountains and Fendi Sunglasses",
    "Description": " Fendi started off as a store. ",
    "url": "https://www.~"
},

{
    "title": "Tipsnglasses",
    "Description": "Whether irregular orn season.", 
    "url": "https://www.sooic"
}

]

How can I achieve this?

Answer 1

How about:

import json


def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    return {"title": title, "Description": description, "url": post_url}


def crawl_web(url):
    while True:
        jobs = []
        post_urls = get_post_url(url)
        # Use a separate loop variable so the `url` argument is not overwritten.
        for post_url in post_urls:
            jobs.append(scrape_post_info(post_url))
            # Rewrite the whole file so it always holds one valid JSON array.
            with open("data.json", "w") as f:
                json.dump(jobs, f, indent=4)


# Execute the main function 'crawl_web'
if __name__ == "__main__":
    crawl_web("www.examp....com")

Note that this rewrites the entire file on every iteration over "post_urls", so with a large file and slow I/O it can become very slow.

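Depending on how long your job runs and how much memory you have, you may prefer to move the file write out of the for loop and write the file only once.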
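
A minimal sketch of the write-once variant, assuming all records fit in memory. The two helper functions below are hypothetical stand-ins for the question's `get_post_url` and `scrape_post_info`, included only so the example runs on its own; the `path` parameter is also an addition for illustration.

```python
import json


def get_post_url(url):
    # Hypothetical stand-in for the question's helper: returns post URLs.
    return [f"{url}/post/{i}" for i in range(3)]


def scrape_post_info(url):
    # Hypothetical stand-in: the real function would fetch and parse the page.
    return {"title": f"Post at {url}", "Description": "placeholder", "url": url}


def crawl_web(url, path="data.json"):
    # Collect every record in memory first.
    jobs = [scrape_post_info(post_url) for post_url in get_post_url(url)]
    # Write the complete list once; json.dump emits a single valid JSON array.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=4, ensure_ascii=False)
    return jobs
```

The trade-off is that nothing reaches disk until the crawl finishes, so a crash midway loses the records collected so far.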

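Note: if you really want to write a JSON stream, you may want to look at this package: https://pypi.org/project/jsonstreams/. However, I would suggest choosing another format, such as CSV, that is better suited to streaming writes.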
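
As a rough illustration of why CSV suits streaming writes, the sketch below appends one row per scraped post using the standard library's `csv.DictWriter`. The function name `append_row`, the `FIELDS` list, and the `data.csv` path are assumptions made for this example, not part of the original code.

```python
import csv
import os

# Column order taken from the keys used in the question's dictionaries.
FIELDS = ["title", "Description", "url"]


def append_row(job, path="data.csv"):
    # Write the header only when the file is new or empty.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        # Each call adds one line; earlier rows are never rewritten.
        writer.writerow(job)
```

Calling `append_row(scrape_post_info(post_url))` inside the loop would keep the file valid after every post, without rewriting what is already on disk.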

Answer 1 (0, accepted, 2019-07-27 08:11:59)
