
Exporting continuous data to a JSON file

I wrote a script for web scraping and it scrapes data successfully. The only problem is with exporting the data to a JSON file:

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url

    # here: the JSON writing (manually managing brackets and commas)
    json_job = json.dumps(job_dict)
    with open('data.json', 'r+') as f:
        f.write("[")
        f.seek(0)
        f.write(json_job)
        txt = f.readline()
        if txt.endswith("}"):
            f.write(",")

def crawl_web(url):
    while True:
        post_url = get_post_url(url)
        for urls in post_url:
            scrape_post_info(urls)

# Execute the main function 'crawl_web'
if __name__ == '__main__':
    crawl_web('www.examp....com')

The data is exported to JSON, but it is not properly formatted JSON. I expect the data to look like this:

[
{
    "title": "this is title",
    "Description": " Fendi is an Italian luxury labelarin. ",
    "url": "https:/~"
},

{
    "title": " - Furrocious Elegant Style", 
    "Description": " the Italian luxare vast. ", 
    "url": "https://www.s"
},

{
    "title": "Rome, Fountains and Fendi Sunglasses",
    "Description": " Fendi started off as a store. ",
    "url": "https://www.~"
},

{
    "title": "Tipsnglasses",
    "Description": "Whether irregular orn season.", 
    "url": "https://www.sooic"
}

]

How can I achieve this?

How about:

import json


def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    return {"title": title, "Description": description, "url": post_url}


def crawl_web(url):
    while True:
        jobs = []
        post_urls = get_post_url(url)
        for url in post_urls:
            jobs.append(scrape_post_info(url))
            # json.dump (not json.dumps) serializes the list straight to the file
            with open("data.json", "w") as f:
                json.dump(jobs, f)


# Execute the main function 'crawl_web'
if __name__ == "__main__":
    crawl_web("www.examp....com")

Note that this will rewrite your entire file on each iteration of "post_urls", so it might become quite slow with large files and slow I/O.

Depending on how long your job runs and how much memory you have, you might want to move the file writing out of the for loop and only write the file once.
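For example, a minimal sketch of that idea (it reuses the question's get_post_url helper and the scrape_post_info above, and assumes a single pass over the URLs is enough, so the while True loop is dropped):

import json


def crawl_web(url):
    jobs = []
    for post_url in get_post_url(url):
        jobs.append(scrape_post_info(post_url))

    # Write the complete list once, after all URLs have been scraped;
    # indent=4 produces the pretty-printed array shown in the question
    with open("data.json", "w") as f:
        json.dump(jobs, f, indent=4)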

Note: if you really want streaming JSON writes, you might want to look at something like this package: https://pypi.org/project/jsonstreams/ . However, I'd suggest choosing another format such as CSV, which is much better suited to streaming writes.
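To illustrate the CSV route, a minimal sketch using only the standard library (the field names are taken from the dictionaries above; the file name data.csv is an assumption):

import csv


def crawl_web(url):
    fieldnames = ["title", "Description", "url"]
    with open("data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for post_url in get_post_url(url):
            # Each row is written as soon as it is scraped, so the file
            # never has to be rewritten and memory use stays flat
            writer.writerow(scrape_post_info(post_url))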
