
Memory error while parsing huge JSON file

I am trying to parse a huge 12 GB JSON file containing almost 5 million lines (each one is an object) in Python and store it in a database. I am using ijson and multiprocessing to make it run faster. Here is the code:

import json
from multiprocessing import Pool

import ijson
import numpy
import pandas as pd

# The Django models (Venues, Papers, Ratings) and the shared `mydata` dataframe
# are defined elsewhere in the project.


def parse(paper):
    global mydata
 
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"],paper_title = paper["title"],venue = venue)
    paper1.save()
            
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json,'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"] , 'venue_raw': venue.venue_raw, 'year' : paper["year"],'number_of_times': 1},ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
 
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename,encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')   
        for paper in papers:
            p.apply_async(parse,(paper,))

    p.close()
    p.join()

    mydata = mydata.groupby(by=['author_id','venue_raw','year'], axis=0, as_index = False).sum()
    mydata = mydata.groupby(by = ['author_id','venue_raw'], axis=0, as_index = False, group_keys = False).apply(lambda x: sum((1+x.year-x.year.min())*numpy.log10(x.number_of_times+1)))
    df = mydata.index.to_frame(index = False)
    df = pd.DataFrame({'author_id':df["author_id"],'venue_raw':df["venue_raw"],'rating':mydata.values[:,2]})
    
    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()

However, I get a memory error and I don't know the cause (the error was posted as a screenshot).

Can somebody help me?

I had to make quite a few inferences and assumptions, but it looks like

  • you are using Django
  • you want to populate a SQL database with venue, paper and author data
  • then you want to do some analysis using Pandas

Populating your SQL database itself can be done quite neatly with, for example, the following.

  • I added the tqdm package so you get a progress indication.
  • This assumes a PaperAuthor model linking papers and authors (a hypothetical sketch of such models follows this list).
  • Unlike the original code, this does not save duplicate Venues in the database.
  • You can see I replaced get_or_create and create with stubs so this can be run without the database models (or indeed, without Django), requiring only the dataset you use to be available.
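
For reference, here is a minimal, hypothetical sketch of the kind of Django models these stubs stand in for; the field names and types are inferred from the code in the question, not taken from the asker's actual project.

# models.py -- hypothetical sketch; adjust names and types to your real schema
from django.db import models


class Venue(models.Model):
    venue_raw = models.TextField()
    venue_type = models.TextField(null=True, blank=True)  # some venues have no "type"


class Paper(models.Model):
    paper_id = models.BigIntegerField(unique=True)
    paper_title = models.TextField()
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)


class PaperAuthor(models.Model):
    # Links one paper to one of its authors; the paper's year is stored here as
    # well so the later per-author/per-venue/per-year aggregation stays simple.
    paper = models.ForeignKey(Paper, on_delete=models.CASCADE)
    author_id = models.BigIntegerField()
    year = models.IntegerField(null=True)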

On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.

The Pandas processing is left as an exercise for the reader ;-), but I imagine it would involve pd.read_sql() to read this preprocessed data back from the database.
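
For completeness, a minimal sketch of what that step could look like, assuming the hypothetical models above, Django's default SQLite database file db.sqlite3, and placeholder myapp_* table names; the rating formula is the one from the question's code.

import sqlite3

import numpy as np
import pandas as pd

# Read the per-(author, venue, year) counts straight from the database.
# Table and column names follow Django's default <app>_<model> convention
# and are placeholders -- adjust them to the actual schema.
conn = sqlite3.connect("db.sqlite3")
query = """
    SELECT pa.author_id, v.venue_raw, pa.year, COUNT(*) AS number_of_times
    FROM myapp_paperauthor AS pa
    JOIN myapp_paper AS p ON p.id = pa.paper_id
    JOIN myapp_venue AS v ON v.id = p.venue_id
    GROUP BY pa.author_id, v.venue_raw, pa.year
"""
mydata = pd.read_sql(query, conn)

# Same rating formula as in the question, applied per (author, venue) group.
rating = (
    mydata.groupby(["author_id", "venue_raw"])
    .apply(lambda x: ((1 + x.year - x.year.min()) * np.log10(x.number_of_times + 1)).sum())
    .rename("rating")
    .reset_index()
)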

import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()
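
To run the script against the real database instead of the stubs, the two helper functions would be swapped back for the actual ORM calls and the script would need to bootstrap Django first. A rough sketch, assuming the models above live in an app called myapp and the project settings module is mysite.settings (both placeholders):

import os

import django

# Point the standalone script at the project's settings before touching the ORM.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()

from myapp.models import Paper, PaperAuthor, Venue


def get_or_create(model, **kwargs):
    return model.objects.get_or_create(**kwargs)


def create(model, **kwargs):
    return model.objects.create(**kwargs)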
