Memory error while parsing huge JSON file
I am trying to parse a huge 12 GB JSON file with nearly 5 million lines (each line is an object) in Python and store it in a database. I am using ijson and multiprocessing to make it run faster. Here is the code:
def parse(paper):
    global mydata
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"], paper_title = paper["title"], venue = venue)
    paper1.save()
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json, 'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"], 'venue_raw': venue.venue_raw, 'year': paper["year"], 'number_of_times': 1}, ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename, encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')
        for paper in papers:
            p.apply_async(parse, (paper,))
    p.close()
    p.join()
    mydata = mydata.groupby(by=['author_id','venue_raw','year'], axis=0, as_index=False).sum()
    mydata = mydata.groupby(by=['author_id','venue_raw'], axis=0, as_index=False, group_keys=False).apply(lambda x: sum((1+x.year-x.year.min())*numpy.log10(x.number_of_times+1)))
    df = mydata.index.to_frame(index=False)
    df = pd.DataFrame({'author_id': df["author_id"], 'venue_raw': df["venue_raw"], 'rating': mydata.values[:,2]})
    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()
Can someone help me?
I had to make quite a few inferences and assumptions, but it looks like populating your SQL database can be done quite neatly with the code below. I took the liberty of adding the tqdm package so you get a progress indication, and of assuming a PaperAuthor model that links each paper to its authors. The Venue lookup uses get_or_create, and both get_or_create and create are stubbed out so the script can run without the database models (or indeed, without Django), requiring only the dataset you are using. On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.

The Pandas processing is left as an exercise for the reader ;-), but I imagine it would involve pd.read_sql() to read this preprocessed data back from the database.
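As a rough illustration of that exercise (the table and column names are my assumption, mirroring the PaperAuthor model below; an in-memory SQLite database stands in for the real one), the rating formula from the question could be computed from the preprocessed rows like this:

```python
# Hypothetical sketch: read the preprocessed PaperAuthor rows back with
# pd.read_sql() and compute the question's per-(author, venue) rating.
# The table layout is an assumption, mirroring the PaperAuthor model.
import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paperauthor (author_id INTEGER, venue_raw TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO paperauthor VALUES (?, ?, ?)",
    [(1, "ICML", 2019), (1, "ICML", 2019), (1, "ICML", 2020), (2, "KDD", 2018)],
)

df = pd.read_sql("SELECT author_id, venue_raw, year FROM paperauthor", conn)

# One row per (author, venue, year) with a publication count.
counts = (df.groupby(["author_id", "venue_raw", "year"], as_index=False)
            .size().rename(columns={"size": "number_of_times"}))

# The question's formula: sum((1 + year - min_year) * log10(count + 1)).
def rating(g):
    return ((1 + g["year"] - g["year"].min()) * np.log10(g["number_of_times"] + 1)).sum()

ratings = (counts.groupby(["author_id", "venue_raw"])
                 .apply(rating).rename("rating").reset_index())
print(ratings)
```

This keeps only the small aggregated result in memory, since the raw rows live in the database.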
import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()