How to read a large .jl file in Python

I'm trying to read the following dataset and turn it into a pandas dataframe:
https://www.kaggle.com/marlesson/meli-data-challenge-2020

It is a file in which every line has the following format:

{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}
{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}
{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}

I've been trying the following, but it takes too long (over 60 minutes):

import numpy as np
import pandas as pd
import fileinput
import json

%%time

df = pd.DataFrame()
with fileinput.input(files='/kaggle/input/meli-data-challenge-2020/train_dataset.jl') as file:
    for line in file:
        conv = json.loads(line)
        df = df.append(conv, ignore_index=True)
df.head()

This code reads the file line by line as strings, parses each line as JSON, and appends the resulting record to the dataframe.

Is there any way to turn the dataset into a pandas dataframe faster?
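
One likely reason the loop above is slow is that every df.append call returns a new dataframe, so all rows collected so far are copied on each iteration (DataFrame.append was also removed in pandas 2.0). A minimal sketch of a common alternative, assuming each line is a valid JSON object: collect the parsed rows in a list and build the dataframe once at the end. The helper name read_json_lines and the in-memory sample are hypothetical stand-ins for the real file.

```python
import io
import json

import pandas as pd


def read_json_lines(handle):
    """Parse one JSON object per non-empty line, then build the DataFrame once."""
    records = [json.loads(line) for line in handle if line.strip()]
    return pd.DataFrame.from_records(records)


# Hypothetical in-memory sample standing in for the real .jl file.
sample = io.StringIO(
    '{"event_info": "a", "event_timestamp": "t1", "event_type": "view"}\n'
    '{"event_info": "b", "event_timestamp": "t2", "event_type": "search"}\n'
)
df = read_json_lines(sample)
```

With a real file, the same helper could be called inside a with open(path) as handle: block.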

The file I was trying to read is a JSON file with multiple objects, one per line. Pandas read_json() supports a lines argument for data like this:

%%time

df = pd.read_json('/kaggle/input/meli-data-challenge-2020/item_data.jl', lines=True)

Output: CPU times: user 14.1 s, sys: 3.31 s, total: 17.4 s
Wall time: 18.6 s
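
If a file is too large to parse comfortably in one call, read_json() also accepts a chunksize argument together with lines=True, in which case it returns an iterator of dataframes instead of a single one. The following is only a sketch under assumptions: the helper name read_jsonl_in_chunks, the chunk size, and the sample columns item_id and title are hypothetical and not taken from the dataset.

```python
import io
import json

import pandas as pd


def read_jsonl_in_chunks(path_or_buffer, chunksize=100_000):
    """Read a JSON Lines source chunk by chunk and concatenate the pieces."""
    reader = pd.read_json(path_or_buffer, lines=True, chunksize=chunksize)
    # Each item yielded by the reader is a DataFrame with up to `chunksize` rows.
    return pd.concat(list(reader), ignore_index=True)


# Hypothetical in-memory sample; a real call would pass the path to the .jl file.
sample = io.StringIO(
    "\n".join(json.dumps({"item_id": i, "title": f"item {i}"}) for i in range(5))
)
df = read_jsonl_in_chunks(sample, chunksize=2)
```

Concatenating every chunk still ends up holding the full dataset in memory; the chunked form mainly helps when each chunk is filtered or aggregated before being kept.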
