
How to read multiple json files into pandas dataframe?

I'm having a hard time loading multiple line-delimited JSON files into a single pandas dataframe. This is the code I'm using:

import os, json
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', None)

temp = pd.DataFrame()

path_to_json = '/Users/XXX/Desktop/Facebook Data/*' 

json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index = True)

It looks like all the files are loading when I look through file_list, but I cannot figure out how to get each file into a dataframe. There are about 50 files with a couple of lines in each file.

Change the last line to:

temp = temp.append(data, ignore_index = True)

The reason we have to do this is that the append does not happen in place. The append method does not modify the data frame; it just returns a new data frame with the result of the append operation.
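A minimal sketch of this non-in-place behavior (shown with pd.concat, since DataFrame.append was removed in pandas 2.0; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1]})
extra = pd.DataFrame({"a": [2]})

# Combining returns a NEW frame; the original df is left untouched.
result = pd.concat([df, extra], ignore_index=True)
# df still has 1 row; result has 2.
```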

Edit:

Since writing this answer I have learned that you should never use DataFrame.append inside a loop, because it leads to quadratic copying (see this answer).

What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:

dfs = [] # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True) # read data frame from json file
    dfs.append(data) # append the data frame to the list

temp = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.

This alternative should be considerably faster.
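The loop above can be sketched end to end with generated sample files (the directory, file names, and data below are all hypothetical, just to make the example runnable):

```python
import json
import tempfile
from pathlib import Path
import pandas as pd

# Create two small line-delimited JSON files in a temp directory.
tmp = Path(tempfile.mkdtemp())
for name, rows in [("a.json", [{"x": 1}, {"x": 2}]), ("b.json", [{"x": 3}])]:
    (tmp / name).write_text("\n".join(json.dumps(r) for r in rows))

# Read each file into a frame, then concatenate once at the end.
dfs = [pd.read_json(p, lines=True) for p in sorted(tmp.glob("*.json"))]
df = pd.concat(dfs, ignore_index=True)
```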

If you need to flatten the JSON, Juan Estevez's approach won't work as is. Here is an alternative:

import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs
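To see what the flattening step does, here is a tiny self-contained example of pd.json_normalize on a nested record (the record itself is made up):

```python
import json
import pandas as pd

# A nested record: "user" holds a sub-object.
raw = '{"user": {"name": "Ann", "age": 30}, "score": 7}'
flat = pd.json_normalize(json.loads(raw))
# Nested keys become dotted column names such as "user.name" and "user.age".
```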

Or, if your JSON files are line-delimited (not tested):

import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f.readlines():
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs
from pathlib import Path
import pandas as pd

paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
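This pathlib approach assumes each file contains a single JSON object, which read_json with typ="series" turns into one row. A runnable sketch with generated files (paths and data are hypothetical):

```python
import json
import tempfile
from pathlib import Path
import pandas as pd

# Two sample files, each holding one JSON object.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.json").write_text(json.dumps({"x": 1, "y": 2}))
(tmp / "b.json").write_text(json.dumps({"x": 3, "y": 4}))

# Each file becomes a Series (keys -> index), and the list of
# Series becomes one row per file in the resulting DataFrame.
paths = sorted(tmp.glob("*.json"))
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
```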

Maybe you should state whether the json files were created with pandas pd.to_json() or in another way. I used data which was not created with pd.to_json(), and I think it is not possible to use pd.read_json() in my case. Instead, I programmed a customized for-each loop approach to write everything to the DataFrames.

I combined Juan Estevez's answer with glob. Thanks a lot.

import pandas as pd
import glob

def readFiles(path):
    files = glob.glob(path)
    dfs = [] # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True) # read data frame from json file
        dfs.append(data) # append the data frame to the list

    df = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.
    return df
