
How do I convert multiple JSON files with unidentical structure to a single pandas dataframe?

The input is many JSON files differing in structure, and the desired output is a single dataframe.

Input Description:

Each JSON file may have 1 or many attackers and exactly 1 victim. The attackers key points to a list of dictionaries. Each dictionary is 1 attacker with keys such as character_id, corporation_id, alliance_id, etc. The victim key points to a dictionary with similar keys. The important thing to note here is that the keys might differ within the same JSON. For example, a JSON file may have an attackers key which looks like this:

{
    "attackers": [
        {
            "alliance_id": 99005678,
            "character_id": 94336577,
            "corporation_id": 98224639,
            "damage_done": 3141,
            "faction_id": 500003,
            "final_blow": true,
            "security_status": -9.4,
            "ship_type_id": 73796,
            "weapon_type_id": 3178
        },
        {
            "damage_done": 1614,
            "faction_id": 500003,
            "final_blow": false,
            "security_status": 0,
            "ship_type_id": 32963
        }
    ],
...

Here the JSON file has 2 attackers, but only the first attacker has the aforementioned keys. Similarly, the victim may look like this:

...
"victim": {
        "character_id": 2119076173,
        "corporation_id": 98725195,
        "damage_taken": 4755,
        "faction_id": 500002,
        "items": [...
...

Output Description:

As an output I want to create a dataframe from many (about 400,000) such JSON files stored in the same directory. Each row of the resulting dataframe should have 1 attacker and 1 victim. JSONs with multiple attackers should be split into an equal number of rows, where the attackers' properties differ but the victim properties are the same, with NaN values where a certain attacker doesn't have a key-value pair. For example, 3 attackers should produce 3 rows, and the character_id for the second attacker in the example above should be NaN.
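For illustration only (not part of the pipeline), the target shape for the 2-attacker example above could be built by hand like this, trimmed to three columns:

import numpy as np
import pandas as pd

# One row per attacker; victim columns repeated on every row;
# NaN where the second attacker lacks a key.
expected = pd.DataFrame({
    "attackers_character_id": [94336577, np.nan],
    "attackers_damage_done": [3141, 1614],
    "victim_character_id": [2119076173, 2119076173],
})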

Current Method:

To achieve this, I first create an empty list. Then I iterate through all the files, open them, load them as JSON objects, convert each to a dataframe, and append the dataframe to the list. Please note that pd.DataFrame([json.load(fi)]) has the same output as pd.json_normalize(json.load(fi)).

import json
import os

import pandas as pd
from tqdm import tqdm

mainframe = []

for file in tqdm(os.listdir("D:/Master/killmails_jul"), ncols=100, ascii='          >'):
    with open("%s/%s" % ("D:/Master/killmails_jul", file), 'r') as fi:
        mainframe.append(pd.DataFrame([json.load(fi)]))

After this loop, I am left with a list of dataframes which I concatenate using pd.concat().

mainframe = pd.concat(mainframe)

As of yet, the dataframe only has 1 row per JSON irrespective of the number of attackers. To fix this, I use DataFrame.explode() in the next step.

mainframe = mainframe.explode('attackers')
mainframe.reset_index(drop=True, inplace=True)

Now I have separate rows for each attacker, however the attackers & victim keys are still hidden in their respective columns. To fix this I 'explode' the two columns horizontally with .apply(pd.Series) and add a prefix for easy recognition as follows:

intframe = mainframe["attackers"].apply(pd.Series).add_prefix("attackers_").join(mainframe["victim"].apply(pd.Series).add_prefix("victim_"))

In the next step I join this intermediate frame with the mainframe to retain the killmail_id and killmail_hash columns, then remove the attackers & victim columns as I have now expanded them.

mainframe = intframe.join(mainframe)
mainframe.fillna(0, inplace=True)
mainframe.drop(['attackers','victim'], axis=1, inplace=True)

This gives me the desired output with the following 24 columns:

['attackers_character_id', 'attackers_corporation_id', 'attackers_damage_done', 'attackers_final_blow', 'attackers_security_status', 'attackers_ship_type_id', 'attackers_weapon_type_id', 'attackers_faction_id', 'attackers_alliance_id', 'victim_character_id', 'victim_corporation_id', 'victim_damage_taken', 'victim_items', 'victim_position', 'victim_ship_type_id', 'victim_alliance_id', 'victim_faction_id', 'killmail_id', 'killmail_time', 'solar_system_id', 'killmail_hash', 'http_last_modified', 'war_id', 'moon_id']

Question:

Is there a better way to do this than what I am doing right now? I tried to use generators but couldn't get them to work; I get an AttributeError: 'str' object has no attribute 'read' from the code below.

from glob import glob

all_files_paths = glob(os.path.join('D:\\Master\\kmrest', '*.json'))

def gen_df(files):
    for file in files:
        # Bug: the opened file object is never bound to a name, so
        # json.load() receives the path string rather than a file object,
        # hence AttributeError: 'str' object has no attribute 'read'.
        # Fix: open(file, 'r') as fi, then json.load(fi).
        with open(file, 'r'):
            data = json.load(file)
        data = pd.DataFrame([data])
        yield data

mainframe = pd.concat(gen_df(all_files_paths), ignore_index=True)

Will using the pd.concat() function with a generator lead to quadratic copying? Also, I am worried that opening and closing so many files is slowing down computation. Maybe it would be better to create a JSONL file from all the JSONs first and then create a dataframe from each line, as sketched below.
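A minimal sketch of that JSONL idea, assuming one JSON document per file (the output path is a placeholder):

import json
import os

import pandas as pd

src_dir = "D:/Master/killmails_jul"
jsonl_path = "D:/Master/killmails_jul.jsonl"   # placeholder output path

# One pass over the directory: write each JSON document as one line.
with open(jsonl_path, "w") as out:
    for name in os.listdir(src_dir):
        with open(os.path.join(src_dir, name)) as fi:
            out.write(json.dumps(json.load(fi)) + "\n")

# pandas then reads the whole file in a single call, one row per line.
mainframe = pd.read_json(jsonl_path, lines=True)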

If you'd like to get your hands on the files I am trying to work with, you can click here. Let me know if further information is needed.

You could use pd.json_normalize() to help with the heavy lifting:

First, load your data:

import io
import json
import tarfile

import pandas as pd
import requests
from tqdm.notebook import tqdm

url = 'https://data.everef.net/killmails/2022/killmails-2022-11-22.tar.bz2'
with requests.get(url, stream=True) as r:
    fobj = io.BytesIO(r.raw.read())
    with tarfile.open(fileobj=fobj, mode='r:bz2') as tar:
        json_files = [it for it in tar if it.name.endswith('.json')]
        data = [json.load(tar.extractfile(it)) for it in tqdm(json_files)]

To do the same with your files:

import json
from glob import glob

def json_load(filename):
    with open(filename) as f:
        return json.load(f)

topdir = '...'  # the dir containing all your json files
data = [json_load(fn) for fn in tqdm(glob(f'{topdir}/*.json'))]

Once you have a list of dicts in data:

others = ['killmail_id', 'killmail_hash']
a = pd.json_normalize(data, 'attackers', others, record_prefix='attackers.')
v = pd.json_normalize(data).drop('attackers', axis=1)
df = a.merge(v, on=others)

Some quick inspection:

>>> df.shape
(44903, 26)

# check:
>>> sum([len(d['attackers']) for d in data])
44903

>>> df.columns
Index(['attackers.alliance_id', 'attackers.character_id',
       'attackers.corporation_id', 'attackers.damage_done',
       'attackers.final_blow', 'attackers.security_status',
       'attackers.ship_type_id', 'attackers.weapon_type_id',
       'attackers.faction_id', 'killmail_id', 'killmail_hash', 'killmail_time',
       'solar_system_id', 'http_last_modified', 'victim.alliance_id',
       'victim.character_id', 'victim.corporation_id', 'victim.damage_taken',
       'victim.items', 'victim.position.x', 'victim.position.y',
       'victim.position.z', 'victim.ship_type_id', 'victim.faction_id',
       'war_id', 'moon_id'],
      dtype='object')

>>> df.iloc[:5, :5]
   attackers.alliance_id  attackers.character_id  attackers.corporation_id  attackers.damage_done  attackers.final_blow
0  99007887.0             1.450608e+09            2.932806e+08              1426                   False               
1  99010931.0             1.628193e+09            5.668252e+08              1053                   False               
2  99007887.0             1.841341e+09            1.552312e+09              1048                   False               
3  99007887.0             2.118406e+09            9.872458e+07               662                   False               
4  99005839.0             9.573650e+07            9.947834e+08               630                   False               

>>> df.iloc[-5:, -5:]
       victim.position.z  victim.ship_type_id  victim.faction_id  war_id  moon_id
44898  1.558110e+11       670                 NaN                NaN     NaN     
44899 -7.678686e+10       670                 NaN                NaN     NaN     
44900 -7.678686e+10       670                 NaN                NaN     NaN     
44901 -7.678686e+10       670                 NaN                NaN     NaN     
44902 -7.678686e+10       670                 NaN                NaN     NaN     

Note also that, as desired, missing keys for attackers are NaN:

>>> df.iloc[15:20, :2]
    attackers.alliance_id  attackers.character_id
15  99007887.0             2.117497e+09          
16  99011893.0             1.593514e+09          
17         NaN             9.175132e+07          
18         NaN             2.119191e+09          
19  99011258.0             1.258332e+09          
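A note on scale: the snippet above handles one day's killmails comfortably, but with ~400,000 files you may want to bound memory by normalizing in batches and concatenating. A hedged sketch, reusing the data list and others columns from above (the batch size is arbitrary):

def normalize_batch(batch):
    # Same normalize-and-merge as above, applied to one slice of `data`.
    a = pd.json_normalize(batch, 'attackers', others, record_prefix='attackers.')
    v = pd.json_normalize(batch).drop('attackers', axis=1)
    return a.merge(v, on=others)

batch_size = 10_000  # arbitrary; tune to your memory budget
df = pd.concat(
    (normalize_batch(data[i:i + batch_size])
     for i in range(0, len(data), batch_size)),
    ignore_index=True,
)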
