Unpack JSON lines to pandas dataframe

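I am working with the JSON Lines format and trying to "unpack" the dictionary objects into a single list. Because each line uses a list to hold the dictionary objects, I have not found an earlier post that deals with this case. The data looks like this, with a bunch of nested dictionaries inside a list object: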

0        [{'created_at': 'Sun Jun 14 20:20:28 +0000 202...
1        [{'created_at': 'Sat Jul 25 22:30:14 +0000 202...
2        [{'created_at': 'Sat May 30 02:22:04 +0000 202...
3        [{'created_at': 'Tue May 05 16:54:05 +0000 202...
4        [{'created_at': 'Sat Jun 20 13:50:23 +0000 202...
                               ...                        
17453    [{'created_at': 'Mon Apr 13 01:01:10 +0000 202...
17454    [{'created_at': 'Fri Jul 17 09:00:50 +0000 202...
17455    [{'created_at': 'Sun Jun 21 00:51:54 +0000 202...
17456    [{'created_at': 'Tue Jun 02 18:23:49 +0000 202...
17457    [{'created_at': 'Thu May 28 00:27:01 +0000 202...

What I have tried so far:

import pandas as pd

with open('data') as file:
    lines = file.read().splitlines()
df_inter = pd.DataFrame(lines)
df_inter.columns = ['json_element']

For the nested dictionaries, I would use pd.json_normalize as suggested in this post: pd.json_normalize(df_inter['json_element'].apply(json.loads)). However, how can I unpack multiple dictionary objects into one row?

Edit

Since the data is large, here is part of a single row:

[{'created_at': 'Sun Jun 14 20:20:28 +0000 2020', 'id': 1272262651100434433, 'id_str': '1272262651100434433', 'truncated': False, 'display_text_range': [0, 243], 'entities': {'hashtags': [{'text': 'Tenet', 'indices': [82, 88]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1272262640753094656, 'id_str': '1272262640753094656', 'indices': [244, 267], 'media_url': 'http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg'...}]

If your data file looks like this:

[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]

You can use the following code to get one DataFrame row per line of the JSONL file.

import json
import pandas as pd

with open('data') as f:
    df = pd.DataFrame(json.loads(line)[0] for line in f)

Your df will look like this:

                       created_at                   id               id_str  truncated display_text_range                                           entities
0  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
1  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
2  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
3  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
4  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   created_at          5 non-null      object
 1   id                  5 non-null      int64
 2   id_str              5 non-null      object
 3   truncated           5 non-null      bool
 4   display_text_range  5 non-null      object
 5   entities            5 non-null      object
dtypes: bool(1), int64(1), object(4)
memory usage: 333.0+ bytes
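
Note that json.loads(line)[0] keeps only the first dictionary in each line's list, and entities stays a nested dict in a single column. If some lines hold more than one dictionary, or you want the nested keys as their own columns, a minimal sketch along these lines may help. It assumes every non-empty line is a valid JSON array of objects; the read_records helper and the inline sample data are hypothetical stand-ins for illustration, not part of the original answer.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the 'data' file: each line is a JSON list of dicts.
# The second line deliberately holds two dicts.
sample = io.StringIO(
    '[{"id": 1, "entities": {"hashtags": [{"text": "Tenet"}], "urls": []}}]\n'
    '[{"id": 2, "entities": {"hashtags": [], "urls": []}}, '
    '{"id": 3, "entities": {"hashtags": [], "urls": []}}]\n'
)


def read_records(lines):
    """Yield every dict from every non-empty line, not just the first one."""
    for line in lines:
        line = line.strip()
        if line:
            yield from json.loads(line)


records = list(read_records(sample))

# json_normalize expands nested dicts into dotted column names;
# list values (such as the hashtags list) are left as Python lists.
flat = pd.json_normalize(records)
print(flat)
```

With a real file, you would pass the open file handle to read_records instead of sample. Nested keys then show up as columns such as entities.hashtags, while list values remain lists and would need a further step (for example DataFrame.explode) if you want one row per list item.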
