How to convert a multi-row JSON file to a DataFrame

Question

I'm using an Instagram scraper that outputs a multi-row JSON file, and I'd like to select certain values from that file and assign them to a DataFrame.

When I try to use pandas' pd.read_json, it only parses the first level of the file, so each row holds a whole nested object instead of individual values.

As an example, I'd like to have a DataFrame whose first row contains (JSON key in parentheses):

Likes ("edge_media_preview_like": {"count": 1356})
Comment Count ("edge_media_to_comment": {"count": 44})

The JSON file looks like this:

{
    "GraphImages": [
        {
            "__typename": "GraphImage",
            "comments_disabled": false,
            "dimensions": {
                "height": 770,
                "width": 1080
            },
            "display_url": "https:abc123.com",
            "edge_media_preview_like": {
                "count": 1356
            },
            "edge_media_to_caption": {
                "edges": [
                    {
                        "node": {
                            "text": "TEXT EXAMPLE 123"
                        }
                    }
                ]
            },
            "edge_media_to_comment": {
                "count": 44
            },
            "gating_info": null,
            "id": "2219687023504340370",
            "is_video": false,
            "media_preview": "abc123media",
            "owner": {
                "id": "212343915"
            },
            "shortcode": "B7N6ZZkhTWS",
            "tags": [],
            "taken_at_timestamp": 1578827334,
            "thumbnail_resources": [
                {
                    "config_height": 150,
                    "config_width": 150,
                    "src": "abc123.com"
                },
                {
                    "config_height": 240,
                    "config_width": 240,
                    "src": "abc123.com"
                },
                {
                    "config_height": 320,
                    "config_width": 320,
                    "src": "https://abc123.com"
                },
                {
                    "config_height": 480,
                    "config_width": 480,
                    "src": "https:/abc123.com"
                },
                {
                    "config_height": 640,
                    "config_width": 640,
                    "src": "https://abc123.com"
                }
            ],
            "thumbnail_src": "https://abc123.com",
            "urls": [
                "https://abc123.com"
            ],
            "username": "abc123"
        }
    ]
}

I'm looking for:

    ImageNumber Likes   CommentCount
0   1           1356    44
1   ...         ...     ...

Thank you!

Edit: this is the incorrect result I get when using pd.read_json:

    GraphImages
0   {'__typename': 'GraphImage', 'comments_disable...
1   {'__typename': 'GraphImage', 'comments_disable...
2   {'__typename': 'GraphImage', 'comments_disable...
3   {'__typename': 'GraphImage', 'comments_disable...

Answer 1

The following should work:

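import json
import pandas as pd

with open('ig.json') as json_file:
    dct = json.load(json_file)

# Flatten each item in "GraphImages"; nested keys become dotted column names.
# pandas >= 1.0 exposes this as pd.json_normalize (formerly pd.io.json.json_normalize).
df = pd.json_normalize(dct, record_path="GraphImages")[["edge_media_preview_like.count", "edge_media_to_comment.count"]]
df = df.rename(columns={"edge_media_preview_like.count": "Likes", "edge_media_to_comment.count": "CommentCount"})
df["ImageNumber"] = df.index + 1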

Which produces:

    Likes   CommentCount    ImageNumber
0   1356    44              1

I'm not sure where ImageNumber comes from, but I assume it is the position of each item in GraphImages. If so, df.index + 1 gives you that.
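
To try the flattening step without the scraper file, here is a minimal sketch that runs on an in-memory dictionary. The two-post `sample` is made up for illustration and only mirrors the two fields the question asks about; it assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level.

```python
import pandas as pd

# Hypothetical two-post sample shaped like the scraper output in the question.
sample = {
    "GraphImages": [
        {"edge_media_preview_like": {"count": 1356},
         "edge_media_to_comment": {"count": 44}},
        {"edge_media_preview_like": {"count": 10},
         "edge_media_to_comment": {"count": 2}},
    ]
}

# record_path picks the list whose items become rows; nested dicts inside
# each item are flattened into dotted column names.
flat = pd.json_normalize(sample, record_path="GraphImages")

# Keep the two count columns, then give them friendlier names.
result = flat[["edge_media_preview_like.count", "edge_media_to_comment.count"]]
result = result.rename(columns={
    "edge_media_preview_like.count": "Likes",
    "edge_media_to_comment.count": "CommentCount",
})

# ImageNumber is assumed to be the 1-based position in GraphImages.
result["ImageNumber"] = result.index + 1
result = result[["ImageNumber", "Likes", "CommentCount"]]
print(result)
```

Selecting the dotted columns before renaming keeps the link between the JSON keys and the final column names visible in one place.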

Answer 2

I found the answer. It turns out that this instagram-scraper outputs a JSON file that is composed of a dictionary, within a list, within a dictionary. The code to extract the values is as follows:

import json
import pandas as pd

with open('ig.json') as json_file:
    data = json.load(json_file)

data['Likes'] = data['GraphImages'][0]['edge_media_preview_like']['count']
...
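
The indexing above reads only the first item (`[0]`). As a rough sketch of how the same lookups could be repeated for every item and collected into a DataFrame, assuming each entry carries both count fields: the `data` dictionary below is a made-up stand-in for what `json.load` returns, and the `rows` list is just an illustrative helper.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by json.load(json_file).
data = {
    "GraphImages": [
        {"edge_media_preview_like": {"count": 1356},
         "edge_media_to_comment": {"count": 44}},
        {"edge_media_preview_like": {"count": 10},
         "edge_media_to_comment": {"count": 2}},
    ]
}

rows = []
# Walk the list inside the outer dictionary; each item is itself a dictionary.
for number, image in enumerate(data["GraphImages"], start=1):
    rows.append({
        "ImageNumber": number,
        "Likes": image["edge_media_preview_like"]["count"],
        "CommentCount": image["edge_media_to_comment"]["count"],
    })

# One dictionary per image becomes one row in the DataFrame.
df = pd.DataFrame(rows, columns=["ImageNumber", "Likes", "CommentCount"])
print(df)
```

If some entries might lack a field, `image.get(...)` with a default would be one way to avoid a KeyError, at the cost of missing values in the result.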

I hope this helps people in the future!

Answer 1
1 2020-01-13 04:10:53

Answer 2
0 accepted 2020-01-13 21:46:08
