How to extract a nested dictionary from a STRING column in Python Pandas Dataframe?

There's a table where one data point of its event column looks like this:

THE 'event' COLUMN IS A STRING COLUMN!

df['event']
RETURNS:
"{'eventData': {'type': 'page', 'name': "WHAT'S UP"}, 'eventId': '1003', 'deviceType': 'kk', 'pageUrl': '/chick 2/whats sup', 'version': '1.0.0.888-10_7_2020__4_18_30', 'sessionGUID': '1b312346a-cd26-4ce6-888-f25143030e02', 'locationid': 'locakdi-3b0c-49e3-ab64-741f07fd4cb3', 'eventDescription': 'Page Load'}"

I'm trying to extract the nested dictionary eventData from the dictionary and create a new column from it, like below:

df['event'] 
RETURNS: 
{'eventId': '1003', 'deviceType': 'kk', 'pageUrl': '/chick 2/whats sup', 'version': '1.0.0.888-10_7_2020__4_18_30', 'sessionGUID': '1b312346a-cd26-4ce6-888-f25143030e02', 'locationid': 'locakdi-3b0c-49e3-ab64-741f07fd4cb3', 'eventDescription': 'Page Load'}

df['eventData']
RETURNS:
{'type': 'page', 'name': "WHAT'S UP"}

How do I do this?

I would look at using the pandas apply method on the event column.

If the eventData key is expected to be present in the event column dictionary for all rows of the data frame, something like the code below may suffice:

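import ast
import json
import numpy as np

def get_event_data_from_event(event_str):
    """
    Convert event string to dict and return its eventData (NaN if unavailable)
    """
    try:
        # Valid JSON needs double-quoted keys and strings
        event_as_dict = json.loads(event_str)
    except (json.JSONDecodeError, TypeError):
        try:
            # Fall back to a Python dict literal (single quotes), as in the sample above
            event_as_dict = ast.literal_eval(event_str)
        except (ValueError, SyntaxError, TypeError):
            return np.nan
    if not isinstance(event_as_dict, dict) or "eventData" not in event_as_dict:
        return np.nan
    return event_as_dict["eventData"]

df["eventData"] = df["event"].apply(get_event_data_from_event)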
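The sample in the question is the repr of a Python dict (single-quoted keys) rather than JSON, so a sketch built on ast.literal_eval may be closer to what is needed. The helper name split_event and the one-row DataFrame below are made up for illustration. The sketch pops eventData so that the remaining event dict no longer contains it, which matches the two-column output the question asks for:

```python
import ast

import pandas as pd


def split_event(event_str):
    """Return (event_without_eventData, eventData); (None, None) if unparsable."""
    try:
        event = ast.literal_eval(event_str)
    except (ValueError, SyntaxError, TypeError):
        return None, None
    if not isinstance(event, dict):
        return None, None
    # pop removes the nested dict from the parent and returns it (None if absent)
    event_data = event.pop("eventData", None)
    return event, event_data


# Hypothetical one-row frame shaped like the question's data
df = pd.DataFrame({
    "event": [
        "{'eventData': {'type': 'page', 'name': \"WHAT'S UP\"}, "
        "'eventId': '1003', 'eventDescription': 'Page Load'}"
    ]
})

parsed = df["event"].apply(split_event)       # Series of (event, eventData) tuples
df["event"] = [pair[0] for pair in parsed]
df["eventData"] = [pair[1] for pair in parsed]
```

Only use ast.literal_eval on data of this kind: it evaluates literals only, so it is much safer than eval, but it is still slower than json.loads when the strings happen to be real JSON.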

This returns NaN for that row in the eventData column if the event string cannot be parsed or does not contain an eventData key.

You could then drop those non-conforming rows with a dropna like so:

df_subset = df.dropna(subset=["eventData"])
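As a quick check of the row-dropping step, here is a small sketch on a made-up two-row frame (the data is hypothetical). It passes subset as a list of column names and relies on the default axis, which drops rows rather than columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row has no usable eventData
df = pd.DataFrame({
    "event": ["row with eventData", "malformed row"],
    "eventData": [{"type": "page"}, np.nan],
})

# Keep only the rows whose eventData is present
df_subset = df.dropna(subset=["eventData"])
```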

I finally found the answer in another post: Python flatten multilevel/nested JSON

How to use: json_col = pd.DataFrame([flatten_json(x) for x in df['json_column']])

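def flatten_json(nested_json, exclude=('',)):
    """Flatten nested dicts/lists into one dict with '_'-joined keys, skipping keys in exclude."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                if a not in exclude:
                    flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # List items are keyed by their position
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing underscore from the accumulated key
            out[name[:-1]] = x

    flatten(nested_json)
    return out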
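Note that flatten_json walks dicts and lists, so a string column has to be parsed into dicts first. As an alternative sketch (not the method from the linked post), pandas' own pd.json_normalize can flatten nested dicts in a similar way, although it does not expand lists into numbered keys the way flatten_json does. The sample rows below are made up:

```python
import ast

import pandas as pd

# Hypothetical string column holding Python dict literals
events = pd.Series([
    "{'eventData': {'type': 'page', 'name': 'home'}, 'eventId': '1003'}",
    "{'eventData': {'type': 'click', 'name': 'buy'}, 'eventId': '1004'}",
])

# Parse each string into a dict, then flatten nested keys with '_' as separator
records = [ast.literal_eval(s) for s in events]
flat = pd.json_normalize(records, sep="_")
```

The resulting frame has one column per leaf key, with the nested ones named eventData_type and eventData_name.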
