
Convert a pandas DataFrame to a dict and drop NaNs

I have some pandas DataFrame with NaNs in it. Like this:

import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN

Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

But using pandas to_dict function gives me a result like this:

>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}} 

So how do I make a dict out of the DataFrame and get rid of the NaNs?

There are many ways you could accomplish this; I spent some time evaluating performance on a not-so-large (70k rows) DataFrame. Although @der_die_das_jojo's answer is functional, it's also pretty slow.

The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.

On my test DataFrame (df):

The method above:

%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s

Another slow method:

%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='records')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s

Fastest method I could find:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s

The format of this output is a row-oriented list of dictionaries; you may need to make adjustments if you want the column-oriented form from the question. (Note: `orient='records'` is the spelling current pandas accepts; short forms like `'rows'` or `'r'` were deprecated abbreviations and have been removed.)
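For the column-oriented form from the question, a minimal sketch (applying the same NaN filter inside each column's dict rather than each row's) could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                   'B': {1: np.nan, 2: 44, 3: np.nan}})

# filter NaNs inside each column's dict instead of each row's dict
col_oriented = {col: {k: v for k, v in d.items() if pd.notnull(v)}
                for col, d in df.to_dict().items()}
```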

I'd be very interested if anyone finds an even faster answer to this question.

The first graph benchmarks generating dictionaries per column, so the output is a few very long dictionaries; the number of dicts depends on the number of columns.

I tested multiple methods with perfplot. For larger DataFrames, the fastest method is to loop over each column and remove the missing values or Nones, either with Series.dropna or with Series.notna in boolean indexing.

For smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.

[perfplot graph: column-oriented methods]

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull(df1):
    return {k1: {k:v for k,v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k,v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k,v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
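As a quick sanity check (using the question's small DataFrame rather than the benchmark frame), the two column-oriented kernels above agree on the expected result:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                    'B': {1: np.nan, 2: 44, 3: np.nan}})

# Series.dropna per column vs. boolean indexing with Series.notna
by_dropna = {k: v.dropna().to_dict() for k, v in df1.items()}
by_mask = {k: v[v.notna()].to_dict() for k, v in df1.items()}

assert by_dropna == by_mask == {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
```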

The other situation is generating dictionaries per row: you get a list with a huge number of small dictionaries, and then the fastest method is a list comprehension that filters out NaNs and Nones:

[perfplot graph: row-oriented methods]

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)


def comp_notnull1(df1):
    return [{k:v for k,v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k:v for k,v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k,v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k,v in df1.T.items()]


def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
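On the question's small frame, the row-oriented list comprehension gives one small dict per row (note that `orient='records'` drops the original index):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                    'B': {1: np.nan, 2: 44, 3: np.nan}})

# one dict per row, with NaN entries filtered out
rows = [{k: v for k, v in m.items() if pd.notnull(v)}
        for m in df1.to_dict(orient='records')]

assert rows == [{'A': 2}, {'A': 3, 'B': 44.0}, {'A': 4}]
```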

Write a function inspired by to_dict from pandas:

import pandas as pd
import numpy as np

def to_dict_dropna(data):
    # build one dict per column, dropping the NaNs before conversion
    return {k: v.dropna().to_dict() for k, v in data.items()}

raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)

result = to_dict_dropna(data)

and as a result you get what you want:

>>> result
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

You can use your own mapping class to get rid of the NaNs:

import numpy as np

class NotNanDict(dict):

    @staticmethod
    def is_nan(v):
        if isinstance(v, dict):
            return False
        return np.isnan(v)

    def __new__(cls, a):
        # returning a plain dict from __new__ bypasses normal dict
        # construction, so NaN entries are filtered out up front
        return {k: v for k, v in a if not cls.is_nan(v)}

data.to_dict(into=NotNanDict)

Output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (from @jezrael's answer):

[timing graph]

To boost the speed you can use numba:

import numpy as np
from numba import jit

@jit
def dropna(arr):
    # NOTE: re-keys by position, assuming the index is 1..n as in the example
    return [(i + 1, n) for i, n in enumerate(arr) if not np.isnan(n)]


class NotNanDict(dict):

    def __new__(cls, a):
        return {k: dict(dropna(v.to_numpy())) for k, v in a}

data.to_dict(orient='series', into=NotNanDict)

output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (from @jezrael's answer):

[timing graph]

You can use a dictionary comprehension and iterate over the columns:

{col:df[col].dropna().to_dict() for col in df}

Try the code below:

import numpy as np
import pandas as pd
raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)
{col: data[col].dropna().to_dict() for col in data}

Output

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}} 

There are a lot of ways of solving this. The fastest method changes depending on the number of rows; since performance is relevant, I assume the number of rows is large.

import pandas as pd
import numpy as np

# Create a dataframe with random data
df = pd.DataFrame(np.random.randint(10, size=[1_000_000, 2]), columns=["A", "B"])

# Add some NaNs
df.loc[df["A"]==1, "B"] = np.nan

The fastest solution I got is by simply using the dropna method and a dict comprehension:

%time {col: df[col].dropna().to_dict() for col in df.columns}

CPU times: user 528 ms, sys: 87.2 ms, total: 615 ms
Wall time: 615 ms

This is about 10 times faster than one of the proposed solutions:

%time [{k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]

CPU times: user 5.49 s, sys: 205 ms, total: 5.7 s
Wall time: 5.69 s

It is also about 2 times faster than other options like:

%time {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df.to_dict().items()}

CPU times: user 900 ms, sys: 133 ms, total: 1.03 s
Wall time: 1.03 s

The idea is to always try to use pandas or numpy builtin functions, since they are faster than regular Python.
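As an aside in that spirit, `DataFrame.stack()` (which drops NaNs by default) can build the column-oriented result without an explicit NaN test. Note that stacking upcasts mixed columns to float, so treat this as a sketch rather than a drop-in replacement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                   'B': {1: np.nan, 2: 44, 3: np.nan}})

# stack() yields a Series with a (row, column) MultiIndex, NaNs dropped
s = df.stack()
out = {col: s.xs(col, level=1).to_dict() for col in df.columns}
```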

I wrote a function to solve this problem without reimplementing to_dict and without calling it more than once. The approach is to recursively trim out the "leaves" with a NaN/None value.

import math
import numbers

def trim_nan_leaf(tree):
    """For a tree of dict-like and list-like containers, prune None and NaN leaves.

    Particularly applicable for json-like dictionary objects
    """
    # d may be a dictionary, iterable, or other (element)
    # * Do not recursively iterate if string
    # * element is the base case
    # * Only remove nan and None leaves

    def valid_leaf(leaf):
        if leaf is None:
            return(False)
        if isinstance(leaf, numbers.Number):
            if (not math.isnan(leaf)):
                # also filter the int64 minimum, used as a missing-value sentinel
                return(leaf != -9223372036854775808)
            return(False)
        return(True)

    # Attempt dictionary
    try:
        return({k: trim_nan_leaf(tree[k]) for k in tree.keys() if valid_leaf(tree[k])})
    except AttributeError:
        # Execute base case on string for simplicity...
        if isinstance(tree, str):
            return(tree)
        # Attempt iterator
        try:
            # Avoid infinite recursion for self-referential objects (like one-length strings!)
            if tree[0] == tree:
                return(tree)
            return([trim_nan_leaf(leaf) for leaf in tree if valid_leaf(leaf)])
        # TypeError occurs when neither [] nor an iterator is available
        except TypeError:
            # Base Case
            return(tree)
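A condensed, self-contained sketch of the same pruning idea (handling only dicts and lists, which covers json-like objects; the int64 sentinel check is omitted for brevity):

```python
import math
import numbers

def prune(tree):
    """Recursively drop None/NaN leaves from nested dicts and lists."""
    def valid(leaf):
        if leaf is None:
            return False
        if isinstance(leaf, numbers.Number) and isinstance(leaf, float):
            return not math.isnan(leaf)
        return True
    if isinstance(tree, dict):
        return {k: prune(v) for k, v in tree.items() if valid(v)}
    if isinstance(tree, list):
        return [prune(x) for x in tree if valid(x)]
    return tree

nested = {'A': {1: 2, 2: 3, 3: 4},
          'B': {1: float('nan'), 2: 44.0, 3: float('nan')},
          'rows': [1.0, float('nan'), {'x': None, 'y': 5}]}
assert prune(nested) == {'A': {1: 2, 2: 3, 3: 4},
                         'B': {2: 44.0},
                         'rows': [1.0, {'y': 5}]}
```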

Improving on the answer at https://stackoverflow.com/a/46098323:

With a ~300K-row DataFrame containing 2 entirely-NaN columns, that answer gives:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 8.63 s, sys: 137 ms, total: 8.77 s
Wall time: 8.79 s

With a tiny twist:

%time [ {k:v for k,v in m.items()} for m in df.dropna(axis=1).to_dict(orient='records')]
CPU times: user 4.37 s, sys: 109 ms, total: 4.48 s
Wall time: 4.49 s

The idea is to drop the NaNs first, to avoid unnecessary iteration over NaN values. In the first answer the NaNs are converted into the dict before being dropped, which can be optimized. Note that dropna(axis=1) drops entire columns, so this twist only matches the original output when the NaNs fill whole columns, as they do here.

Optimal solution that takes care of de-normalized DataFrames

I had problems when cells contained lists or Series (DataFrames that weren't normalized), so I solved it by creating my own is_na() function:

import pandas as pd

def is_na(possible_array):
    # pd.notnull returns an array for list-like cells, a plain bool otherwise
    is_not_na = pd.notnull(possible_array)
    if isinstance(is_not_na, bool):
        return not is_not_na
    return not is_not_na.any()


def to_json(df, remove_missing_fields=True):
    rows = df.to_dict("records")
    if remove_missing_fields:
        return [{k: v for k, v in m.items() if not is_na(v)} for m in rows]
    else:
        return rows
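A quick usage sketch of this approach (re-declaring the helper so the snippet runs standalone), on a frame where one cell holds a list:

```python
import numpy as np
import pandas as pd

def is_na(value):
    # pd.notnull returns an array for list-like cells, a plain bool otherwise
    result = pd.notnull(value)
    if isinstance(result, bool):
        return not result
    return not result.any()

df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, [1, 2]]})
records = [{k: v for k, v in m.items() if not is_na(v)}
           for m in df.to_dict(orient='records')]

assert records == [{'A': 1}, {'A': 2, 'B': [1, 2]}]
```

The list-valued cell survives intact because `is_na` inspects the whole array result of `pd.notnull` instead of assuming a scalar.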
