Replace NaN with empty list in a pandas dataframe

I'm trying to replace some NaN values in my data with an empty list []. However, the list is represented as a str, which doesn't let me apply the len() function properly. Is there any way to replace a NaN value with an actual empty list in pandas?

In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.nan, np.nan], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.nan, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64
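
The length of 2 reported for rows 2 and 3 comes from the replacement value being the two-character string '[]' rather than a list. A minimal sketch in plain Python (the variable names are illustrative only):

```python
# The value passed to replace() above is a string, not a list.
placeholder = '[]'
real_empty = []

# len() counts the two characters '[' and ']' in the string...
str_len = len(placeholder)

# ...but counts zero elements in an actual empty list.
list_len = len(real_empty)

print(type(placeholder).__name__, str_len)
print(type(real_empty).__name__, list_len)
```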

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to do this using apply so that the list object is not interpreted as an array to assign back to the df; otherwise pandas will try to align its shape with the original series.

EDIT

Using your updated sample, the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64
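
The mask-and-apply steps above could be wrapped in a small helper. This is only a sketch: the function name fill_na_with_empty_lists is made up for illustration, it assumes the column has object dtype, and it uses isna and a plain column label where the snippet above uses isnull and ['x'].

```python
import numpy as np
import pandas as pd


def fill_na_with_empty_lists(df, column):
    """Hypothetical helper: replace missing values in `column` with empty lists, in place."""
    mask = df[column].isna()
    # apply builds one new list per masked row, so the right-hand side is a
    # Series of list objects aligned on the masked index, not a bare list
    # that pandas would try to spread across the selected rows.
    df.loc[mask, column] = df.loc[mask, column].apply(lambda _: [])
    return df


d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan], 'y': [1, 2, 3, 4]})
fill_na_with_empty_lists(d, 'x')

# Every cell in 'x' is now a real list, so len works row by row.
lengths = d['x'].apply(len).tolist()
print(lengths)
```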

To extend the accepted answer: apply calls can be particularly expensive. The same task can be accomplished without apply by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
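
One caveat worth checking before relying on this: in Python, [[]] * n repeats a reference to the same list object n times. Assuming pandas keeps those references as-is in the object column, the filled cells would all point at one shared list, which only matters if the lists are later mutated in place. A plain-Python sketch of the difference (variable names are illustrative):

```python
n = 3

# Multiplying a list copies references, so all n slots hold the SAME inner list.
shared = [[]] * n
shared[0].append('a')

# A comprehension evaluates [] once per iteration, giving n distinct lists.
independent = [[] for _ in range(n)]
independent[0].append('a')

print(shared)
print(independent)
```

If independent lists are needed, the comprehension form can be used to build the values instead, at some cost in speed.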

A quick timing comparison:

def empty_assign_1(s):
    s.isna().apply(lambda x: [])

def empty_assign_2(s):
    pd.Series([[]] * s.isna().sum()).values

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 19.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Nearly 9 times faster!

You can also use a list comprehension for this:

d['x'] = [[] if x is np.nan else x for x in d['x']]
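
Note that an identity check against NaN only matches the exact np.nan object, so it can miss NaN values that arrive as different float objects. A sketch of a variant that tests for lists instead, assuming every non-missing value in the column is a list (as in the sample data):

```python
import numpy as np
import pandas as pd

# float('nan') is a NaN that is not the same object as np.nan.
d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, float('nan')],
                  'y': [1, 2, 3, 4]})

# Keep existing lists; anything that is not a list becomes a new empty list.
d['x'] = [x if isinstance(x, list) else [] for x in d['x']]

lengths = d['x'].apply(len).tolist()
print(lengths)
```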
