Python new column based on NaN in other columns

I'm quite new to Python and this is my first ever question, so please be gentle with me!

I have tried the answers to other similar questions but am still quite stuck.

I am using Pandas and I have a dataframe that is a merge of several different SQL tables and looks something like this:

Col_1   Col_2   Col_3   Col_4
1       NaN     NaN     NaN
2       Y       NaN     NaN
3       Z       C       S
4       NaN     B       W

I don't care about the values in Col_2, Col_3 and Col_4 (note that these can be strings, integers or objects, depending on the column).

I just care that at least one of these columns is populated, so ideally I would like a fifth column like:

Col_1   Col_2   Col_3   Col_4   Col_5
1       NaN     NaN     NaN     0
2       Y       NaN     NaN     1
3       Z       C       S       1
4       NaN     B       W       1

Then I want to drop the columns Col_2 to Col_4.

My initial thought was something like the function below, but this reduces my dataframe from 50000 rows to 50. I don't want to delete any rows.

def function(row):
   if (isnull.row['col_2'] and isnull.row['col_3'] and isnull.row['col_3'] is None):
      return '0'
   else:
      return '1'

df['col_5'] = df.apply(lambda row: function (row),axis=1)

Any help would be much appreciated.

Use np.any and pass axis=1 so that it tests row-wise. This produces a boolean array which, when converted to int, turns every True into 1 and every False into 0. It will be much faster than calling apply, which iterates row by row and will be very slow:

In [30]:

import numpy as np
df['Col_5'] = np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
df
Out[30]:
   Col_1 Col_2 Col_3 Col_4  Col_5
0      1   NaN   NaN   NaN      0
1      2     Y   NaN   NaN      1
2      3     Z     C     S      1
3      4   NaN     B     W      1

In [31]:

df = df[['Col_1', 'Col_5']]
df
Out[31]:
   Col_1  Col_5
0      1      0
1      2      1
2      3      1
3      4      1
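
For anyone who wants to reproduce the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question, and `check_cols` and `result` are names introduced here purely for illustration. It uses the `DataFrame.any` method rather than `np.any`, which should give the same result on a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (assumed layout).
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": [np.nan, "Y", "Z", np.nan],
    "Col_3": [np.nan, np.nan, "C", "B"],
    "Col_4": [np.nan, np.nan, "S", "W"],
})

# Hypothetical name: the columns to test for missing values.
check_cols = ["Col_2", "Col_3", "Col_4"]

# notnull() gives a boolean frame; any(axis=1) is True for a row when
# at least one of the checked columns is populated.
df["Col_5"] = df[check_cols].notnull().any(axis=1).astype(int)

# Drop the columns that are no longer needed; no rows are removed.
result = df.drop(columns=check_cols)
print(result)
```

Naming the columns explicitly, instead of slicing with `df.columns[1:]`, is a design choice that keeps the check correct if the merge later adds more columns.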

Here is the output from np.any:

In [34]:

np.any(df[df.columns[1:]].notnull(), axis=1)
Out[34]:
array([False,  True,  True,  True], dtype=bool)

Timings

In [35]:

%timeit df[df.columns[1:]].apply(lambda x: all(x.isnull()) , axis=1).astype(int)
%timeit np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
100 loops, best of 3: 2.46 ms per loop
1000 loops, best of 3: 1.4 ms per loop

So on your test data, for a df of this size, my method is about 1.75x faster than the other answer (2.46 ms vs 1.4 ms).
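
Four rows is too small a frame to say much about speed. Below is a rough sketch for repeating the comparison on a larger frame with the standard `timeit` module. The random data, the 70% missing rate, and the names `make_frame`, `row_wise` and `vectorised` are all assumptions made for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=10_000, seed=0):
    """Build a hypothetical frame with roughly 70% missing values per column.

    Raise n_rows towards 50_000 to match the size mentioned in the question.
    """
    rng = np.random.default_rng(seed)
    data = {"Col_1": np.arange(n_rows)}
    for name in ("Col_2", "Col_3", "Col_4"):
        values = rng.integers(0, 100, size=n_rows).astype(float)
        values[rng.random(n_rows) < 0.7] = np.nan
        data[name] = values
    return pd.DataFrame(data)


def row_wise(df):
    # Calls a Python function once per row.
    return df[df.columns[1:]].apply(lambda x: not all(x.isnull()), axis=1).astype(int)


def vectorised(df):
    # Works on whole columns at once instead of looping in Python.
    return df[df.columns[1:]].notnull().any(axis=1).astype(int)


big = make_frame()

# Both approaches should agree before their speed is compared.
assert row_wise(big).equals(vectorised(big))

print("apply:", timeit.timeit(lambda: row_wise(big), number=1))
print("any:  ", timeit.timeit(lambda: vectorised(big), number=10) / 10)
```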

Update

As you are running pandas version 0.12.0, you need to call the top-level notnull function, as that method is not available at the df level:

np.any(pd.notnull(df[df.columns[1:]]), axis=1).astype(int)

I suggest you upgrade, as you'll get lots more features and bug fixes.
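
To illustrate the point about the top-level function: on versions where both spellings exist they should return the same mask, so `pd.notnull` is a safe choice when the code has to run on an old install. This is a small sketch with a hand-built frame standing in for the real data; the names `subset`, `mask_top_level` and `mask_method` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the real data (assumed layout).
df = pd.DataFrame({
    "Col_1": [1, 2],
    "Col_2": [np.nan, "Y"],
    "Col_3": [np.nan, np.nan],
})

subset = df[df.columns[1:]]

# Top-level function: takes the frame as an argument.
mask_top_level = pd.notnull(subset)

# Method form: needs a pandas version where DataFrame.notnull exists.
mask_method = subset.notnull()

# True when both spellings produce the same boolean frame.
print(mask_top_level.equals(mask_method))

# Reduce the plain NumPy array row-wise and convert the flags to 0/1.
print(np.any(mask_top_level.values, axis=1).astype(int))
```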

Using a function:

df['col_5'] = df[df.columns[1:]].apply(lambda x: not all(x.isnull()), axis=1).astype(int)

For my money this is a bit easier to read. Not sure which is quicker.
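
If a named row-wise function is preferred for readability, the attempt in the question can be made to work along these lines. This is only a sketch: `has_any_value` and `check_cols` are hypothetical names, the sample frame is rebuilt by hand from the question, and each cell is tested with the top-level `pd.notnull`. Every row is kept, because `apply` with `axis=1` returns one value per row.

```python
import numpy as np
import pandas as pd

# Hypothetical name: the columns to test for missing values.
check_cols = ["Col_2", "Col_3", "Col_4"]


def has_any_value(row):
    """Return 1 if at least one checked column in the row is populated, else 0."""
    return int(any(pd.notnull(row[col]) for col in check_cols))


# Sample data rebuilt by hand from the question (assumed layout).
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": [np.nan, "Y", "Z", np.nan],
    "Col_3": [np.nan, np.nan, "C", "B"],
    "Col_4": [np.nan, np.nan, "S", "W"],
})

# One call per row; slower than the vectorised form but easy to follow.
df["Col_5"] = df.apply(has_any_value, axis=1)
print(df[["Col_1", "Col_5"]])
```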
