Extract non-empty values from columns of a dataframe in Python

This is a follow-up to this question: Extract non-empty values from the regex array output in python

I have a DataFrame whose columns "col" and "col1" hold values of type 'numpy.ndarray'. It looks like this:

       col                         col1
   [[5, , , ,]]             [qwe,ret,der,po]
   [[, 4, , ,][, , 5, ]]       [fgk,hfrt]
        []                           []
   [[, , , 9]]                  [test]  

I want my output to be:

      col  col1
       5  qwe,ret,der,po
       5  fgk,hfrt
       0  NOT FOUND 
       9  test

Please note that in column "col", the second row of the output holds the maximum of the two entries. I tried the solution provided in the above link, but it gives ValueError: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".

Thanks

Edit: Dictionary form of my DataFrame, column "col" only:

  {'col': {0: array([['5', '', '', '', '', '']],
  dtype='|S1'), 1: array([], dtype=float64), 2: array([], dtype=float64), 3: array([], dtype=float64), 4: array([], dtype=float64), 5: array([['8', '', '', '', '', '']],
  dtype='|S1'), 6: array([], dtype=float64), 7: array([], dtype=float64), 8: array([], dtype=float64), 9: array([], dtype=float64), 10: array([], dtype=float64), 11: array([['', '8', '', '', '', '']],
  dtype='|S1'), 12: array([], dtype=float64), 13: array([], dtype=float64), 14: array([], dtype=float64), 15: array([['7', '', '', '', '', '']],
  dtype='|S1'), 16: array([], dtype=float64)}}

Try the following:

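import pandas as pd


def parse_nested_max(xss):
    # Largest integer among the non-empty strings of a 2-dim array; 0 if none.
    return max(
        # Inner max: one value per subarray, skipping '' entries.
        (max((int(x) for x in xs if x), default=0) for xs in xss),
        default=0
    )


# df is the DataFrame from the question.
df['col'] = df.col.apply(parse_nested_max)
# An empty array joins to '', which is falsy, so the fallback applies.
df['col1'] = df.col1.apply(lambda s: ','.join(s) or 'NOT FOUND')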

This assumes that each cell of the first column is a 2-dim array of strings, and each cell of the second is a 1-dim array of strings.

For the first column, do the following:

  1. For each subarray, drop the '' elements and convert the rest to int.
  2. For each subarray, compute the max, with the convention that max([]) == 0.
  3. Finally, this gives a list of integers, so simply take the max; use default=0 to account for the possibility of an empty cell, as in the third row of your df.
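As a rough illustration of those three steps, here is a sketch that runs them on plain nested lists standing in for the array cells. The `rows` data and the `trace_nested_max` helper are made up for this example and are not part of the code above.

```python
def trace_nested_max(xss):
    """Return the intermediate results of the three steps for one cell."""
    # Step 1: drop '' elements and convert the rest to int.
    cleaned = [[int(x) for x in xs if x] for xs in xss]
    # Step 2: max per subarray, treating an empty subarray as 0.
    per_sub = [max(xs, default=0) for xs in cleaned]
    # Step 3: max over the per-subarray results; 0 if the cell is empty.
    return cleaned, per_sub, max(per_sub, default=0)


# Hypothetical cells mirroring the four rows shown in the question.
rows = [
    [['5', '', '', '']],
    [['', '4', '', ''], ['', '', '5', '']],
    [],
    [['', '', '', '9']],
]

results = [trace_nested_max(cell)[2] for cell in rows]
print(trace_nested_max(rows[1]))  # ([[4], [5]], [4, 5], 5)
print(results)                    # [5, 5, 0, 9]
```

The second cell shows why two levels of max are needed: each subarray contributes one value, and the outer max picks the larger of the two.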

For the second column, exploit the fact that bool(','.join([])) == False.
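A small sketch of that idiom in isolation, using a hypothetical helper name `join_or_default`:

```python
def join_or_default(items, default='NOT FOUND'):
    # ','.join([]) is '', and '' is falsy, so `or` falls through to the default.
    return ','.join(items) or default


print(join_or_default(['qwe', 'ret', 'der', 'po']))  # qwe,ret,der,po
print(join_or_default([]))                           # NOT FOUND
```

One caveat: a cell holding only a single empty string, such as [''], also joins to '' and would be reported as 'NOT FOUND'.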

Finally, a tip: you will get better feedback if your dataframe is easy to recreate. Try calling df.to_dict() and embedding the output in your source where you define df.
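For instance, a minimal sketch of that round trip, using a made-up two-row frame with scalar cells rather than your actual data:

```python
import pandas as pd

# Hypothetical small frame; the values are illustrative only.
df = pd.DataFrame({'col': [5, 0], 'col1': ['qwe,ret', 'NOT FOUND']})

# to_dict() returns a plain nested dict of the form {column: {index: value}}.
as_dict = df.to_dict()
print(as_dict)

# Pasting that dict into a question lets readers rebuild the frame directly.
rebuilt = pd.DataFrame(as_dict)
```

Cells that hold numpy arrays, as in your edit, print as array(...) literals, so readers also need `from numpy import array` before pasting.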
