Finding missing data in a pandas DataFrame

Question

I am trying to find missing data in a dataframe, based on the values in a list. Every interface must have these 5 sub-interfaces.

import pandas as pd

sub_interface_list = ['1030', '1035', '1039', '1050', '1059']

df = pd.DataFrame({'Device': ['DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA'], 'Interface': ['Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk101', 'Eth-Trunk101', 'Eth-Trunk101', 'Eth-Trunk101'], 'Sub_interface': ['1030', '1035', '1039', '1050', '1059', '1030', '1039', '1050', '1059']})

The dataframe looks like this:

Device  Interface   Sub_interface
DeviceA Eth-Trunk100    1030
DeviceA Eth-Trunk100    1035
DeviceA Eth-Trunk100    1039
DeviceA Eth-Trunk100    1050
DeviceA Eth-Trunk100    1059
DeviceA Eth-Trunk101    1030
DeviceA Eth-Trunk101    1039
DeviceA Eth-Trunk101    1050
DeviceA Eth-Trunk101    1059

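From the list we can see that Eth-Trunk101 is missing sub_interface 1035, and I want to insert each missing value as the last row of its interface. I know it is easy to find the missing elements with dataframe.iterrows(), but is there a way to do this in pandas without a for loop?

** This is a test dataset; my real data is much larger, and iterating over it would be very slow.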

Answer 1

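You can use pyjanitor's complete function to expose the missing values (the list form below is the one from the original answer; newer pyjanitor releases may expect the column names as separate arguments):

import janitor  # registers the complete method on DataFrame
df.complete(['Interface', 'Sub_interface'])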

      Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA

You can fill the null values with ffill:

df.complete(['Interface', 'Sub_interface']).ffill()
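One caveat worth sketching: ffill copies whatever sits in the row above, so it only gives the right Device when that row belongs to the same interface. The pandas-only example below is a minimal sketch of a group-wise fill instead. The frame named completed and the DeviceB / Eth-Trunk200 values are made up for illustration, and it assumes each Interface belongs to exactly one Device.

```python
import numpy as np
import pandas as pd

# Hypothetical completed frame: the gap is the FIRST row of DeviceB's
# interface, so a plain ffill would copy 'DeviceA' from the row above.
completed = pd.DataFrame({
    'Interface': ['Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk200', 'Eth-Trunk200'],
    'Sub_interface': ['1030', '1035', '1030', '1035'],
    'Device': ['DeviceA', 'DeviceA', np.nan, 'DeviceB'],
})

# Plain forward fill, kept only to show the problem.
naive = completed['Device'].ffill()

# Group-wise fill: use the first non-null Device found within each Interface.
completed['Device'] = completed['Device'].fillna(
    completed.groupby('Interface')['Device'].transform('first')
)

print(naive.tolist())
print(completed['Device'].tolist())
```

In the sample data from the question there is only one device, so the plain ffill and the group-wise fill give the same result.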

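If you would rather stick to pandas alone (pyjanitor is a collection of convenience wrappers around pandas), the solution below works well:

Create an index of every unique combination of interface and sub_interface: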

# names= keeps the column labels after reindex(...).reset_index()
interface = pd.MultiIndex.from_product([df.Interface.unique(),
                                        df.Sub_interface.unique()],
                                       names=['Interface', 'Sub_interface'])

In [456]: interface
Out[456]: 
MultiIndex([('Eth-Trunk100', '1030'),
            ('Eth-Trunk100', '1035'),
            ('Eth-Trunk100', '1039'),
            ('Eth-Trunk100', '1050'),
            ('Eth-Trunk100', '1059'),
            ('Eth-Trunk101', '1030'),
            ('Eth-Trunk101', '1035'),
            ('Eth-Trunk101', '1039'),
            ('Eth-Trunk101', '1050'),
            ('Eth-Trunk101', '1059')],
           names=['Interface', 'Sub_interface'])

Set Interface and Sub_interface as the index, reindex with interface, then call reset_index:

df.set_index(['Interface', 'Sub_interface']).reindex(interface).reset_index()

 
      Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA
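If the goal is only to list what is missing, the same product index can be compared against the pairs that exist. The following is a minimal sketch, not part of the original answer: it builds the required pairs from the asker's sub_interface_list (rather than from the values that happen to appear in the data) and takes a set difference. The names required, existing, missing and missing_df are illustrative.

```python
import pandas as pd

sub_interface_list = ['1030', '1035', '1039', '1050', '1059']

df = pd.DataFrame({
    'Device': ['DeviceA'] * 9,
    'Interface': ['Eth-Trunk100'] * 5 + ['Eth-Trunk101'] * 4,
    'Sub_interface': ['1030', '1035', '1039', '1050', '1059',
                      '1030', '1039', '1050', '1059'],
})

# Every (Interface, Sub_interface) pair that should exist.
required = pd.MultiIndex.from_product(
    [df['Interface'].unique(), sub_interface_list],
    names=['Interface', 'Sub_interface'],
)

# Pairs that actually exist in the data.
existing = pd.MultiIndex.from_frame(df[['Interface', 'Sub_interface']])

# Set difference: required pairs that have no matching row.
missing = required.difference(existing)
missing_df = missing.to_frame(index=False)
print(missing_df)
```

Using the explicit list means a sub-interface that is absent from every interface is still reported, which df.Sub_interface.unique() cannot do.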

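Reindexing works here because the combinations of Interface and Sub_interface are unique. If they were not unique, an outer merge would be the better step; complete handles these checks under the hood.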
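The merge route for the non-unique case can be sketched as follows. This is an illustration under stated assumptions rather than the answerer's code: the small frame is made up so that one (Interface, Sub_interface) pair is duplicated, and the names grid, completed and missing are hypothetical.

```python
import pandas as pd

sub_interface_list = ['1030', '1035', '1039']

# Made-up data with a duplicated pair: ('Eth-Trunk100', '1030') appears twice.
df = pd.DataFrame({
    'Device': ['DeviceA'] * 5,
    'Interface': ['Eth-Trunk100', 'Eth-Trunk100',
                  'Eth-Trunk101', 'Eth-Trunk101', 'Eth-Trunk101'],
    'Sub_interface': ['1030', '1030', '1030', '1035', '1039'],
})

# Full grid of required pairs, as an ordinary two-column frame.
grid = pd.MultiIndex.from_product(
    [df['Interface'].unique(), sub_interface_list],
    names=['Interface', 'Sub_interface'],
).to_frame(index=False)

# Outer merge keeps duplicated rows and adds the pairs that are absent;
# indicator=True records which side each row came from.
completed = grid.merge(df, on=['Interface', 'Sub_interface'],
                       how='outer', indicator=True)

# Rows found only in the grid are the missing combinations.
missing = completed.loc[completed['_merge'] == 'left_only',
                        ['Interface', 'Sub_interface']]
print(missing)
```

The _merge column can be dropped once the missing rows have been identified.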

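Also be careful about setting an index that contains null values; the pandas documentation advises against it, although so far I have not noticed any problems when reindexing.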

You can also use unstack/stack, since the index is unique:

df.set_index(['Interface', 'Sub_interface']).unstack().stack(dropna=False).reset_index()

      Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA

Answer 2

One approach is to pivot and then stack:

(df.assign(dummy=1)
   .pivot_table(index=['Device','Interface'], columns='Sub_interface', 
                values='dummy', fill_value=1)
   .reindex(sub_interface_list, fill_value=1, axis=1)
   .stack().reset_index(name='dummy')
   .drop('dummy', axis=1)
)

Output:

    Device     Interface Sub_interface
0  DeviceA  Eth-Trunk100          1030
1  DeviceA  Eth-Trunk100          1035
2  DeviceA  Eth-Trunk100          1039
3  DeviceA  Eth-Trunk100          1050
4  DeviceA  Eth-Trunk100          1059
5  DeviceA  Eth-Trunk101          1030
6  DeviceA  Eth-Trunk101          1035
7  DeviceA  Eth-Trunk101          1039
8  DeviceA  Eth-Trunk101          1050
9  DeviceA  Eth-Trunk101          1059
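The output above is ordered by Sub_interface, whereas the question asks for each missing value to be appended as the last row of its interface. A minimal sketch of that variant follows. It is not from either answer, the names pairs, grid, flagged, missing, combined and result are illustrative, and it assumes pandas 1.2 or later for how='cross'.

```python
import pandas as pd

sub_interface_list = ['1030', '1035', '1039', '1050', '1059']

df = pd.DataFrame({
    'Device': ['DeviceA'] * 9,
    'Interface': ['Eth-Trunk100'] * 5 + ['Eth-Trunk101'] * 4,
    'Sub_interface': ['1030', '1035', '1039', '1050', '1059',
                      '1030', '1039', '1050', '1059'],
})

# Cross join each (Device, Interface) pair with the required sub-interfaces,
# so that new rows already carry the right Device.
pairs = df[['Device', 'Interface']].drop_duplicates()
grid = pairs.merge(pd.DataFrame({'Sub_interface': sub_interface_list}), how='cross')

# Anti-join: keep the grid rows that have no match in df.
flagged = grid.merge(df, on=['Device', 'Interface', 'Sub_interface'],
                     how='left', indicator=True)
missing = flagged.loc[flagged['_merge'] == 'left_only',
                      ['Device', 'Interface', 'Sub_interface']]

# Append the missing rows, then reorder with a stable sort on a group id so
# that, within each interface, the appended rows come after the existing ones.
combined = pd.concat([df, missing], ignore_index=True)
group_id = combined.groupby(['Device', 'Interface'], sort=False).ngroup()
order = group_id.to_numpy().argsort(kind='stable')
result = combined.iloc[order].reset_index(drop=True)
print(result)
```

The stable sort is what preserves the original row order inside each group; a non-stable sort could shuffle rows that share the same group id.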


2 answers

Answer 1
4 2021-03-09 03:05:32

Answer 2
1 2021-03-09 02:54:58
