pandas: create a new DataFrame from a split column with missing values and differing lengths

Question

I have a bunch of categorical data concatenated into a single space-separated string (string string string, etc.) in each of my rows. I need to separate these categories into columns so that I have something that looks like this:

       var1 var2 var3
row 1:    1    1    0
row 2:    0    0    1
row 3:    0    1    1   ....

The categorical data can be generated using this:

import pandas as pd, numpy as np

np.random.seed(75)
a = [" ".join(np.unique(['var%d'%np.random.randint(5) for i in range(np.random.randint(10))])) for i in range(np.random.randint(10))]
a[0] = np.nan

b = pd.DataFrame(a, columns=['descriptor'])
print(b)

And it looks like this:

                 descriptor
0                       NaN
1                 var0 var1
2       var0 var1 var2 var3
3            var0 var2 var4
4                 var0 var4
5                      var3
6  var0 var1 var2 var3 var4
7            var0 var1 var2

And I would like to construct the following:

   var0  var1  var2  var3  var4
0     0     0     0     0     0
1     1     1     0     0     0
2     1     1     1     1     0
3     1     0     1     0     1
4     1     0     0     0     1
5     0     0     0     1     0
6     1     1     1     1     1
7     1     1     1     0     0

But I'm not sure how to approach this. The DataFrame is fairly large (25,000 rows with 5,000 descriptors), so the method has to be performant. I tried using str.split(' ').tolist(), but I'm not sure how to get from that to the final product.

Answer 1

First, fill the missing value with some kind of sentinel, and split the data.

b = b.fillna('_')
# .str is a Series accessor, so select the column before splitting on whitespace
b['descriptor'] = b['descriptor'].str.split()

Then, using the approach from this answer, pivot into columns and drop the sentinel. This won't be ultra performant because of the apply, but it is probably reasonable on data of your size.

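# One indicator Series per row, aligned into columns; fill gaps with 0, drop the sentinel column,
# and cast to int so the result prints as 0/1 rather than 0.0/1.0
s = b['descriptor'].apply(lambda x: pd.Series(1, index=x)).fillna(0).drop('_', axis=1).astype(int)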

In [115]: s
Out[115]: 
   var0  var1  var2  var3  var4
0     0     0     0     0     0
1     1     1     0     0     0
2     1     1     1     1     0
3     1     0     1     0     1
4     1     0     0     0     1
5     0     0     0     1     0
6     1     1     1     1     1
7     1     1     1     0     0
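
Since the answer notes that the apply step is not especially fast, here is a minimal sketch of a different technique, not the one used above: pandas' built-in Series.str.get_dummies, which builds the 0/1 indicator columns directly from a delimited string column. The frame below hard-codes the rows printed in the question so the example does not depend on the random seed; the name dummies is only illustrative. Whether this is actually faster on 25,000 rows with 5,000 descriptors is an assumption worth timing on the real data.

```python
import numpy as np
import pandas as pd

# Same rows as the example frame in the question, hard-coded for reproducibility.
b = pd.DataFrame({"descriptor": [
    np.nan,
    "var0 var1",
    "var0 var1 var2 var3",
    "var0 var2 var4",
    "var0 var4",
    "var3",
    "var0 var1 var2 var3 var4",
    "var0 var1 var2",
]})

# Replace NaN with an empty string so the missing row becomes all zeros,
# then build one 0/1 indicator column per space-separated token.
dummies = b["descriptor"].fillna("").str.get_dummies(sep=" ")
print(dummies)
```

No sentinel column has to be dropped here, because an empty string contributes no tokens.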

Solution 1
0 2014-08-13 15:16:47