Pandas creating new DataFrame from split columns with missing values, differing lengths
I have a bunch of categorical data that is concatenated together as a string (string string string, etc.) in each of my rows. I need to separate these categorical data into columns so that I have something that looks like this:
var1 var2 var3
row 1: 1 1 0
row 2: 0 0 1
row 3: 0 1 1 ....
The categorical data can be generated using this:
import pandas as pd, numpy as np
np.random.seed(75)
a = [" ".join(np.unique(['var%d'%np.random.randint(5) for i in range(np.random.randint(10))])) for i in range(np.random.randint(10))]
a[0] = np.nan
b = pd.DataFrame(a, columns=['descriptor'])
print(b)
And it looks like this:
descriptor
0 NaN
1 var0 var1
2 var0 var1 var2 var3
3 var0 var2 var4
4 var0 var4
5 var3
6 var0 var1 var2 var3 var4
7 var0 var1 var2
And I would like to construct the following:
var0 var1 var2 var3 var4
0 0 0 0 0 0
1 1 1 0 0 0
2 1 1 1 1 0
3 1 0 1 0 1
4 1 0 0 0 1
5 0 0 0 1 0
6 1 1 1 1 1
7 1 1 1 0 0
But I'm not sure how to approach this. The DataFrame is fairly large (25,000 rows with 5000 descriptors) so the method would have to be performant. I tried using str.split(' ').tolist(), but I'm not sure how to take that and convert it to the final product.
First, fill the missing value with some kind of sentinel, and split the data.
b = b.fillna('_')
b['descriptor'] = b['descriptor'].str.split()
Then, using the approach from this answer, pivot into columns, and drop the sentinel. This won't be ultra performant because apply builds a new Series for every row, but it is probably reasonable on data of your size.
s = b['descriptor'].apply(lambda x: pd.Series(1,index=x)).fillna(0).drop('_', axis=1)
In [115]: s
Out[115]:
var0 var1 var2 var3 var4
0 0 0 0 0 0
1 1 1 0 0 0
2 1 1 1 1 0
3 1 0 1 0 1
4 1 0 0 0 1
5 0 0 0 1 0
6 1 1 1 1 1
7 1 1 1 0 0
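If the per-row apply turns out to be too slow, one alternative worth trying is pandas' built-in Series.str.get_dummies, which splits on a separator and builds the indicator columns in a single call. This is a minimal sketch rather than part of the approach above; the small frame is an illustrative stand-in for the question's data, and no sentinel is needed because missing values come out as all-zero rows.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame; the values are illustrative.
b = pd.DataFrame(
    {"descriptor": [np.nan, "var0 var1", "var0 var2 var4", "var3"]}
)

# Split each string on a space and build one 0/1 indicator column per
# distinct token. Rows that are NaN become all zeros.
s = b["descriptor"].str.get_dummies(sep=" ")
print(s)
```

The result is dense, so at 25,000 rows by 5,000 columns of 64-bit integers it would take roughly 1 GB; whether that is acceptable depends on the machine.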