
Pandas creating new DataFrame from split columns with missing values, differing lengths

I have a bunch of categorical data concatenated together as a single string (string string string, etc.) in each of my rows. I need to separate these categories into indicator columns so that I end up with something that looks like this:

       var1 var2 var3
row 1:    1    1    0
row 2:    0    0    1
row 3:    0    1    1   ....

The categorical data can be generated using this:

import pandas as pd, numpy as np

np.random.seed(75)
a = [" ".join(np.unique(['var%d'%np.random.randint(5) for i in range(np.random.randint(10))])) for i in range(np.random.randint(10))]
a[0] = np.nan

b = pd.DataFrame(a, columns=['descriptor'])
print(b)

And it looks like this:

                 descriptor
0                       NaN
1                 var0 var1
2       var0 var1 var2 var3
3            var0 var2 var4
4                 var0 var4
5                      var3
6  var0 var1 var2 var3 var4
7            var0 var1 var2

And I would like to construct the following:

   var0  var1  var2  var3  var4
0     0     0     0     0     0
1     1     1     0     0     0
2     1     1     1     1     0
3     1     0     1     0     1
4     1     0     0     0     1
5     0     0     0     1     0
6     1     1     1     1     1
7     1     1     1     0     0

But I'm not sure how to approach this. The DataFrame is fairly large (25,000 rows with 5,000 descriptors), so the method has to be performant. I tried using str.split(' ').tolist(), but I'm not sure how to take that and convert it to the final product.

First, fill the missing value with some kind of sentinel, and split the data.

b = b.fillna('_')
# .str is a Series accessor, so split the column rather than the whole DataFrame
b['descriptor'] = b['descriptor'].str.split()
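
To make the intermediate state concrete, here is a minimal sketch of the fill-and-split step on a tiny hypothetical frame (the name demo and its three rows are made up for illustration, not taken from the question). After the split, each cell holds a list of tokens, and the formerly missing row holds only the sentinel.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the question's `b`
demo = pd.DataFrame({'descriptor': [np.nan, 'var0 var1', 'var3']})

# Replace the missing value with a sentinel so that .str.split() yields a list for every row
demo = demo.fillna('_')
demo['descriptor'] = demo['descriptor'].str.split()

# Each cell is now a list of tokens; the former NaN row is ['_']
print(demo['descriptor'].tolist())
```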

Then, using the approach from this answer, pivot into columns and drop the sentinel. This won't be ultra performant because of the apply, but it is probably reasonable on data of your size.

s = b['descriptor'].apply(lambda x: pd.Series(1,index=x)).fillna(0).drop('_', axis=1)

In [115]: s
Out[115]: 
   var0  var1  var2  var3  var4
0     0     0     0     0     0
1     1     1     0     0     0
2     1     1     1     1     0
3     1     0     1     0     1
4     1     0     0     0     1
5     0     0     0     1     0
6     1     1     1     1     1
7     1     1     1     0     0
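
If the apply turns out to be too slow, one alternative worth trying is pandas' built-in Series.str.get_dummies, which splits on a separator and builds the indicator columns in one call. This is a sketch of a different technique from the one above, not a drop-in replacement. It assumes the descriptors are separated by single spaces, and it rebuilds the question's sample data by hand rather than through the random generator.

```python
import numpy as np
import pandas as pd

# The question's sample data, written out explicitly
b = pd.DataFrame({'descriptor': [
    np.nan,
    'var0 var1',
    'var0 var1 var2 var3',
    'var0 var2 var4',
    'var0 var4',
    'var3',
    'var0 var1 var2 var3 var4',
    'var0 var1 var2',
]})

# get_dummies splits each string on `sep` and returns one indicator column per
# distinct token, with columns in sorted order. Missing values produce a row of
# zeros, so no sentinel is needed.
s = b['descriptor'].str.get_dummies(sep=' ')
print(s)
```

The result is still a dense frame, so at 25,000 rows by 5,000 descriptors it holds 125 million cells (about 1 GB as 64-bit integers). It may be worth downcasting the result to a smaller integer type if memory is tight.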
