How to convert a dataframe to sparse matrix with mixed column types?
I have a data frame in the following format:
df:
key f1 f2
k1 10 a, b, c
k2 20 b, d
k3 15 NaN
The column f2 holds a bag of words in each row. I want to convert this data frame into a sparse matrix, because f2 contains a few thousand distinct words. The end result I expect has the following format:
key f1 f2.a f2.b f2.c f2.d
k1 10 1 1 1 0
k2 20 0 1 0 1
k3 15 0 0 0 0
I could figure out how to create a sparse matrix from just the key and f2 fields. I first melt the column f2, which gives the following dataframe:
df1:
key f2
k1 a
k1 b
k1 c
k2 b
k2 d
Then I encode f2 with the LabelEncoder from the sklearn.preprocessing package, and create a sparse matrix as follows:
df1['trainrow'] = np.arange(df1.shape[0])
sparse.csr_matrix((np.ones(df1.shape[0]), (df1.trainrow, df1.f2_encoded)))
This creates a sparse matrix by one-hot encoding the field f2. But I am not sure how to concatenate it with the numerical field f1.
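A minimal runnable sketch of the melt-and-encode steps described above, using the three example rows from the question. It assumes the frame has a default 0..n-1 index, and it uses the original row position as the row index (rather than a running counter over the melted rows) so that all words of one key land in the same matrix row; that choice is an assumption about the intended result, not something stated in the question.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import LabelEncoder

# Example data from the question.
df = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "f1": [10, 20, 15],
    "f2": ["a, b, c", "b, d", np.nan],
})

# Melt f2: one entry per (original row, word); NaN rows drop out.
words = df["f2"].dropna().str.split(", ").explode()

# Encode each distinct word as an integer column index.
encoder = LabelEncoder()
col_idx = encoder.fit_transform(words.to_numpy())

# The exploded Series keeps the original row labels in its index,
# so they can serve directly as row positions (assumes a 0..n-1 index).
row_idx = words.index.to_numpy()

# Passing shape explicitly keeps the all-zero row for k3.
onehot = sparse.csr_matrix(
    (np.ones(len(words), dtype=np.int64), (row_idx, col_idx)),
    shape=(len(df), len(encoder.classes_)),
)
```

Column j of `onehot` corresponds to `encoder.classes_[j]`.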
You can use concat with str.get_dummies and add_prefix:
df = pd.concat([df[['key','f1']], df.f2.str.get_dummies(sep=', ').add_prefix('f2.')], axis=1)
print (df)
key f1 f2.a f2.b f2.c f2.d
0 k1 10 1 1 1 0
1 k2 20 0 1 0 1
2 k3 15 0 0 0 0
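Since the question asks for a sparse matrix, the wide frame can then be handed to SciPy. The sketch below is one possible way to do that, assuming the dense intermediate frame fits in memory; the names `wide`, `feature_cols` and `X` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Example data from the question.
df = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "f1": [10, 20, 15],
    "f2": ["a, b, c", "b, d", np.nan],
})

# Same concat / get_dummies / add_prefix step as above.
wide = pd.concat(
    [df[["key", "f1"]], df["f2"].str.get_dummies(sep=", ").add_prefix("f2.")],
    axis=1,
)

# A sparse matrix holds numbers only, so the string key is kept aside.
feature_cols = [c for c in wide.columns if c != "key"]
X = sparse.csr_matrix(wide[feature_cols].to_numpy())
```

Row i of `X` still corresponds to `wide["key"].iloc[i]`, and `feature_cols` gives the column order.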
With a very large number of distinct values, get_dummies is very slow; you can use a custom function f instead:
def f(category_list):
    n_categories = len(category_list)
    return pd.Series(dict(zip(category_list, [1]*n_categories)))

# drop NaN rows and split each string into a list of values
df1 = df.f2.dropna().str.split(', ').apply(f).add_prefix('f2.')
df2 = pd.concat([df[['key','f1']], df1], axis=1)
# replace NaN with 0 from the third column to the end of df2
df2.iloc[:, 2: ] = df2.iloc[:, 2: ].fillna(0).astype(int)
print (df2)
key f1 f2.a f2.b f2.c f2.d
0 k1 10 1 1 1 0
1 k2 20 0 1 0 1
2 k3 15 0 0 0 0
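If the goal is a sparse result rather than a wide DataFrame, a different technique from the one above is scikit-learn's MultiLabelBinarizer with sparse_output=True, which avoids building the dense frame at all. This is a swapped-in alternative, not the answer's method, and no timing claim is made for it here; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Example data from the question.
df = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "f1": [10, 20, 15],
    "f2": ["a, b, c", "b, d", np.nan],
})

# One list of words per row; missing values become an empty list
# so that no spurious "empty string" category is created.
word_lists = [s.split(", ") if isinstance(s, str) else [] for s in df["f2"]]

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
f2_sparse = mlb.fit_transform(word_lists)
```

Column j of `f2_sparse` corresponds to `mlb.classes_[j]`.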
Timings:
In [256]: %timeit s.str.get_dummies(sep=', ')
1 loop, best of 3: 1min 16s per loop
In [257]: %timeit (s.dropna().str.split(', ').apply(f).fillna(0).astype(int))
1 loop, best of 3: 2.95 s per loop
Code for timings:
np.random.seed(100)
s = pd.DataFrame(np.random.randint(10000, size=(1000,1000))).astype(str).apply(', '.join, axis=1)
print (s)
df2 = s.str.get_dummies(sep=', ')
print (df2)
def f(category_list):
    n_categories = len(category_list)
    return pd.Series(dict(zip(category_list, [1]*n_categories)))
print (s.dropna().str.split(', ').apply(f).fillna(0).astype(int))
I have figured out the way I wanted to solve this, so I am posting it as an answer for my future reference and for the benefit of others.
Because of the enormous size of the data, I had to work with sparse matrices only.
The first step is to convert the bag of words to a vectorized format. I used CountVectorizer (thanks to @MaxU for this) as follows:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
df2 = vectorizer.fit_transform(df['f2'].str.replace(' ',''))
I would like to ignore spaces and use the comma as a forced delimiter. I couldn't figure out how to do that, so I removed the spaces, because otherwise the vectorizer splits the words at spaces.
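One possible way to make the comma the only delimiter is to pass a custom tokenizer to CountVectorizer. The sketch below is an assumption-laden illustration rather than the method used in this answer: `split_on_commas` is a hypothetical helper, missing f2 values are treated as empty strings, and `binary=True` is chosen so each word is recorded as 0/1.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def split_on_commas(text):
    """Hypothetical helper: split on commas and strip surrounding spaces."""
    return [tok.strip() for tok in text.split(",") if tok.strip()]


# Example data from the question.
df = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "f1": [10, 20, 15],
    "f2": ["a, b, c", "b, d", np.nan],
})

vectorizer = CountVectorizer(
    tokenizer=split_on_commas,  # overrides the default regex tokenization
    token_pattern=None,         # not used when a tokenizer is supplied
    lowercase=False,            # keep words exactly as written
    binary=True,                # 0/1 indicators instead of counts
)

# NaN is not a string, so replace it with an empty document first.
bag = vectorizer.fit_transform(df["f2"].fillna(""))
```

Because the tokenizer replaces the default token pattern, single-character words such as `a` and `b` are kept, which the default pattern would drop.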
That creates df2 as a sparse matrix.
Then the other field, f1, is converted to a separate sparse matrix:
from scipy.sparse import csr_matrix, hstack
df1 = csr_matrix(df[['f1']].fillna(0))
Then I used hstack to combine the two:
sparseDF = hstack((df1,df2),format='csr')
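A small self-contained sketch of the combining step, with column names and row keys tracked alongside the matrix. The `bag` matrix here is a hand-built stand-in for the vectorizer output on the question's example, and `bag_columns`, `columns` and `row_keys` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Example data from the question (f2 is represented by `bag` below).
df = pd.DataFrame({"key": ["k1", "k2", "k3"], "f1": [10, 20, 15]})

# Stand-in for the sparse word matrix (columns a, b, c, d).
bag = csr_matrix(np.array([[1, 1, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 0, 0]]))
bag_columns = ["f2.a", "f2.b", "f2.c", "f2.d"]

# Numeric field as a one-column sparse matrix; missing values become 0.
numeric = csr_matrix(df[["f1"]].fillna(0).to_numpy())

# Stack column-wise; both blocks must have the same number of rows.
combined = hstack((numeric, bag), format="csr")

# Sparse matrices carry no labels, so keep them in parallel structures.
columns = ["f1"] + bag_columns
row_keys = df["key"].to_numpy()
```

Row i of `combined` belongs to `row_keys[i]`, and column j is named `columns[j]`.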