How to make a multi-dimensional column into a single-valued vector for training data in sklearn/pandas
I have a data set in which a certain column is a combination of several independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
The column itself is composed of multiple values, and I need to pass it as a feature vector to a machine learning algorithm. I cannot simply map each combination of values to a single value, like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector:
id age marks
1 5 1
2 7 2
3 4 3
and then pass it to the algorithm, because the number of such combinations would be unbounded, and such an encoding might not capture the real meaning of the data.
How should I handle a situation where an individual feature is a combination of multiple features?
Note:
The values in the marks column are just examples; it could be any list of values: a list of integers, a list of strings, or a single string composed of multiple strings separated by commas.
UPDATE: I think we can use CountVectorizer in this case.
Assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
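import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
# marks must hold lists of strings for ' '.join to work
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())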
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
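For the case where marks is stored as comma-separated strings, a similar indicator matrix can be sketched with pandas alone. This is not part of the original answer: it swaps CountVectorizer for `Series.str.get_dummies`, which produces 0/1 indicators rather than counts, and the `mark_` prefix and the DataFrame below are made up for illustration.

```python
import pandas as pd

# Hypothetical data mirroring the example, with marks as comma-separated strings.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": ["3,6,7", "1,2", "34,78,2", "3,6,7"],
})

# One 0/1 indicator column per distinct value found in marks.
indicators = df["marks"].str.get_dummies(sep=",")

# Prefix the indicator columns and attach them to the other numeric feature,
# giving a fixed-width matrix that can be passed to an estimator.
X = pd.concat([df[["age"]], indicators.add_prefix("mark_")], axis=1)
print(X)
```

Rows with the same set of marks get identical indicator rows, and rows that share only some values still overlap in the corresponding columns, which a single category code cannot express.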
OLD answer:
You can first convert your list to a string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
This will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
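When marks already holds plain strings, `.str.join('|')` joins the characters of each string (so "3,6,7" becomes "3|,|6|,|7"); the codes still come out distinct, but the join is not needed. A minimal sketch that factorizes the strings directly, using a made-up DataFrame that mirrors the one above:

```python
import pandas as pd

# Hypothetical data mirroring the example above.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": ["3,6,7", "1,2", "34,78,2", "3,6,7"],
})

# factorize assigns an integer code to each distinct string,
# in order of first appearance, and returns the distinct values too.
codes, uniques = pd.factorize(df["marks"])
df["new"] = pd.Categorical(codes)
print(df)
```

Keeping `uniques` around makes it possible to map a code back to the original string, for example `uniques[codes[0]]`.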
To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))  # list() is needed on Python 3, where zip returns an iterator
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column, try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back so that df is actually changed
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5
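A single mean discards most of the information in each list. As a hedged extension of the same idea, the sketch below reduces each list to several fixed-length summary statistics instead; the helper `summarize` and the column names `c_count`, `c_mean` and `c_max` are hypothetical, and which statistics are meaningful depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[3, 4, 3], c=[[1, 2, 3], [1, 2], []]))

def summarize(values):
    """Reduce a variable-length list to a fixed set of summary statistics."""
    if len(values) == 0:
        # An empty list has no mean or max, so mark those as missing.
        return pd.Series({"c_count": 0, "c_mean": np.nan, "c_max": np.nan})
    return pd.Series({
        "c_count": len(values),
        "c_mean": np.mean(values),
        "c_max": np.max(values),
    })

# Applying a function that returns a Series yields one column per statistic.
features = df["c"].apply(summarize)
out = pd.concat([df[["a", "b"]], features], axis=1)
print(out)
```

Empty lists produce NaN for the mean and max, so they would need imputing or dropping before being passed to most estimators.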
You can pd.factorize tuples.
Assuming marks is a list:
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize:
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
Setup for df:
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)