
How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which a certain column is a combination of several independent values, as in the example below:

id        age        marks
1          5          3,6,7
2          7          1,2
3          4          34,78,2

Thus the column by itself is composed of multiple values, and I need to pass the vector into a machine learning algorithm. I cannot really combine the values to assign a single value like:

3,6,7 => 1
1,2 => 2
34,78,2 => 3

making my new vector:

id        age        marks
1          5          1
2          7          2
3          4          3

and then pass it to the algorithm, because the number of such combinations would be unbounded, and such an encoding might not capture the real meaning of the data.

How should I handle a situation where an individual feature is a combination of multiple features?

Note:

The values in the marks column are just examples; it could be any list of values. It could be a list of integers, a list of strings, or a single string composed of multiple strings separated by commas.

UPDATE: I think we can use CountVectorizer in this case:

Assuming we have the following DF:

In [33]: df
Out[33]:
   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4   11    [3, 6, 7]

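In [34]: %paste
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

# A custom tokenizer is passed because the default token_pattern ignores single-character tokens such as "3"
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)

# Join each list into one space-separated string; map(str, ...) also covers lists of integers
X = vect.fit_transform(df.marks.apply(lambda xs: ' '.join(map(str, xs))))

# get_feature_names() was removed in scikit-learn 1.2; older versions need it instead of get_feature_names_out()
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
## -- End pasted text --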

Result:

In [35]: r
Out[35]:
   1  2  3  34  6  7  78
0  0  0  1   0  1  1   0
1  1  1  0   0  0  0   0
2  0  1  0   1  0  0   1
3  0  0  1   0  1  1   0
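As a possible alternative to the CountVectorizer approach above (not part of the original answer), scikit-learn's MultiLabelBinarizer builds the same kind of indicator columns directly from lists, without joining them into strings or pulling in a tokenizer. The sketch below assumes marks holds lists of strings and that a 0/1 presence flag is enough; unlike CountVectorizer it does not count repeated values. The names indicators and features are just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same example data as above, with marks assumed to be lists of strings
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": [["3", "6", "7"], ["1", "2"], ["34", "78", "2"], ["3", "6", "7"]],
})

# One 0/1 column per distinct value seen in any list
mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(
    mlb.fit_transform(df["marks"]),
    columns=mlb.classes_,
    index=df.index,
)

# Put the indicator columns next to the ordinary numeric feature
features = pd.concat([df[["age"]], indicators], axis=1)
print(features)
```

The resulting features frame has a fixed number of columns per row, so it can be passed to an estimator as is.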

OLD answer:

You can first convert your list to a string and then categorize it:

In [119]: df
Out[119]:
   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4   11    [3, 6, 7]

In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])

In [121]: df
Out[121]:
   id  age        marks new
0   1    5    [3, 6, 7]   0
1   2    7       [1, 2]   1
2   3    4  [34, 78, 2]   2
3   4   11    [3, 6, 7]   0

In [122]: df.dtypes
Out[122]:
id          int64
age         int64
marks      object
new      category
dtype: object

This will also work if marks is a column of strings:

In [124]: df
Out[124]:
   id  age    marks
0   1    5    3,6,7
1   2    7      1,2
2   3    4  34,78,2
3   4   11    3,6,7

In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])

In [126]: df
Out[126]:
   id  age    marks new
0   1    5    3,6,7   0
1   2    7      1,2   1
2   3    4  34,78,2   2
3   4   11    3,6,7   0
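If marks is a column of comma-separated strings and one indicator column per value is preferred over a single category code, pandas can split and encode in one step with Series.str.get_dummies. This is only a sketch of that alternative, not the answer's method; the names indicators and features are illustrative.

```python
import pandas as pd

# marks as comma-separated strings, as in the last example above
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": ["3,6,7", "1,2", "34,78,2", "3,6,7"],
})

# Split each string on "," and create one 0/1 column per distinct value
indicators = df["marks"].str.get_dummies(sep=",")

# Combine with the remaining numeric feature
features = df[["age"]].join(indicators)
print(features)
```

Rows 0 and 3 end up with identical indicator values, just as they share a category code in the factorize version, but rows that share only some marks now overlap partially instead of being treated as unrelated categories.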

To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:

import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values              # row-wise: one entry per row
list(zip(*df.values))  # column-wise: one tuple per column (zip is lazy in Python 3, so wrap it in list)

where
>>> df

   a  b          c
0  1  3  [1, 2, 3]
1  2  4     [1, 2]
2  3  3         []
3  4  4        [2]
>>> df.values

array([[1, 3, [1, 2, 3]],
       [2, 4, [1, 2]],
       [3, 3, []],
       [4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))

[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]

To convert a column, try this:

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
# Assign the result back: apply() returns a new Series and does not modify df in place
df['c'] = df['c'].apply(np.mean)

before:

>>> df
   a  b          c
0  1  3  [1, 2, 3]
1  2  4     [1, 2]

after:

>>> df
   a  b    c
0  1  3  2.0
1  2  4  1.5
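A single mean throws away most of what the list contained. One way to keep a little more, sketched here under the assumption that the lists hold numbers and may be empty, is to derive several fixed-length summary columns. The column names c_count, c_mean and c_max are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[3, 4, 3], c=[[1, 2, 3], [1, 2], []]))

# Each summary maps a variable-length list to one number;
# empty lists get NaN for mean and max instead of raising or warning
summary = pd.DataFrame({
    "c_count": df["c"].apply(len),
    "c_mean": df["c"].apply(lambda v: np.mean(v) if len(v) else np.nan),
    "c_max": df["c"].apply(lambda v: max(v) if len(v) else np.nan),
})

# Replace the list column with its summaries
features = df[["a", "b"]].join(summary)
print(features)
```

Most scikit-learn estimators reject NaN, so rows built from empty lists would still need imputing or dropping before fitting.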

You can pd.factorize tuples.
Assuming marks is a list:

df

   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4    5    [3, 6, 7]

Apply tuple and factorize:

df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)

   id  age        marks  new
0   1    5    [3, 6, 7]    1
1   2    7       [1, 2]    2
2   3    4  [34, 78, 2]    3
3   4    5    [3, 6, 7]    1

Setup df:

df = pd.DataFrame([
        [1, 5, ['3', '6', '7']],
        [2, 7, ['1', '2']],
        [3, 4, ['34', '78', '2']],
        [4, 5, ['3', '6', '7']]
    ], [0, 1, 2, 3], ['id', 'age', 'marks']
)
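One detail worth keeping in mind with the tuple approach: tuples compare element by element, so the same marks in a different order get different codes. If order carries no meaning in your data (an assumption that depends on the data set), sorting each list before converting it to a tuple is one way to make the encoding order-insensitive. A minimal sketch with made-up values:

```python
import pandas as pd

# The first two rows hold the same marks in a different order
marks = pd.Series([["3", "6", "7"], ["7", "6", "3"], ["1", "2"]])

# Order-sensitive: the first two rows get different codes
ordered_codes = pd.factorize(marks.apply(tuple))[0] + 1

# Order-insensitive: sort each list before turning it into a tuple
unordered_codes = pd.factorize(marks.apply(lambda xs: tuple(sorted(xs))))[0] + 1

print(ordered_codes)
print(unordered_codes)
```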
