
How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which a certain column is a combination of several independent values, as in the example below:

id        age        marks
1          5          3,6,7
2          7          1,2
3          4          34,78,2

The column itself is therefore composed of multiple values, and I need to pass it as a feature vector to a machine learning algorithm. I cannot really combine the values and assign a single value to each combination, like:

3,6,7 => 1
1,2 => 2
34,78,2 => 3

making my new vector as

id        age        marks
1          5          1
2          7          2
3          4          3

and then subsequently pass it to the algorithm, because the number of such combinations would be infinite and a single arbitrary code might not capture the real meaning of the data.

How do I handle a situation where an individual feature is a combination of multiple features?

Note:

The values in the marks column are just examples; each entry could be any list of values: a list of integers, a list of strings, or a single string made up of several substrings separated by commas.

UPDATE: I think we can use CountVectorizer in this case:

Assuming we have the following DataFrame:

In [33]: df
Out[33]:
   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4   11    [3, 6, 7]

In [34]: %paste
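import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

# a custom tokenizer is used because the default token_pattern drops single-character tokens such as '3'
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)

# join each list into one space-separated string; map(str, ...) also covers integer elements
X = vect.fit_transform(df.marks.apply(lambda m: ' '.join(map(str, m))))

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())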
## -- End pasted text --

Result:

In [35]: r
Out[35]:
   1  2  3  34  6  7  78
0  0  0  1   0  1  1   0
1  1  1  0   0  0  0   0
2  0  1  0   1  0  0   1
3  0  0  1   0  1  1   0
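
A similar 0/1 indicator matrix can probably be built without a tokenizer by using scikit-learn's MultiLabelBinarizer, which works directly on a column of lists. This is only a sketch of an alternative, not the method used above; it assumes marks holds lists of strings, and the names indicators and features are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same toy data as above; marks is assumed to hold lists of strings.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": [["3", "6", "7"], ["1", "2"], ["34", "78", "2"], ["3", "6", "7"]],
})

mlb = MultiLabelBinarizer()

# One 0/1 indicator column per distinct value seen in any list.
indicators = pd.DataFrame(
    mlb.fit_transform(df["marks"]),
    columns=mlb.classes_,
    index=df.index,
)

# Put the indicators next to the scalar feature to form the training matrix.
features = pd.concat([df[["age"]], indicators], axis=1)
print(features)
```

Unlike CountVectorizer, this records only presence or absence, so a value repeated within one list is not counted twice.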

OLD answer:

You can first convert your list to a string and then categorize it:

In [119]: df
Out[119]:
   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4   11    [3, 6, 7]

In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])

In [121]: df
Out[121]:
   id  age        marks new
0   1    5    [3, 6, 7]   0
1   2    7       [1, 2]   1
2   3    4  [34, 78, 2]   2
3   4   11    [3, 6, 7]   0

In [122]: df.dtypes
Out[122]:
id          int64
age         int64
marks      object
new      category
dtype: object

This will also work if marks is a column of strings:

In [124]: df
Out[124]:
   id  age    marks
0   1    5    3,6,7
1   2    7      1,2
2   3    4  34,78,2
3   4   11    3,6,7

In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])

In [126]: df
Out[126]:
   id  age    marks new
0   1    5    3,6,7   0
1   2    7      1,2   1
2   3    4  34,78,2   2
3   4   11    3,6,7   0
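
One detail worth knowing for the string case: on plain strings, .str.join inserts the separator between individual characters, so the join step only changes how each string looks, not which rows are equal. A minimal sketch, assuming the same toy data, suggests that factorizing the raw strings gives the same codes (codes_raw and codes_joined are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": ["3,6,7", "1,2", "34,78,2", "3,6,7"],
})

# On plain strings, .str.join puts the separator between characters.
joined = df["marks"].str.join("|")
print(joined.iloc[0])

# Factorizing the raw strings and the joined strings yields the same codes,
# because identical inputs stay identical after the join.
codes_raw = pd.factorize(df["marks"])[0]
codes_joined = pd.factorize(joined)[0]

df["new"] = pd.Categorical(codes_raw)
print(df)
```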

To access them as either [[x, y, z], [x, y, z]] (row-wise) or [[x, x], [y, y], [z, z]] (column-wise), whichever is most appropriate for the function you need to call, use:

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))  # in Python 3, zip returns an iterator, so wrap it in list()

where
>>> df

   a  b          c
0  1  3  [1, 2, 3]
1  2  4     [1, 2]
2  3  3         []
3  4  4        [2]
>>> df.values

array([[1, 3, [1, 2, 3]],
       [2, 4, [1, 2]],
       [3, 3, []],
       [4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))

[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]

To convert a column to a single value per row (here, the mean of each list), try this:

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back so that df actually changes

before:

>>> df
   a  b          c
0  1  3  [1, 2, 3]
1  2  4     [1, 2]

after:

>>> df
   a  b    c
0  1  3  2.0
1  2  4  1.5
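
A single mean discards most of the information in each list, so one possible extension is to compute several fixed-length statistics per row. The sketch below is an illustration only: the helper summarize and the column names c_len, c_mean and c_max are invented here, and the choice of statistics is an assumption that depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[3, 4, 3], c=[[1, 2, 3], [1, 2], []]))

def summarize(values):
    """Return a few fixed-length statistics for one variable-length list."""
    if len(values) == 0:
        # An empty list has no mean or max; NaN marks them as missing.
        return pd.Series({"c_len": 0, "c_mean": np.nan, "c_max": np.nan})
    return pd.Series({
        "c_len": len(values),
        "c_mean": np.mean(values),
        "c_max": np.max(values),
    })

# Applying a function that returns a Series yields one column per statistic.
stats = df["c"].apply(summarize)

# Replace the list column with its summary columns.
features = pd.concat([df[["a", "b"]], stats], axis=1)
print(features)
```

Rows with empty lists end up with NaN, which most estimators reject, so they would need imputing or dropping before training.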

You can use pd.factorize on tuples. Lists are not hashable, so each list is converted to a tuple first.
Assuming marks is a column of lists:

df

   id  age        marks
0   1    5    [3, 6, 7]
1   2    7       [1, 2]
2   3    4  [34, 78, 2]
3   4    5    [3, 6, 7]

Apply tuple and factorize:

df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)

   id  age        marks  new
0   1    5    [3, 6, 7]    1
1   2    7       [1, 2]    2
2   3    4  [34, 78, 2]    3
3   4    5    [3, 6, 7]    1
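
Note that tuples compare element by element, so two lists with the same values in a different order get different codes. If order does not matter for your data (an assumption about the problem, not something stated in the question), a small variation is to sort each list before converting it. The names ordered and unordered below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "marks": [["3", "6", "7"], ["1", "2"], ["7", "3", "6"], ["3", "6", "7"]],
})

# Order-sensitive: ['7', '3', '6'] is treated as a new combination.
ordered = pd.factorize(df["marks"].apply(tuple))[0] + 1

# Order-insensitive: sort each list before turning it into a tuple.
unordered = pd.factorize(df["marks"].apply(lambda m: tuple(sorted(m))))[0] + 1

print(df.assign(ordered=ordered, unordered=unordered))
```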

Setup for df:

import pandas as pd

df = pd.DataFrame(
    [
        [1, 5, ['3', '6', '7']],
        [2, 7, ['1', '2']],
        [3, 4, ['34', '78', '2']],
        [4, 5, ['3', '6', '7']],
    ],
    index=[0, 1, 2, 3],
    columns=['id', 'age', 'marks'],
)
