DictVectorizer Recognize Feature as String

The list of dictionaries that I am running through DictVectorizer (scikit-learn 0.14) contains categorical values that have been encoded as integers:

> dictionary_list[0:2]

Out:

[{u'Life': 3377, u'SerumX': 1015, u'duration': 3, u'gene_name': 37},
 {u'Life': 11655, u'SerumX': 1913, u'duration': 3, u'gene_name': 1}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(dictionary_list)

For example, the genes APC, AXIN1 and BLM might be encoded as 37, 1 and 15 by some arbitrary method. That is to say, these numbers are arbitrary labels, not some NLP representation of the characters, n-grams, etc.

I am currently updating the dicts in this list to convert the value for key 'gene_name' from int to str:

for dicts in dictionary_list:
   dicts.update((k, str(v)) for k, v in dicts.iteritems() if k == 'gene_name')

> dictionary_list[0:2]

Out:

[{u'Life': 3377, u'SerumX': 1015, u'duration': 3, u'gene_name': '37'},
 {u'Life': 11655, u'SerumX': 1913, u'duration': 3, u'gene_name': '1'}]

I'm looking to optimize speed and avoid having to update each dict before passing it through DictVectorizer. Is there a way to pass my list to DictVectorizer so that it treats the value of 'gene_name' as a string and applies its built-in categorical encoding?

Many thanks to the scikit-learn team for their excellent work.

I guess you can speed things up a bit if you change your code to something like

for dct in dictionary_list:
    if 'gene_name' in dct:
        dct['gene_name'] = str(dct['gene_name'])

I don't think you can avoid coercing the values to strings, because DictVectorizer uses isinstance(value, six.string_types) as the condition for deciding which values in the provided data are categorical.
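To illustrate the difference, here is a minimal sketch that assumes a recent scikit-learn release on Python 3 (not the 0.14 from the question) and uses made-up records. Integer values end up in a single numeric column per key, while string values get one indicator column per distinct value:

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative records only; 'gene_name' holds arbitrary integer codes.
records = [
    {"duration": 3, "gene_name": 37},
    {"duration": 3, "gene_name": 1},
]

# Integer values: every key becomes a single numeric column.
vec_numeric = DictVectorizer(sparse=False)
X_numeric = vec_numeric.fit_transform(records)
numeric_columns = sorted(vec_numeric.vocabulary_)

# String values: every distinct (key, value) pair becomes its own 0/1 column.
as_strings = [dict(r, gene_name=str(r["gene_name"])) for r in records]
vec_onehot = DictVectorizer(sparse=False)
X_onehot = vec_onehot.fit_transform(as_strings)
onehot_columns = sorted(vec_onehot.vocabulary_)

print(numeric_columns)
print(onehot_columns)
```

The second vectorizer names its columns with the key, the default "=" separator and the string value, which is the built-in encoding the question is after.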

If I understand your code right, you are looping through all the keys to see if one of them is "gene_name". I'm guessing you are doing this because not every dictionary has that key.

If you did:

for dic in dictionary_list:
    if 'gene_name' in dic:
        # update() needs a dict (key: value), not a set (key, value)
        dic.update({'gene_name': str(dic['gene_name'])})

you only access the key you want to change.
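If changing the original list in place is not desirable, the same idea can be written so that it builds new dicts instead. This is only a sketch in Python 3: the helper name with_string_values is made up for illustration, and the copying makes it slower than the in-place loops, so it trades speed for leaving the input untouched.

```python
def with_string_values(records, key):
    """Return new dicts in which the value stored under `key` (if any) is a str."""
    return [
        dict(rec, **{key: str(rec[key])}) if key in rec else dict(rec)
        for rec in records
    ]


dictionary_list = [
    {"Life": 3377, "SerumX": 1015, "duration": 3, "gene_name": 37},
    {"Life": 11655, "SerumX": 1913, "duration": 3},  # no 'gene_name' key
]

converted = with_string_values(dictionary_list, "gene_name")
print(converted)
```

The resulting list can be passed to DictVectorizer in place of the original one.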
