
DictVectorizer with a large dataset

I have a large dataset with categorical values and tried to encode them using DictVectorizer. The following is a snippet of the code I tried.

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=True)

# Build a feature dict for every line and keep them all in memory at once
_dicts = []
for line in fp:
    _dict = create_dict_feature(line)
    _dicts.append(_dict)
dv.fit_transform(_dicts)

However, a MemoryError occurs at _dicts.append(_dict). I am wondering what would be an efficient way to get around this problem.

According to the docs, fit_transform can take an iterable. If the memory issue is coming from the size of the list, consider using a generator expression instead of a list, which will yield your dicts one at a time as it is iterated.

# The generator expression yields each feature dict lazily instead of
# building the whole list up front
_dicts = (create_dict_feature(line) for line in fp)
dv = DictVectorizer(sparse=True)
dv.fit_transform(_dicts)

This won't help much, though, if fit_transform accumulates the dicts or Mappings internally just as your list did before.
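If fit_transform does still exhaust memory, one possible workaround (not part of the original answer, so treat it as a hedged sketch) is scikit-learn's stateless FeatureHasher, which hashes feature names straight into column indices and therefore never stores a vocabulary or the full list of dicts. The fp and create_dict_feature names below are carried over from the question, and n_features is an assumed value you would tune for your data.

from sklearn.feature_extraction import FeatureHasher

# Stateless hashing: no vocabulary is built, so the generator is consumed
# in a single streaming pass and only the resulting sparse matrix is kept.
hasher = FeatureHasher(n_features=2**20, input_type="dict")  # assumed size
X = hasher.transform(create_dict_feature(line) for line in fp)

The trade-off is that hashed columns cannot be mapped back to feature names the way DictVectorizer's vocabulary can, and distinct features may occasionally collide into the same column.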
