
DictVectorizer with a large dataset

I have a large dataset with categorical values and tried to encode them using DictVectorizer. The following is a snippet of the code I tried.

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=True)

# Build a feature dict for every line and keep them all in memory at once
_dicts = []
for line in fp:
    _dict = create_dict_feature(line)
    _dicts.append(_dict)
dv.fit_transform(_dicts)

However, a MemoryError occurs at _dicts.append(_dict). I am wondering what would be an efficient way to get around this problem.

According to the docs, fit_transform can take an iterable. If the memory issue is coming from the size of the list, consider using a generator expression instead of a list, which will yield your dicts one at a time as it is iterated.

# The generator expression yields each feature dict lazily instead of
# building the whole list up front
_dicts = (create_dict_feature(line) for line in fp)
dv = DictVectorizer(sparse=True)
dv.fit_transform(_dicts)

This won't help much, though, if fit_transform accumulates the dicts or Mappings internally just as your list did before.
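If fit_transform does still exhaust memory, one possible workaround (not part of the original answer, so treat it as a hedged sketch) is scikit-learn's stateless FeatureHasher, which hashes feature names straight into column indices and therefore never stores a vocabulary or the full list of dicts. The fp and create_dict_feature names below are carried over from the question, and n_features is an assumed value you would tune for your data.

from sklearn.feature_extraction import FeatureHasher

# Stateless hashing: no vocabulary is built, so the generator is consumed
# in a single streaming pass and only the resulting sparse matrix is kept.
hasher = FeatureHasher(n_features=2**20, input_type="dict")  # assumed size
X = hasher.transform(create_dict_feature(line) for line in fp)

The trade-off is that hashed columns cannot be mapped back to feature names the way DictVectorizer's vocabulary can, and distinct features may occasionally collide into the same column.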
