I have a large dataset with categorical values and tried to encode them using DictVectorizer. The following is a snippet of the code I tried:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=True)
_dicts = []
for line in fp:
    _dict = create_dict_feature(line)
    _dicts.append(_dict)
dv.fit_transform(_dicts)
However, a MemoryError occurs at _dicts.append(_dict). I am wondering what would be an efficient way of getting around this problem.
According to the docs, fit_transform can take an iterable. If the memory issue is coming from the size of the list, consider using a generator expression instead of a list; it will yield your dicts one at a time as it is iterated:
_dicts = (create_dict_feature(line) for line in fp)
dv = DictVectorizer(sparse=True)
dv.fit_transform(_dicts)
This won't help much, however, if fit_transform internally accumulates the dicts or Mappings just as your list did.
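If the vectorizer itself runs out of memory, one possible workaround is scikit-learn's FeatureHasher, which is stateless (it builds no vocabulary) and can therefore consume a generator without ever materializing all the dicts. A minimal sketch, where the dict_stream helper stands in for your create_dict_feature/fp loop:

```python
from sklearn.feature_extraction import FeatureHasher

# Stateless hashing vectorizer: no vocabulary is stored, so memory use is
# bounded by the output matrix, not by the number of distinct features.
hasher = FeatureHasher(n_features=2**18, input_type="dict")

def dict_stream(lines):
    # Hypothetical stand-in for create_dict_feature(line): yields one
    # feature dict per input line, lazily.
    for line in lines:
        yield {"token=" + tok: 1 for tok in line.split()}

# transform() consumes the generator one dict at a time and returns a
# scipy sparse matrix.
X = hasher.transform(dict_stream(["red apple", "green apple"]))
print(X.shape)  # one row per input line, 2**18 hashed feature columns
```

The trade-off is that hashed features are not invertible back to names, so you lose DictVectorizer's feature_names_, but you gain true streaming behavior.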