I have used gensim.utils.simple_preprocess(str(sentence) to create a dictionary of words that I want to use for topic modelling. However, this is also filtering important numbers (house resolutions, bill no, etc) that I really need. How did I overcome this? Possibly by replacing digits with their word form. How do i go about it, though?
You don't have to use simple_preprocess()
- it's not doing much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists-of-tokens.
So, choose your own tokenization - which in some cases, depnding on your source data, could be as simple as a .split()
on whitespace.
If you want to look at what simple_preprocess()
does, as a model, you can view its Python source at:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.