简体   繁体   中英

N-grams - not in memory

I have 3 milion abstracts and I would like to extract 4-grams from them. I want to build a language model so I need to find the frequencies of these 4-grams.

My problem is that I can't extract all these 4-grams in memory. How can I implement a system that it can estimate all frequencies for these 4-grams?

Sounds like you need to store the intermediate frequency counts on disk rather than in memory. Luckily most databases can do this, and python can talk to most databases.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM