
How to find all unique words in a 10 GB file or more & enable search, using JavaScript?

The question is to implement a web service that can read a 10 GB file and store all distinct words and their occurrences. The problem needs to be solved in O(n) or better complexity. The next part of the question is to write all the client-side code to allow search on keypress. How do I approach this problem? What would you suggest as the main sub-headings? Do we need some sort of in-memory caching? Can one computer handle searching 10 GB of data? Is there an approximation I should consider for distinct words based on language (for example, in Cracking the Coding Interview I read there are about 600,000 distinct words in a language)? How do I handle the scalability of a system built this way? I really need help structuring my thoughts! Thanks in advance!

You shouldn't be using JavaScript for this. Pretty much any language will have better performance.

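But, setting that aside, let's answer the question. What you'll want to do is create a Set and iterate through all the words. Given the size of the data, you'll want to read it in chunks rather than all at once, either by splitting the file beforehand or by streaming it at read time. Since you also need the occurrences, a Map from word to count works the same way as a Set, and a single pass with average constant-time insertions keeps the whole thing O(n).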

Just adding each word to the Set every time will suffice, as a Set only contains unique elements.
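As a rough sketch of the chunked approach, something like the following could work. The helper name `countWords` is made up for illustration, and it uses a Map rather than a Set so that occurrences are kept alongside the distinct words. It assumes words are separated by whitespace and that case should be ignored; punctuation handling is left out.

```javascript
// Hypothetical helper: counts words from any iterable of text chunks.
// Assumption: words are whitespace-separated and compared case-insensitively.
function countWords(chunks) {
  const counts = new Map();
  let carry = ''; // partial word left over at the end of the previous chunk

  const add = (word) => {
    if (word) counts.set(word, (counts.get(word) || 0) + 1);
  };

  for (const chunk of chunks) {
    const parts = (carry + chunk).toLowerCase().split(/\s+/);
    carry = parts.pop(); // the last piece may be cut off mid-word
    parts.forEach(add);
  }
  add(carry); // flush the final word
  return counts;
}

// Small in-memory stand-in for the chunks of a large file.
const counts = countWords(['the quick bro', 'wn fox jumps over the la', 'zy dog']);
console.log(counts.size);       // number of distinct words
console.log(counts.get('the')); // occurrences of "the"
```

The `carry` variable is there because a chunk boundary can fall in the middle of a word. In Node, the same loop body could be fed from `fs.createReadStream(path, { encoding: 'utf8' })` with `for await`, so only one chunk plus the Map is in memory at a time.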

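Alternatively, if you have well over 10 GB of RAM, you could put all the words into an array and pass it to the Set constructor. Then you'll be able to read the unique values. It'll take quite a while, though, and the JavaScript engine's own heap and string-size limits may get in the way before physical memory does.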
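For illustration, here is what the array-to-Set step looks like on a tiny input. JavaScript has no cast for this; the array is passed to the `Set` constructor. The sample sentence is made up.

```javascript
// Build a Set from an array; duplicates are dropped automatically.
const words = 'to be or not to be'.split(' ');
const unique = new Set(words);

console.log(unique.size);  // 4
console.log([...unique]);  // [ 'to', 'be', 'or', 'not' ] (insertion order)
```

Note that this only gives the distinct words. The counts are lost, so the streaming Map approach is the better fit if occurrences are required.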

