Fastest way to remove a list of words from a file

I am trying to think of a fast and efficient way to delete a list of stop words from a file. Unfortunately, I can't figure out a good way to do this.

The only method I have is to compare each word of the file against an array of stop words. Comparing every word to every entry in the array would be very slow, especially since the file is 31 MB, and that is the smallest of the seven files I need to process.

Considering the size, every nanosecond counts, so if anyone has any suggestions I would greatly appreciate it.

EDIT: To give you a better idea of the files: I am sorting Stack Overflow questions from 2008 to now, so essentially anything is possible. I am creating a search engine, but step one on that long, long path is getting rid of words in questions that have no bearing or importance, such as "the", "a", etc. Then I have to add the words that are left to an AVL tree and catalog their locations (how I do that is up to me). For example, if someone looks for c++, I can go to the tree, find the node for c++, and that node records that C++ shows up on line 2003 of 2009.txt and line 101 of 2012.txt. I hope the extra detail and final goal help clear things up.
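To make the end goal concrete, here is a minimal sketch of the word-to-location catalog described above. It deliberately uses a Python dict as a stand-in for the AVL tree, and the tokenizer pattern, the tiny stop-word set, and the `index_lines` helper are all assumptions made for illustration, not part of the original question.

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of letters, digits, '+', '#' and apostrophes,
# so that "c++" and "c#" survive as single words.
WORD_RE = re.compile(r"[A-Za-z0-9+#']+")

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "do", "i", "how"})


def index_lines(filename, lines, index, stop_words=STOP_WORDS):
    """Record (filename, line number) for every non-stop word in `lines`."""
    for line_no, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line.lower()):
            if word not in stop_words:
                index[word].append((filename, line_no))


# The dict plays the role of the AVL tree: key = word, value = locations.
index = defaultdict(list)
index_lines("2009.txt", ["How do I learn C++ in a week?"], index)
index_lines("2012.txt", ["Templates in C++"], index)
```

After this runs, looking up `index["c++"]` gives the list of (file, line) pairs where the word occurred. A word that appears twice on one line is recorded twice in this sketch.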

Try the following approach:

  1. Put the stop words into a hash table, so that checking whether a word is in the list takes constant time on average, i.e. O(1): https://en.wikipedia.org/wiki/Hash_table

  2. Memory-map the file: https://en.wikipedia.org/wiki/Memory-mapped_file

  3. In a loop, get the next word and check whether it is in the hash table.

  4. If the word is in the hash table, save its byte range to a list of ranges to delete.

  5. After all the words have been checked, walk through the list of ranges to delete and "compact" the file by moving the text you keep over the deleted ranges.

  6. Truncate the file to its new length and close the memory-mapped file so the results are written to disk.
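The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `mmap` and `re` modules, and the word pattern, the tiny stop-word set, and the function names `find_stop_word_ranges` and `remove_stop_words` are hypothetical choices, not something the answer specifies.

```python
import mmap
import os
import re

# Step 1: a hash-based set gives O(1) average membership tests.
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({b"the", b"a", b"an", b"of", b"and"})

# Assumed definition of a "word": letters, digits, '+', '#', apostrophes.
WORD_RE = re.compile(rb"[A-Za-z0-9+#']+")


def find_stop_word_ranges(buf, stop_words):
    """Steps 3-4: scan the words and collect (start, end) byte ranges to delete."""
    return [(m.start(), m.end())
            for m in WORD_RE.finditer(buf)
            if m.group().lower() in stop_words]


def remove_stop_words(path, stop_words=STOP_WORDS):
    """Delete stop words from the file in place and return its new size."""
    size = os.path.getsize(path)
    if size == 0:  # an empty file cannot be memory-mapped
        return 0
    with open(path, "r+b") as f:
        # Step 2: map the whole file into memory.
        with mmap.mmap(f.fileno(), 0) as mm:
            ranges = find_stop_word_ranges(mm, stop_words)
            # Step 5: compact by sliding each kept chunk down over the gaps.
            write = read = 0
            for start, end in ranges:
                keep = start - read
                if keep:
                    mm.move(write, read, keep)
                write += keep
                read = end
            tail = size - read
            if tail:
                mm.move(write, read, tail)
            write += tail
            mm.flush()
        # Step 6: trim the file to the compacted length after unmapping it.
        f.truncate(write)
    return write
```

Calling `remove_stop_words("2009.txt")` would rewrite that file in place. Note that this sketch removes only the words themselves and leaves the surrounding whitespace untouched, so runs of spaces remain where stop words used to be. Only whole tokens are matched, so "a" is removed but "apples" is not.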
