
Deduplicating and indexing a huge set of Strings in Java

I have a huge dataset of UTF-8 strings to process, and I need to eliminate duplicates in order to end up with a unique set of strings.

I'm using a HashSet to check whether a string is already known, but now that I've reached 100,000,000 strings I no longer have enough RAM and the process crashes. Moreover, I have only processed 1% of the dataset, so an in-memory solution is impossible.

What I would like is a hybrid solution, something like an "in-memory index" backed by "disk-based storage", so that I can use the 10 GB of RAM I have to speed up the process.

=> Do you know of a Java library that already does this? If not, which algorithm should I look into?

Using a Bloom filter in memory to check whether a string is not yet present could be a solution, but I would still have to check the disk sometimes (false positives), and I would like to hear about other solutions.
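For illustration, here is a minimal sketch of that idea using Guava's BloomFilter (which lives in com.google.common.hash, not in the collections I ruled out below); the sizing numbers and the disk-store methods are just placeholders:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDedup {
    // ~100,000,000 expected entries at a 1% false-positive rate costs roughly 120 MB of heap.
    private final BloomFilter<byte[]> seen =
            BloomFilter.create(Funnels.byteArrayFunnel(), 100_000_000L, 0.01);

    public void offer(byte[] key) {
        if (!seen.mightContain(key)) {
            // Definitely never seen before: no disk lookup needed.
            seen.put(key);
            writeToDiskStore(key);
        } else if (!diskStoreContains(key)) {
            // "Maybe seen" turned out to be a false positive: fall back to the disk store.
            writeToDiskStore(key);
        }
        // Otherwise it is a genuine duplicate and is skipped.
    }

    // Placeholders for whatever disk-backed structure ends up being used.
    private void writeToDiskStore(byte[] key) { /* ... */ }
    private boolean diskStoreContains(byte[] key) { /* ... */ return false; }
}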

=> How should I store the strings on disk to get fast read and write access?

_ I don't want to use an external service like a NoSQL DB or MySQL; it must be embedded.

_ I already tried file-based lightweight SQL DBs like H2 or HSQLDB, but they are very bad at handling massive datasets.

_ I don't consider Trove/Guava collections a solution (unless they offer a disk-based option I'm not aware of); I'm already using an extremely memory-efficient custom hash set, and I don't even store String objects but byte[] in memory. I have already tweaked the -Xmx settings for the JVM.

EDIT: The dataset I'm processing is huge; the raw unsorted dataset doesn't even fit on my hard disk. I'm streaming it byte by byte and processing it.

What you could do is use an external sorting technique such as external merge sort, in which you sort your data first.

Once that is done, you can iterate through the sorted data while keeping track of the last element you emitted. For each item, you compare it with the previous one: if they are the same, you skip it and move on to the next item; if not, you output it and update the element you are tracking.

To avoid huge memory consumption, you can dump your list of unique items to the hard drive whenever a particular threshold is reached and keep going.

Long story short:

Let data be the data set you need to work with
Let sorted_data = External_Merge_Sort(data)
Let last_data = nothing
Let unique_items be the buffer of unique items you want to yield
foreach element e in sorted_data
{
    if (e != last_data)
    {
        last_data = e
        add e to unique_items
        if (size(unique_items) == threshold)
        {
            dump_to_drive(unique_items)
            clear unique_items
        }
    }
}
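As a rough illustration of the pseudocode above, here is a minimal self-contained Java sketch. It assumes the strings are newline-delimited text, and the chunk size is just a guess you would tune to your heap:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalSortDedup {

    // Phase 1: read the input in chunks that fit in RAM, sort each chunk, spill it to a temp file.
    static List<Path> sortChunks(Path input, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(maxLinesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks. Because the merge emits lines in global
    // order, dropping duplicates is just "skip any line equal to the last one written".
    static void mergeAndDedup(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkReader> heap = new PriorityQueue<>();
        for (Path p : chunks) {
            ChunkReader r = new ChunkReader(p);
            if (r.current != null) heap.add(r);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String last = null;
            while (!heap.isEmpty()) {
                ChunkReader r = heap.poll();
                if (!r.current.equals(last)) {   // a new unique value
                    out.write(r.current);
                    out.newLine();
                    last = r.current;
                }
                if (r.advance()) heap.add(r);
            }
        }
    }

    // One open sorted chunk, ordered by its current line.
    static final class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader reader;
        String current;

        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            if (current == null) reader.close();
            return current != null;
        }

        public int compareTo(ChunkReader other) {
            return current.compareTo(other.current);
        }
    }

    public static void main(String[] args) throws IOException {
        // The chunk size is a placeholder: pick whatever actually fits in your heap.
        List<Path> chunks = sortChunks(Paths.get(args[0]), 5_000_000);
        mergeAndDedup(chunks, Paths.get(args[1]));
    }
}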

What is the total data size you have? If it is not in terabytes, and supposing you can use, say, 10 machines, I would suggest an external cache like memcached (spymemcached is a good Java client for memcached).

Install memcached on the 10 nodes. The spymemcached client should be initialized with the list of memcached servers so that they act as one virtual cluster for our program.

for each string you read:
    check if it is already in memcache
    if it is in memcache:
        skip it and continue with the next string
    else:
        add it to memcache
        add it to the list of strings to be flushed to disk
    if the size of the list of strings to be flushed > a certain threshold:
        flush them to disk
flush any remaining strings to disk
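A minimal sketch of this loop with spymemcached; the node addresses are hypothetical, and because memcached keys are limited to 250 bytes and may not contain whitespace, the sketch keys the cache on a SHA-256 digest of each string rather than the string itself:

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MemcachedDedup {
    private static final int FLUSH_THRESHOLD = 100_000;

    public static void main(String[] args) throws Exception {
        // Hypothetical node list: spymemcached hashes every key onto one of the
        // nodes, so the machines act as a single virtual cache for this program.
        MemcachedClient cache = new MemcachedClient(
                AddrUtil.getAddresses("node1:11211 node2:11211 node3:11211"));

        List<String> pending = new ArrayList<>();

        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("unique.txt"), StandardCharsets.UTF_8)) {
            for (String s : readStrings()) {            // however you stream your input
                // add() only stores the key if it is not already present, so a single
                // round trip both checks membership and records the string.
                String key = sha256Hex(s);
                boolean isNew = cache.add(key, 0, Boolean.TRUE).get();
                if (isNew) {
                    pending.add(s);
                    if (pending.size() >= FLUSH_THRESHOLD) {
                        flush(out, pending);
                    }
                }
            }
            flush(out, pending);                        // flush whatever is left
        } finally {
            cache.shutdown();
        }
    }

    private static void flush(BufferedWriter out, List<String> pending) throws Exception {
        for (String s : pending) {
            out.write(s);
            out.newLine();
        }
        pending.clear();
    }

    private static String sha256Hex(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    private static Iterable<String> readStrings() {
        return Collections.emptyList();   // placeholder for the real byte-by-byte input stream
    }
}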

Another approach is to use some kind of map-reduce :) without Hadoop :)

Deduplicate the first 2 GB of strings and write the de-duplicated output to an intermediate file.
Repeat the step above with the next 2 GB of strings, and so on.
Now apply the same method to the intermediate de-duplicated files.
When the total size of the intermediate de-duplicated data is small enough, use memcached or an in-memory HashMap to produce the final output.
This approach doesn't involve sorting and hence may be efficient; a sketch in Java follows below.
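A minimal sketch of this chunked approach, assuming newline-delimited input and a per-chunk budget you would tune to the available heap:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ChunkedDedup {
    // Rough per-chunk budget; tune it to what actually fits in your heap.
    private static final int MAX_STRINGS_PER_CHUNK = 5_000_000;

    // Pass 1: de-duplicate the input chunk by chunk, writing each chunk's
    // unique strings to its own intermediate file.
    static List<Path> dedupInChunks(BufferedReader in) throws IOException {
        List<Path> intermediates = new ArrayList<>();
        Set<String> chunk = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == MAX_STRINGS_PER_CHUNK) {
                intermediates.add(writeChunk(chunk));
                chunk = new HashSet<>();
            }
        }
        if (!chunk.isEmpty()) intermediates.add(writeChunk(chunk));
        return intermediates;
    }

    static Path writeChunk(Set<String> chunk) throws IOException {
        Path file = Files.createTempFile("dedup-chunk", ".txt");
        Files.write(file, chunk, StandardCharsets.UTF_8);
        return file;
    }

    // Pass 2: once the intermediate files together are small enough to fit in RAM,
    // merge them with a single in-memory set. (If they are still too big, repeat
    // pass 1 over them, or fall back to the external-sort approach above.)
    static void mergeIntermediates(List<Path> intermediates, Path output) throws IOException {
        Set<String> unique = new HashSet<>();
        for (Path p : intermediates) {
            unique.addAll(Files.readAllLines(p, StandardCharsets.UTF_8));
        }
        Files.write(output, unique, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            List<Path> intermediates = dedupInChunks(in);
            mergeIntermediates(intermediates, Paths.get(args[1]));
        }
    }
}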
