
Not sure which data structure to use

Assuming I have the following text:

today was a good day and today was a sunny day. 

I break this text up into words, separated by whitespace, one per line:

Today

was

a

good

etc.

Now I use the vector data structure to simply count the number of words in a text via .size(). That's done.

However, I also want to check whether a word comes up more than once and, if so, how many times. In my example "today" comes up 2 times.

I want to store that "today" and attach a count of 2 (or x, depending on how often it comes up in a large text). That's not just for "today" but for every word in the text. I want to look up how often each word appears, attach a counter, and sort the result (the words plus their counters) in descending order (that's another thing, but not important right now).

I'm not sure which data structure to use here. A map, perhaps? But I can't add counters to a map.

Edit: This is what I've done so far: http://pastebin.com/JncR4kw9

You should use a map. In fact, you should use an unordered_map.

unordered_map<string,int> will give you a hash table which uses strings as keys, and you can increment the integer to keep count.

unordered_map has the advantage of average-case O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is a hash table (an array of buckets), whereas the latter is a balanced binary search tree (typically a red-black tree).

The only disadvantage of an unordered_map is that, as its name suggests, you can't iterate over the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, so it shouldn't be an issue.

unordered_map<string,int> mymap;
mymap[word]++; // increments the count for word; a missing key is first inserted with the value 0
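
To make the idea above concrete, here is a minimal sketch of counting every word in a text. The function name count_words is made up for illustration, and it assumes words are separated by whitespace only (punctuation is not stripped, so "day." and "day" would be counted separately).

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper (not from the original post): counts how often each
// whitespace-separated word occurs in `text`.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        // operator[] inserts the key with value 0 if it is missing,
        // so the first occurrence of a word ends up with a count of 1.
        ++counts[word];
    }
    return counts;
}
```

Calling count_words("today was a good day and today was a sunny day") should give a count of 2 for "today" and 1 for "good"; the size of the returned map is the number of distinct words.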

Why not use two data structures? The vector you have now, and a map using the string as the key and an integer as the data, which will then be the number of times the word was found in the text.
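
A rough sketch of that combination, which also covers the "sort in descending order" part of the question. The name by_frequency is hypothetical, and the tie-breaking rule (alphabetical order among words with the same count) is my own assumption, not something the question asks for.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: takes the vector of words and returns
// (word, count) pairs with the most frequent words first.
std::vector<std::pair<std::string, int>> by_frequency(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (const std::string& w : words) {
        ++counts[w];
    }
    // std::map iterates in key order, so this copy starts out sorted by word.
    std::vector<std::pair<std::string, int>> result(counts.begin(), counts.end());
    // stable_sort keeps that alphabetical order among words with equal counts.
    std::stable_sort(result.begin(), result.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;
                     });
    return result;
}
```

A map cannot be sorted by its values directly, which is why the pairs are copied into a second vector before sorting by count.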

Sort the vector in alphabetical order. Scan it and compare every word to those that follow until you find a different one, and so on.

a, a, and, day, day, good, sunny, today, today, was, was
2     1    2         1     1      2             2
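
The sort-and-scan idea could look roughly like this. count_runs is a made-up name, and the function works on a copy of the vector so the original word order is left untouched.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sorts a copy of the words, then counts each run
// of equal neighbours. Returns (word, count) pairs in alphabetical order.
std::vector<std::pair<std::string, int>> count_runs(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, int>> runs;
    std::size_t i = 0;
    while (i < words.size()) {
        std::size_t j = i;
        while (j < words.size() && words[j] == words[i]) {
            ++j;  // advance until a different word (or the end) is found
        }
        runs.emplace_back(words[i], static_cast<int>(j - i));
        i = j;  // continue with the first word of the next run
    }
    return runs;
}
```

This needs no second container while counting, but the sort costs O(n log n) string comparisons.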

Another option to consider is a radix tree (https://en.wikipedia.org/wiki/Radix_tree), which can be quite memory efficient because words that share a prefix share nodes. For large text inputs it may perform better than the alternatives.

One can store the frequency of each word in the node of the tree where that word ends. For text documents it may also benefit from locality of reference.
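
To illustrate storing the frequency in the nodes, here is a small sketch. Note that it is a plain (uncompressed) trie with one node per character, not a true radix tree; a radix tree would additionally merge chains of single-child nodes into one edge. The names TrieNode and WordTrie are invented for this example, and std::make_unique requires C++14.

```cpp
#include <map>
#include <memory>
#include <string>

// Plain trie node (assumption: one node per character, no path compression).
struct TrieNode {
    int count = 0;  // how many times the word ending at this node was inserted
    std::map<char, std::unique_ptr<TrieNode>> children;
};

class WordTrie {
public:
    // Walks or creates one node per character and bumps the count at the end.
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) {
                child = std::make_unique<TrieNode>();
            }
            node = child.get();
        }
        ++node->count;
    }

    // Returns 0 for words that were never inserted, including pure prefixes.
    int count(const std::string& word) const {
        const TrieNode* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) {
                return 0;
            }
            node = it->second.get();
        }
        return node->count;
    }

private:
    TrieNode root_;
};
```

After inserting every word of the example sentence, count("today") would return 2, while count("to") returns 0 because "to" is only a prefix of an inserted word.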


 