
Which data structure should I use to search for a string in a CSV file?

I have a CSV file with nearly 200,000 rows containing two columns: name and job. The user inputs a name, say user_name, and I have to search the entire CSV for names that contain the pattern user_name and print the matches to the screen. I have implemented this with an ArrayList in Java: I load all the names from the CSV into the ArrayList and then search it for the pattern. In that case the overall time complexity of the search is O(n). Is there another data structure in Java that can perform the search in O(log n), or anything more efficient than an ArrayList? I can't use any database approach, by the way. Also, if there is a good data structure in another language that would accomplish my goal, please suggest that as well.

Edit: The output should be the names in the CSV that contain the pattern user_name as the last part (a suffix match). E.g.: if my input is "son", then it should return "jackson", etc. What I have done so far is read the name column of the CSV into a String ArrayList, then check each element of the ArrayList with a regular expression (Java's Pattern/Matcher) to see whether the element ends with user_name. If it does, I print it. If I implement this in a multi-threaded environment, will it increase the scalability and performance of my program?
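For reference, my current approach looks roughly like this (simplified, with the CSV reading omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class SuffixScan {
        public static void main(String[] args) {
            // Names already read from the CSV's name column (loading code omitted).
            List<String> names = new ArrayList<>();
            names.add("jackson");
            names.add("mary");

            String userName = "son";
            // Quote the input so regex metacharacters are treated literally,
            // and anchor it to the end of the string for a suffix match.
            Pattern p = Pattern.compile(".*" + Pattern.quote(userName) + "$");

            for (String name : names) {
                if (p.matcher(name).matches()) {
                    System.out.println(name);  // prints jackson
                }
            }
        }
    }

(A plain name.endsWith(userName) check would also work here without the regex machinery.)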

You can use:

  • TreeMap, a sorted map backed by a red-black tree.
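A TreeMap gives O(log n) exact-key and prefix lookups, but it cannot search suffixes directly. One way to adapt it to the suffix match described in the question (my own addition, not something this answer spells out) is to key the map on reversed names, which turns a suffix query into a prefix range query. A minimal sketch:

    import java.util.Map;
    import java.util.TreeMap;

    public class ReversedNameIndex {
        // Reverse a string: "jackson" -> "noskcaj".
        static String reverse(String s) {
            return new StringBuilder(s).reverse().toString();
        }

        public static void main(String[] args) {
            // Key: reversed name; value: original name. Built once in O(n log n).
            // NOTE: duplicate names would overwrite each other here; a real index
            // would map each key to a List of rows.
            TreeMap<String, String> index = new TreeMap<>();
            for (String name : new String[] {"jackson", "mason", "mary"}) {
                index.put(reverse(name), name);
            }

            // A suffix query on names is a prefix query on reversed names.
            String suffix = "son";
            String from = reverse(suffix);              // "nos"
            String to = from + Character.MAX_VALUE;     // upper bound of the prefix range

            // subMap returns the matching range in O(log n) plus output size.
            for (Map.Entry<String, String> e : index.subMap(from, true, to, true).entrySet()) {
                System.out.println(e.getValue());       // prints mason, jackson
            }
        }
    }

Building the index costs O(n log n) once; each query then costs O(log n) plus the number of matches, which only pays off if the process stays alive to serve many queries.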

If you are unable to use a commercial database, then you are going to have to write code to mimic some of a database's functionality.

To search the entire dataset sequentially in O(n) time, you just read it and search each line. If you write a program that loads the data into an in-memory Map, you could search the Map in amortized O(1) time, but you'd still be loading it into memory on each run, which is an O(n) operation, so you gain nothing.
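To make that concrete, here is a minimal sketch of the load-then-lookup approach (the file name and the name,job column layout are assumptions):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class MapLookup {
        public static void main(String[] args) throws IOException {
            Map<String, String> nameToJob = new HashMap<>();
            try (BufferedReader r = Files.newBufferedReader(Paths.get("people.csv"))) {
                String line;
                while ((line = r.readLine()) != null) {   // O(n): every row is read once
                    String[] cols = line.split(",", 2);
                    if (cols.length == 2) {
                        nameToJob.put(cols[0], cols[1]);
                    }
                }
            }
            // After the O(n) load, an exact lookup is amortized O(1)...
            System.out.println(nameToJob.get("jackson"));
            // ...but that only helps if the map stays in memory across many queries,
            // and a plain HashMap cannot answer the pattern/suffix search anyway.
        }
    }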

So the next approach is to build a disk-based index of some kind that you can search efficiently without reading the entire file, and then use the index to tell you where the record you want is located. This would be O(log n), but now you are taking on significant complexity: building, maintaining, and managing the disk-based index. This is what database systems are optimized to do.

If you had 200 MILLION rows, then the only feasible solution would be to use a database. For 200 THOUSAND rows, my recommendation is to just scan the file each time (i.e., use grep, or if that's not available, write a simple program to do something similar).
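Such a program is short to write. A minimal sketch of a grep-like scan (file name and column layout assumed), using a suffix pattern to match the question's requirement:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    public class CsvGrep {
        public static void main(String[] args) throws IOException {
            String userName = args.length > 0 ? args[0] : "son";
            // Suffix match on the name column; Pattern.quote treats the input literally.
            Pattern p = Pattern.compile(Pattern.quote(userName) + "$");

            // Stream the file line by line: O(n) per query, O(1) extra memory.
            try (BufferedReader r = Files.newBufferedReader(Paths.get("people.csv"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String name = line.split(",", 2)[0];  // assumed "name,job" layout
                    if (p.matcher(name).find()) {
                        System.out.println(name);
                    }
                }
            }
        }
    }

For 200,000 rows that is only a few megabytes of I/O, so a single scan typically finishes in well under a second.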

BTW, if your allusion to finding a "pattern" means you need to search for an arbitrary regular expression, then you MUST scan the entire file every time, since without knowing the pattern in advance you cannot build an index for it.

In summary: use grep
