
Importing a Large Amount of Data and Searching Efficiently

I'm currently writing a program that takes in two CSVs - one containing database keys (and other information irrelevant to the current issue), the other being an asset manifest. The program checks the database key from the first CSV, queries an online database to retrieve the asset key, then gets the asset status from the second CSV. (This is a workaround for a stupid API issue.)

My problem is that while the CSV being iterated over is relatively short - usually only about 300 lines long - the other is an asset manifest that is easily 10,000 lines long (and sorted, though not by the key I can obtain from the first CSV). I obviously don't want to iterate over the entire asset manifest for every single input line, since that will take roughly 10 eternities.

I'm a fairly inexperienced programmer, so I only know of sorting/searching algorithms, and I definitely don't know which one to use for this. What algorithm would be the most efficient? Is there a way to "batch-query" the manifest for all of the assets listed in the input CSV that would be faster than searching the manifest individually for each key? Or should I use a tree or hashtable or something else I heard mentioned in other SE threads? I don't know anything about the performance implications of any of these.

I can format the manifest as needed when it's input (it's just copy-pasted into a GUI), so I guess I could iterate over the entire manifest when it's input, make a hashtable of key:line pairs, and then search that? Or I could turn it into a 2D array and just search the specified index? Those are all I can think of.

Problem is, I don't know how much time computer operations like that take, or whether that would just waste time or actually improve performance.

P.S. I'm using Java for this currently since it's all I know, but if another language would be faster then I'm all ears.

A simple solution is to create a HashMap: iterate through one of the files and add each line to the HashMap (with the corresponding key and value), then iterate through the other file and check whether the HashMap contains each key. If it does, add the data to a second HashMap, and return that second HashMap once the iteration finishes. Building the map takes a single pass over the file and each lookup takes constant time on average, so with a 10,000-line manifest and 300 input lines the work is roughly 10,000 insertions plus 300 lookups, rather than 300 x 10,000 comparisons.

Imagine we have a test1.csv file with the columns key,name,family, as below:

5000,ehsan,tashkhisi
2,ali,lllll
3,amel,lllll
1,azio,skkk

And a test2.csv file with the columns key,status, as below:

1000,status1
1,status2
5000,status3
4000,status4
4001,status1
4002,status3
5,status1

We want output like this:

1 -> status2
5000 -> status3

Simple code looks like this:

Java 8 Stream:

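// Imports needed: java.io.IOException, java.nio.file.Files, java.nio.file.Paths,
// java.util.Map, java.util.stream.Collectors, java.util.stream.Stream
private static Map<String, String> findDataInTwoFilesJava8() throws IOException {
    // Files.lines keeps the file open, so both streams are closed via try-with-resources.
    try (Stream<String> manifest = Files.lines(Paths.get("/tmp/test2.csv"));
         Stream<String> input = Files.lines(Paths.get("/tmp/test1.csv"))) {
        // Index test2.csv by key; like Map.put, a later duplicate key overwrites an earlier one.
        Map<String, String> map = manifest.map(a -> a.split(","))
                .collect(Collectors.toMap(a -> a[0], a -> a[1], (oldV, newV) -> newV));
        // Keep only the keys from test1.csv that exist in the index.
        return input.map(a -> a.split(","))
                .filter(a -> map.containsKey(a[0]))
                .collect(Collectors.toMap(a -> a[0], a -> map.get(a[0]), (oldV, newV) -> newV));
    }
}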

Simple Java:

// Imports needed: java.io.BufferedReader, java.io.FileReader, java.io.IOException,
// java.util.HashMap, java.util.Map
private static Map<String, String> findDataInTwoFiles() throws IOException {
    String line;
    // Index test2.csv (key,status) by key.
    Map<String, String> map = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new FileReader("/tmp/test2.csv"))) {
        while ((line = br.readLine()) != null) {
            String[] lineData = line.split(",");
            map.put(lineData[0], lineData[1]);
        }
    }
    // Look up each key from test1.csv: one hash lookup per input line.
    Map<String, String> resultMap = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new FileReader("/tmp/test1.csv"))) {
        while ((line = br.readLine()) != null) {
            String key = line.split(",")[0];
            if (map.containsKey(key)) {
                resultMap.put(key, map.get(key));
            }
        }
    }
    return resultMap;
}
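
Since the question mentions that the manifest is pasted into a GUI rather than read from a file, the same idea can be sketched against a String. This is only an illustration under some assumptions: the class and method names (ManifestIndex, indexManifest, lookupStatuses) and the keyColumn/statusColumn parameters are made up for this example, and the rows are assumed to be plain comma-separated values with no quoted commas.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are not from the original post.
class ManifestIndex {

    /**
     * Builds a key -> row index from manifest text pasted as one string.
     * keyColumn is the assumed zero-based position of the asset key.
     */
    static Map<String, String[]> indexManifest(String manifestText, int keyColumn) {
        Map<String, String[]> index = new HashMap<>();
        for (String line : manifestText.split("\\R")) { // \R matches any line break
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            if (fields.length > keyColumn) {
                // On a duplicate key the first row wins (an arbitrary choice).
                index.putIfAbsent(fields[keyColumn].trim(), fields);
            }
        }
        return index;
    }

    /** Looks up each key once; keys missing from the manifest are omitted. */
    static Map<String, String> lookupStatuses(List<String> assetKeys,
                                              Map<String, String[]> index,
                                              int statusColumn) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps input order
        for (String key : assetKeys) {
            String[] row = index.get(key);
            if (row != null && row.length > statusColumn) {
                result.put(key, row[statusColumn].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String manifest = "1000,status1\n1,status2\n5000,status3\n4000,status4";
        Map<String, String[]> index = indexManifest(manifest, 0);
        // "2" is not in the manifest, so it is left out of the result.
        System.out.println(lookupStatuses(Arrays.asList("5000", "2", "1"), index, 1));
        // prints {5000=status3, 1=status2}
    }
}
```

Storing the whole row as the value (the "key:line pairs" idea from the question) means any other column can be read later without scanning the manifest again. The index is built once when the manifest is pasted and can then be reused for every input line.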
