Best/most efficient way to search a large CSV in Java

Question

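I have a largish CSV file with 1.5K entries. Each entry represents a world city with a name, latitude and longitude. What is the best and fastest way to search the CSV in Java? I thought of populating an ArrayList with all the entries, but I assume that is slow (unless I am wrong). No, this file will not grow in size; it is almost 100KB. I want to be able to type in a city name and have the search results update, but I can figure that part out myself.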

Answer 1

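Reading a file of about 1 MB with 1.5K entries should take tens of milliseconds. A 1 GB file might take tens of seconds, so for a file that size it is worth saving an index so you don't have to re-read it every time.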

You can load the entries into a Map to index them by name.
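A minimal sketch of the name index described above, assuming each line has the form `name,lat,lon` with no header row and no quoted commas. The `City` and `CityNameIndex` types are hypothetical names introduced only for illustration. Because several world cities share a name, the sketch maps each lower-cased name to a list of matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CityNameIndex {
    // Hypothetical row type for one CSV entry.
    static final class City {
        final String name;
        final double lat;
        final double lon;

        City(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Builds a lower-cased name -> cities index from "name,lat,lon" lines.
    // Assumes no header row and no quoted commas inside fields.
    static Map<String, List<City>> index(List<String> lines) {
        Map<String, List<City>> byName = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f.length < 3) {
                continue; // skip blank or malformed lines
            }
            City c = new City(f[0].trim(),
                    Double.parseDouble(f[1].trim()),
                    Double.parseDouble(f[2].trim()));
            byName.computeIfAbsent(c.name.toLowerCase(Locale.ROOT), k -> new ArrayList<>()).add(c);
        }
        return byName;
    }

    public static void main(String[] args) {
        Map<String, List<City>> byName =
                index(Arrays.asList("Paris,48.8566,2.3522", "Tokyo,35.6762,139.6503"));
        System.out.println(byName.get("paris").get(0).lat);
    }
}
```

For a real file, the lines could come from `Files.readAllLines(path)`. Note that a hash map only answers exact-name lookups; substring or as-you-type matching needs a scan or a sorted structure.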

You can add a latitude/longitude index using a NavigableMap. This will speed up lookups by location.
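One possible reading of the NavigableMap suggestion, sketched under assumptions: cities are keyed by latitude in a `TreeMap`, a range query narrows the candidates to a latitude band, and longitude is then filtered linearly. The `CityLatitudeIndex` and `City` names are hypothetical, and the sketch does not handle boxes that wrap across the ±180° meridian.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class CityLatitudeIndex {
    // Hypothetical row type for one CSV entry.
    static final class City {
        final String name;
        final double lat;
        final double lon;

        City(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Sorted by latitude; several cities may share the same latitude.
    private final NavigableMap<Double, List<City>> byLat = new TreeMap<>();

    void add(City c) {
        byLat.computeIfAbsent(c.lat, k -> new ArrayList<>()).add(c);
    }

    // Returns cities inside the box, ordered by ascending latitude.
    // Requires minLat <= maxLat; subMap throws IllegalArgumentException otherwise.
    List<City> inBox(double minLat, double maxLat, double minLon, double maxLon) {
        List<City> hits = new ArrayList<>();
        for (List<City> sameLat : byLat.subMap(minLat, true, maxLat, true).values()) {
            for (City c : sameLat) {
                if (c.lon >= minLon && c.lon <= maxLon) {
                    hits.add(c);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CityLatitudeIndex index = new CityLatitudeIndex();
        index.add(new City("Paris", 48.8566, 2.3522));
        index.add(new City("Tokyo", 35.6762, 139.6503));
        for (City c : index.inBox(45.0, 55.0, -5.0, 5.0)) {
            System.out.println(c.name);
        }
    }
}
```

The tree only prunes on latitude, so a wide latitude band still scans many entries. With a file this small that is unlikely to matter.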

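Loading the file once takes a little time, but reading it from disk on every search is much slower.

By the way, you could have 100 TB of data and trillions of rows; to work with that much data in Java you would have to get creative.

In short, if the file is much smaller than your memory, it is a relatively small file.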

Answer 2

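1.5K lines of city name, latitude and longitude is not a large file; it is a very small file. Reading it will not matter as long as you don't do anything completely unreasonable, such as reading it one byte at a time with unbuffered I/O.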

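So what I would do is go ahead and read the file once, construct row objects, and add them to an ArrayList. This is probably fast enough that you could discard the list after each search and reload it every time you want to search. Or, if you don't mind using a little memory, you can of course keep it around.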
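The approach above could be sketched roughly as follows, assuming `name,lat,lon` lines without quoted commas. `CityList`, `City`, `load` and `search` are hypothetical names for illustration. The reader is wrapped in a `BufferedReader` so the file is not read one byte at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CityList {
    // Hypothetical row object for one CSV entry.
    static final class City {
        final String name;
        final double lat;
        final double lon;

        City(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Reads every line once through a buffered reader and builds the list.
    static List<City> load(Reader source) {
        List<City> cities = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                if (f.length < 3) {
                    continue; // skip blank or malformed lines
                }
                cities.add(new City(f[0].trim(),
                        Double.parseDouble(f[1].trim()),
                        Double.parseDouble(f[2].trim())));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cities;
    }

    // Linear, case-insensitive substring search over the in-memory list.
    static List<City> search(List<City> cities, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        List<City> hits = new ArrayList<>();
        for (City c : cities) {
            if (c.name.toLowerCase(Locale.ROOT).contains(q)) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<City> cities = load(new StringReader("Paris,48.8566,2.3522\nTokyo,35.6762,139.6503\n"));
        System.out.println(search(cities, "par").size());
    }
}
```

For a file on disk, the `Reader` could come from `Files.newBufferedReader(path)`. A linear scan over a list of this size runs on every keystroke without needing any index.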

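But in any case, I would only worry about performance if, for some unfathomable reason, it actually turns out to be a problem. You have not told us what the performance requirements of your product are. Without performance requirements and without measurements, any talk of performance is usually unwarranted worry and tends to lead to premature optimization.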

Answer 3

When processing large text content, you may need to do some text manipulation.

Watch out for string concatenation. Use a StringBuffer or StringBuilder to concatenate strings.
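As an illustration of the concatenation advice, here is a small sketch that formats many rows with a single `StringBuilder` instead of repeated `+=` on a `String`, which would copy the growing text on every step. `RowFormatter` and `format` are hypothetical names. `StringBuilder` is the unsynchronized variant and is the usual choice when only one thread builds the string.

```java
import java.util.Arrays;
import java.util.List;

class RowFormatter {
    // Joins the fields of each row with ", " and ends each row with a newline.
    static String format(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(row[i]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"fr", "Paris"},
                new String[] {"jp", "Tokyo"});
        System.out.print(format(rows));
    }
}
```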

Answer 4

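The fastest CSV parser will be univocity-parsers. There are many ways to solve this problem, and the following approach is flexible enough to give you results at a decent speed. The example below uses a 150MB CSV file with 1.3 million rows and runs the search in about 1 second:

First, create a `RowProcessor`

Here we extend one of the existing ones that come with the library: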

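import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowListProcessor;

public class CsvSearch extends RowListProcessor {
    //value to be searched for
    private final String stringToMatch;

    //name of column to match (if you have headers)
    private final String columnToMatch;

    //position of column to match (if you don't have headers)
    private int indexToMatch = -1;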

    public CsvSearch(String columnToMatch, String stringToMatch){
        this.columnToMatch = columnToMatch;
        this.stringToMatch = stringToMatch.toLowerCase(); //lower case to make the search case-insensitive
    }

    public CsvSearch(int columnToMatch, String stringToMatch){
        this((String) null, stringToMatch); //no column name here: the column is matched by position
        this.indexToMatch = columnToMatch;
    }

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if(indexToMatch == -1) {
            //initializes the index to match
            indexToMatch = context.indexOf(columnToMatch);
        }

        String value = row[indexToMatch];
        if(value != null && value.toLowerCase().contains(stringToMatch)) {
            super.rowProcessed(row, context); // default behavior of the RowListProcessor: add the row into a List.
        }
        // else skip the row.
    }
}

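Configure the parser and run

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.List;

// let's measure the time roughly
long start = System.currentTimeMillis();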

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); //extract headers from the first row

CsvSearch search = new CsvSearch("City", "Paris");

//We instruct the parser to send all rows parsed to your custom RowProcessor.
settings.setProcessor(search);

//Finally, we create a parser
CsvParser parser = new CsvParser(settings);

//And parse! All rows are sent to your custom RowProcessor (CsvSearch)
//I'm using a 150MB CSV file with 1.3 million rows.
parser.parse(new File("/tmp/data/worldcitiespop.txt"));

//get the collected rows from our processor
List<String[]> results = search.getRows();

//Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
System.out.println("Rows matched: " + results.size());
System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

This produced the following output on my computer (a 2015 MacBook Pro):

Rows matched: 218
Time taken: 997 ms

The returned results look like this:

[af, parisang, Parisang, 08, null, 33.180704, 67.470836]
[af, qaryeh-ye bid-e parishan, Qaryeh-ye Bid-e Parishan, 06, null, 33.242727, 63.389834]
[ar, parish, Parish, 01, null, -36.518335, -59.633313]
[at, parisdorf, Parisdorf, 03, null, 48.566667, 15.85]
[au, paris creek, Paris Creek, 05, null, -35.216667, 138.8]
[az, hayi paris, Hayi Paris, 21, null, 40.449626, 46.55542]
[az, hay paris, Hay Paris, 21, null, 40.449626, 46.55542]
[az, rousi paris, Rousi Paris, 21, null, 40.435789, 46.510691]
[az, rrusi paris, Rrusi Paris, 21, null, 40.435789, 46.510691]
[bb, parish land, Parish Land, 01, null, 13.0666667, -59.5166667]
... (and many more)

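If you select the columns to parse and ignore everything you don't need, you can gain even more speed. Just call `settings.selectFields("City");` before processing the file to instruct the parser to generate Strings only for the "City" column.

Hope this helps. Disclosure: I am the author of this library. It is open source and free (Apache v2.0 license).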

4 answers

Answer 1: score 5, accepted, 2016-08-28 01:11:43
Answer 2: score 3, 2016-08-28 01:11:58
Answer 3: score 0, 2016-08-28 01:35:50
Answer 4: score 0, 2016-08-28 13:58:59
