
OutOfMemoryError while parsing CSV

I have a huge CSV file (500 MB) with 400k records in it:

id, name, comment, text
1, Alex, Hello, I believe in you

The text column contains a lot of information and sentences. I want to take this column ("Text"), replace all non-alphabetic characters with " ", and list its words from most frequent to least frequent, limited to the top 1000. This is what my code looks like. I'm using the CsvReader library:

CsvReader doc = new CsvReader("My CSV Name");
        doc.readHeaders();
        try {
            List<String> listWords = new ArrayList<>();
            while (doc.readRecord()) {
                listWords.addAll(Arrays.asList(doc.get("Text"/*my column name*/).replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+")));
            }

            Map<String, Long> sortedText = listWords.stream()
                    .collect(groupingBy(chr -> chr, counting()))
                    .entrySet().stream()
                    .sorted(Map.Entry.comparingByValue(Collections.reverseOrder()))
                    .limit(1000)
                    .collect(Collectors.toMap(
                            Map.Entry::getKey,
                            Map.Entry::getValue,
                            (e1, e2) -> e1,
                            LinkedHashMap::new
                    ));
            sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));
            doc.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            doc.close();
        }

After running it I get an OutOfMemoryError (GC overhead limit exceeded). What is the best way to do this? I can't increase the heap size; I have to work with the default settings.

A suggestion for the problem: instead of adding all the words to listWords, try counting the words as each CSV line is processed. That way only one counter per distinct word stays in memory, rather than every word occurrence in the file.

The code would be something like this:

CsvReader doc = null;

try {

    doc = new CsvReader("My CSV Name");
    doc.readHeaders();

    Map<String, Long> mostFrequent = new HashMap<String, Long>();

    while (doc.readRecord()) {

        // Count each word as soon as its row is read, so only the counts stay in memory.
        String[] words = doc.get("text"/*my column name*/).replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+");
        for (String word : words) {
            if (!word.isEmpty()) { // a cell without any letters yields a single empty token
                mostFrequent.merge(word, 1L, Long::sum);
            }
        }
    }

    Map<String, Long> sortedText = mostFrequent.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(1000)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (e1, e2) -> e1, LinkedHashMap::new));

    sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));

} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close once, here; doc is still null if the constructor threw.
    if (doc != null) {
        doc.close();
    }
}
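
If the map of distinct words itself gets large, sorting every entry just to keep 1000 of them is more work than needed. Below is a minimal sketch of the same per-row counting combined with a bounded min-heap for the top-N selection. It uses only the JDK; the class name WordFrequency and the helpers countWords and topN are made up for illustration and are not part of CsvReader. In the real program, each string passed to countWords would be the value of doc.get("text") inside the readRecord() loop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class, not part of the CsvReader library.
class WordFrequency {

    // Adds the words of one text cell to the running counts.
    static void countWords(String text, Map<String, Long> counts) {
        String[] words = text.replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+");
        for (String word : words) {
            if (!word.isEmpty()) { // a cell without letters yields one empty token
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    // Returns the n most frequent entries, highest count first.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        // Min-heap ordered by count: the head is the weakest current candidate.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the least frequent candidate
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        // Stand-in rows; the real values would come from the CSV reader.
        String[] rows = {"a b c", "A, b!", "a 42"};
        for (String row : rows) {
            countWords(row, counts);
        }
        for (Map.Entry<String, Long> e : topN(counts, 2)) {
            System.out.println("Word: " + e.getKey() + " || Count: " + e.getValue());
        }
    }
}
```

The heap never holds more than n + 1 entries, so the selection step needs memory proportional to the limit rather than to the number of distinct words. Words with equal counts may come out in any order relative to each other.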


 