OutOfMemoryError while parsing CSV
I have a huge CSV file (500 MB) with 400k records in it:
id, name, comment, text
1, Alex, Hello, I believe in you
The text column contains a lot of information and full sentences. I want to take this column ("Text"), replace all non-alphabetic characters with " ", and list the words sorted from most frequent to least frequent, limited to the top 1000. This is what my code looks like (I'm using the CsvReader library):
CsvReader doc = new CsvReader("My CSV Name");
doc.readHeaders();
try {
    List<String> listWords = new ArrayList<>();
    while (doc.readRecord()) {
        listWords.addAll(Arrays.asList(doc.get("Text"/*my column name*/)
                .replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+")));
    }
    Map<String, Long> sortedText = listWords.stream()
            .collect(groupingBy(chr -> chr, counting()))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Collections.reverseOrder()))
            .limit(1000)
            .collect(Collectors.toMap(
                    Map.Entry::getKey,
                    Map.Entry::getValue,
                    (e1, e2) -> e1,
                    LinkedHashMap::new
            ));
    sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));
    doc.close();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    doc.close();
}
After running this I get an OutOfMemoryError (GC overhead limit exceeded). What is the best way to do this? I can't increase my heap size; I have to work with the default settings.
A suggestion for the problem: instead of adding all the words to listWords, count the words per CSV line as each record is processed. The code would be something like this:
CsvReader doc = null;
try {
    doc = new CsvReader("My CSV Name");
    doc.readHeaders();
    Map<String, Long> mostFrequent = new HashMap<String, Long>();
    while (doc.readRecord()) {
        // Update the counts for each record as it is read, instead of keeping every word in memory
        Arrays.asList(doc.get("text"/*my column name*/)
                .replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+"))
                .stream().forEach(word -> {
                    if (mostFrequent.containsKey(word)) {
                        mostFrequent.put(word, mostFrequent.get(word) + 1);
                    } else {
                        mostFrequent.put(word, 1L);
                    }
                });
    }
    Map<String, Long> sortedText = mostFrequent.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(1000)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                    (e1, e2) -> e1, LinkedHashMap::new));
    sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (doc != null) {
        doc.close();
    }
}