How to process a large CSV file or read a large CSV file in chunks

I have very large CSV files that I'm trying to iterate through. I'm using opencsv, and I'd like to use CsvToBean so that I can dynamically set the column mappings from a database. My question is how to do this without reading the entire file into a list. I'm trying to prevent memory errors.

I'm currently loading the entire result set into a list, like so:

List<MyObject> myObjects = csv.parse(strat, getReader("file.txt"));

for (MyObject myObject : myObjects) {
    System.out.println(myObject);
}

But I found this iterator approach, and I'm wondering whether it iterates one row at a time rather than loading the entire file at once:

Iterator myObjects = csv.parse(strat, getReader("file.txt")).iterator();

while (myObjects.hasNext()) {
    MyObject myObject = (MyObject) myObjects.next();
    System.out.println(myObject);
}

So my question is: what is the difference between an Iterator and a List?

The enhanced for loop ( for (MyObject myObject : myObjects) ) is implemented using an Iterator: it requires that the instance returned by csv.parse(strat, getReader("file.txt")) implement the Iterable interface, whose iterator() method returns an Iterator. There is therefore no performance difference between the two code snippets. In both, csv.parse has already read the whole file into a List before the loop starts, so neither snippet reduces memory use.
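To make the equivalence concrete, here is a minimal sketch that runs the same traversal both ways. The class name ForEachDesugarDemo and the sample rows are made up for illustration; the in-memory list is only a stand-in for whatever csv.parse(...) returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class; not part of opencsv.
class ForEachDesugarDemo {

    // Enhanced for loop: the compiler calls rows.iterator() behind the scenes.
    static List<String> withEnhancedFor(List<String> rows) {
        List<String> seen = new ArrayList<>();
        for (String row : rows) {
            seen.add(row);
        }
        return seen;
    }

    // Explicit iterator: roughly what the enhanced for loop expands to.
    static List<String> withExplicitIterator(List<String> rows) {
        List<String> seen = new ArrayList<>();
        Iterator<String> it = rows.iterator();
        while (it.hasNext()) {
            seen.add(it.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Stand-in for the fully populated list returned by a parse call.
        List<String> rows = Arrays.asList("a,1", "b,2", "c,3");
        // Both traversals visit the same elements in the same order.
        System.out.println(withEnhancedFor(rows).equals(withExplicitIterator(rows)));
    }
}
```

Either way, the list passed in already exists in full before the first element is visited.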

P.S.

In the second snippet, don't use the raw Iterator type; use Iterator<MyObject>:

Iterator<MyObject> myObjects = csv.parse(strat, getReader("file.txt")).iterator();

while (myObjects.hasNext()) {
    MyObject myObject = myObjects.next();
    System.out.println(myObject);
}

Reading a large CSV file all at once is not a good solution. A better approach is to read the file in chunks. You can use multiple threads: one to read the data from the file, and a few others to perform the business logic. For more details on reading CSV data in chunks, see "How to parse chunk by chunk a large CSV file and bulk insert to a database"; a multi-threaded solution is also available.
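One way this could look, as a rough sketch rather than the linked solution: a single reader thread groups lines into fixed-size chunks and hands them to worker threads through a bounded queue. Everything here is an assumption for illustration: the class ChunkedCsvDemo, the END sentinel, the queue capacity of 4, and the naive comma split (which ignores quoted fields, so a real CSV parser such as opencsv should replace it).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of opencsv.
class ChunkedCsvDemo {

    // Sentinel chunk telling a worker that no more input is coming.
    private static final List<String> END = new ArrayList<>();

    // Reads lines in chunks of chunkSize and processes them on workerCount threads.
    // Returns the total number of rows processed.
    static int process(Reader source, int chunkSize, int workerCount)
            throws IOException, InterruptedException {
        // Bounded queue: the reader blocks when workers fall behind, capping memory use.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(4);
        AtomicInteger processed = new AtomicInteger();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        List<String> chunk = queue.take();
                        if (chunk == END) {
                            break;
                        }
                        for (String line : chunk) {
                            // Placeholder for the business logic; naive split only.
                            String[] fields = line.split(",", -1);
                            if (fields.length > 0) {
                                processed.incrementAndGet();
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }

        try (BufferedReader reader = new BufferedReader(source)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    queue.put(chunk);
                    chunk = new ArrayList<>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) {
                queue.put(chunk); // final partial chunk
            }
        } finally {
            // One sentinel per worker so every worker terminates.
            for (int i = 0; i < workerCount; i++) {
                queue.put(END);
            }
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        Reader demo = new StringReader("a,1\nb,2\nc,3\nd,4\ne,5\n");
        System.out.println(process(demo, 2, 3));
    }
}
```

At most a handful of chunks are in memory at any moment, regardless of file size. Chunk size and worker count would need tuning against the actual workload.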

"What is the difference between Iterator and list?"

A List is a data structure that gives the user operations such as get(), toArray(), etc.

An Iterator only lets the user traverse a data structure, provided the data structure implements the Iterable interface (which all java.util Collection implementations do).

So List<MyObject> myObjects = csv.parse(strat, getReader("file.txt")); physically stores the data in myObjects,

and Iterator myObjects = csv.parse(strat, getReader("file.txt")).iterator(); only obtains an iterator over the list that csv.parse returns. The list is still built in full first, so this form saves no memory by itself.
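An iterator only helps with memory when it produces rows lazily instead of walking a list that already exists. The sketch below shows that idea using only the standard library. RowIterator is a hypothetical helper, not an opencsv class, and its comma split does not handle quoted fields. Newer opencsv releases are reported to expose an iterator on CsvToBean itself, which would be the natural choice if your version has it; check its documentation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper (not part of opencsv): yields one parsed row at a time.
class RowIterator implements Iterator<String[]> {
    private final BufferedReader reader;
    private String nextLine; // the only row buffered ahead of the caller

    RowIterator(Reader source) {
        this.reader = new BufferedReader(source);
        advance();
    }

    // Reads the next line from the underlying reader, or null at end of input.
    private void advance() {
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String[] next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String[] fields = nextLine.split(",", -1); // naive split, no quoting support
        advance();
        return fields;
    }

    public static void main(String[] args) {
        Iterator<String[]> rows = new RowIterator(new StringReader("a,1\nb,2\n"));
        while (rows.hasNext()) {
            System.out.println(String.join("|", rows.next()));
        }
    }
}
```

Here no list of all rows is ever built: each call to next() parses one line and reads one more from the source.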
