
Handling a Folder of Large CSV Files

I have a folder of large CSV files (around 25,000 files, and the count will keep increasing); almost all of them have more rows than the row limit for Excel (a bit over 1 million, I believe). Every row in every CSV file has 5 elements delimited by commas, but the number of rows varies from file to file.

One CSV File:
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
.
.
.
a3152685,b3152685,c3152685,d3152685,e3152685

My Reference File:
x1,y1
x2,y2
x3,y3
.
.
.
x397,y397

Essentially, I need to access only some of these rows (around 400) from every CSV file, based on my reference file. Wherever an x,y couple matches an a,b couple in any CSV file, I want to save that a,b,c,d,e row, together with the CSV file's title, somewhere else - preferably an Excel file, but I'm open to ideas.

I can work with Matlab, Python 2.7, MS Access (converting the CSV files to database files seemed like a good idea, if only I didn't have to do it for every single file - is there a way to batch that?) or MS Excel. I have never done any VBA, but if you have a VBA solution to this problem, I am also open to hearing it.

Let me know if you need any more clarification in case I wasn't clear enough.

You can find the limits of Office products here.

Matlab is good for working with these large files and large sets of files. Version 2014 has a lot of improvements for that, introducing datastore for CSV, and it now also works pretty well with Excel files.

Take a look at this tutorial:

http://blogs.mathworks.com/loren/2014/12/03/reading-big-data-into-matlab/

I have 3 csv files (file[1-3].csv) containing this:

a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
a3,b3,c3,d3,e3
a4,b4,c4,d4,e4
a5,b5,c5,d5,e5
a6,b6,c6,d6,e6
a7,b7,c7,d7,e7
a8,b8,c8,d8,e8
a9,b9,c9,d9,e9
a10,b10,c10,d10,e10

and a file varnames for the names of the columns:

A
B
C
D
E

Let's read the files:

>> datafile = 'csv-files/file1.csv';
>> headerfile = 'csv-files/varnames.txt';

>> fileID = fopen(headerfile);
>> varnames = textscan(fileID,'%s');
>> varnames = varnames{:};

>> ds = datastore(datafile,'ReadVariableNames',false);

>> ds.VariableNames = varnames


ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/csv-files/file1.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: false
              VariableNames: {'A', 'B', 'C' ... and 2 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 2 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'A', 'B', 'C' ... and 2 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 2 more}
                   ReadSize: 20000 rows


>> preview(ds)

ans = 

     A       B       C       D       E  
    ____    ____    ____    ____    ____

    'a1'    'b1'    'c1'    'd1'    'e1'
    'a2'    'b2'    'c2'    'd2'    'e2'
    'a3'    'b3'    'c3'    'd3'    'e3'
    'a4'    'b4'    'c4'    'd4'    'e4'
    'a5'    'b5'    'c5'    'd5'    'e5'
    'a6'    'b6'    'c6'    'd6'    'e6'
    'a7'    'b7'    'c7'    'd7'    'e7'
    'a8'    'b8'    'c8'    'd8'    'e8'

If we look at the ReadSize property, we see ReadSize: 20000 rows, so Matlab reads 20000 rows at a time for you to process. Since the data here has only 10 rows, I will change it to three:

>> ds.ReadSize=3

ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/csv-files/file1.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: false
              VariableNames: {'A', 'B', 'C' ... and 2 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 2 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'A', 'B', 'C' ... and 2 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 2 more}
                   ReadSize: 3 rows

>> reset(ds)
while hasdata(ds)
      T = read(ds);
      T.A
end

ans = 

    'a1'
    'a2'
    'a3'


ans = 

    'a4'
    'a5'
    'a6'


ans = 

    'a7'
    'a8'
    'a9'


ans = 

    'a10'

The T variable is then a table that you can write wherever you want. Note that every time you call read(ds) it advances by the amount assigned in ReadSize; this parameter can be given in rows or in files.

>> reset(ds)
>> T = read(ds);
>> T

T = 

     A       B       C       D       E  
    ____    ____    ____    ____    ____

    'a1'    'b1'    'c1'    'd1'    'e1'
    'a2'    'b2'    'c2'    'd2'    'e2'
    'a3'    'b3'    'c3'    'd3'    'e3'

>> writetable(T,'mySpreadsheet','FileType','spreadsheet')
>> reset(ds)
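
To connect this back to the question, the same datastore mechanics can drive the actual lookup: read the small reference file once, then loop over the folder of CSV files and keep only the rows whose a,b pair appears in the reference. The sketch below is untested, and the folder name, file names and output file are placeholders rather than anything from the question; the pairs are joined into a single string so a plain ismember call can do the matching.

% Rough sketch (untested); 'reference.csv', 'csv-files' and 'matches.csv' are placeholder names
refDS = datastore('reference.csv','ReadVariableNames',false);
refDS.VariableNames = {'X','Y'};
ref = readall(refDS);                       % the reference file is small (~400 rows)
refKey = strcat(ref.X,'|',ref.Y);           % one lookup key per reference pair

files = dir(fullfile('csv-files','*.csv'));
parts = {};                                 % matched rows, one table per file
for k = 1:numel(files)
    ds = datastore(fullfile('csv-files',files(k).name),'ReadVariableNames',false);
    ds.VariableNames = {'A','B','C','D','E'};
    while hasdata(ds)                       % read in ReadSize-row chunks
        T = read(ds);
        hit = ismember(strcat(T.A,'|',T.B),refKey);
        if any(hit)
            M = T(hit,:);
            M.SourceFile = repmat({files(k).name},sum(hit),1);
            parts{end+1} = M;
        end
    end
end
results = vertcat(parts{:});
% write to CSV rather than a spreadsheet in case the total number of matches exceeds Excel's row limit
writetable(results,'matches.csv');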

Just in case anybody needs an answer, I kind of handled it with MATLAB.

The datastore function in MATLAB is what I was looking for.

ds = datastore(MyReferenceFile);
TableExtracted = readall(ds);

Then the rest was just find(ismember) taking charge.

There is also the batch lookup feature of datastore (the size of the batch is assigned with ReadSize); however, it defaulted to 20000 and I guess that is also its limit. It was too slow for my liking, so I resorted to readall, and it was still pretty fast.
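
For anyone curious what that looks like in practice, here is a minimal sketch of the readall plus find(ismember) step on a single file. The file names and the combined-column key are illustrative placeholders, not the exact code used:

% Minimal sketch with placeholder file names
ref = readall(datastore('MyReferenceFile.csv','ReadVariableNames',false));
T = readall(datastore('OneLargeFile.csv','ReadVariableNames',false));

% with ReadVariableNames false the columns get default names Var1, Var2, ...
refKey = strcat(ref.Var1,'|',ref.Var2);
rowIdx = find(ismember(strcat(T.Var1,'|',T.Var2),refKey));

matches = T(rowIdx,:);     % the a,b,c,d,e rows to save elsewhere, e.g. with writetable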

This may be off topic, but I think you should consider SQL Server and SSIS. You can easily loop through all the files in a folder, load them all into SQL Server, and then move the files out of the folder. The next time files are dumped into your folder, run the process again on those new files. See the link below for all the details.

https://www.mssqltips.com/sqlservertip/2874/loop-through-flat-files-in-sql-server-integration-services/

Or, use pure SQL to do the work.

--BULK INSERT MULTIPLE FILES From a Folder 

--a table to loop thru filenames
--drop table ALLFILENAMES
CREATE TABLE ALLFILENAMES(WHICHPATH VARCHAR(255),WHICHFILE varchar(255))

--some variables
declare @filename varchar(255),
        @path     varchar(255),
        @sql      varchar(8000),
        @cmd      varchar(1000)


--get the list of files to process:
SET @path = 'C:\Dump\'
SET @cmd = 'dir ' + @path + '*.csv /b'
INSERT INTO  ALLFILENAMES(WHICHFILE)
EXEC Master..xp_cmdShell @cmd
UPDATE ALLFILENAMES SET WHICHPATH = @path where WHICHPATH is null


--cursor loop
declare c1 cursor for SELECT WHICHPATH,WHICHFILE FROM ALLFILENAMES where WHICHFILE like '%.csv%'
open c1
fetch next from c1 into @path,@filename
While @@fetch_status <> -1
  begin
  --bulk insert won't take a variable name, so make a sql and execute it instead:
   set @sql = 'BULK INSERT Temp FROM ''' + @path + @filename + ''' '
       + '     WITH ( 
               FIELDTERMINATOR = '','', 
               ROWTERMINATOR = ''\n'', 
               FIRSTROW = 2 
            ) '
print @sql
exec (@sql)

  fetch next from c1 into @path,@filename
  end
close c1
deallocate c1


--Extras

--delete from ALLFILENAMES where WHICHFILE is NULL
--select * from ALLFILENAMES
--drop table ALLFILENAMES

From here:

Import Multiple CSV Files to SQL Server from a Folder

Access will not handle this quantity of data, and as you already know, Excel won't even come close.

One more thing to consider is to use R, which is totally free and very fast.

We often encounter situations where we have data in multiple files, at different frequencies and on different subsets of observations, but we would like to match them to one another as completely and systematically as possible. In R, the merge() command is a great way to match two data frames together.

Just read the two data frames into R

mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)

Then, merge

myfulldata = merge(mydata1, mydata2)

As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.

What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two… so I could merge the 20 data frames together with 19 merge statements like this:

mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)

That's tedious. You may be looking for a simpler way. If you are, I wrote a function called multmerge() to solve your woes. Here's the code to define the function:

multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.csv(file=x, header=TRUE)})
  Reduce(function(x,y) {merge(x,y)}, datalist)
}

After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together and only those files you would like to merge. With this in mind, I have two tips:

Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder. In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.

Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, you use the following syntax:

mymergeddata = multmerge("C://R//mergeme")

After running this command, you have a fully merged data frame with all of your variables matched to each other. Of course, most of the details in matching and merging data come down to making sure that the common column is specified correctly, but given that, this function can save you a lot of typing.

Once everything is merged into 1 data frame, export that to a text file or a CSV file, and bulk load that into SQL Server.
