Extracting certain columns from a CSV file in C++

Question

I would like to know how I can extract / skip certain columns such as age and weight from a CSV file in C++.

~~Does it make more sense to extract the desired information after I loaded the entire csv file (if memory is not a problem)?~~

EDIT: If possible, I would like separate parts for reading, printing and modification.

If possible, I want to use only the STL. The content of my test csv file looks as follows:

*test.csv*

name;age;weight;height;test
Bla;32;1.2;4.3;True
Foo;43;2.2;5.3;False
Bar;None;3.8;2.4;True
Ufo;32;1.5;5.4;True

I load the test.csv file with the following C++ program that prints the file's content on the screen:

#include <iostream>
#include <vector>
#include <string>
#include <iomanip>
#include <fstream>
#include <sstream>

void readCSV(std::vector<std::vector<std::string> > &data, std::string filename);
void printCSV(const std::vector<std::vector<std::string>> &data);

int main(int argc, char** argv) {
    std::string file_path = "./test.csv";
    std::vector<std::vector<std::string> > data;
    readCSV(data, file_path);
    printCSV(data);
    return 0;
}

void readCSV(std::vector<std::vector<std::string> > &data, std::string filename) {
    char delimiter = ';';
    std::string line;
    std::string item;
    std::ifstream file(filename);
    while (std::getline(file, line)) {
        std::vector<std::string> row;
        std::stringstream string_stream(line);
        while (std::getline(string_stream, item, delimiter)) {
            row.push_back(item);
        }
        data.push_back(row);
    }
    file.close();
}

void printCSV(const std::vector<std::vector<std::string> > &data) {
    // Iterate by const reference to avoid copying every row and item
    for (const std::vector<std::string>& row : data) {
        for (const std::string& item : row) {
            std::cout << item << ' ';
        }
        std::cout << '\n';
    }
}

Any assistance you can provide would be greatly appreciated.

Answer 1

Basically, I already answered this question in a similar thread. Still, I will show a ready-to-use solution with a different approach and some explanation here.

One hint: you should make yourself more familiar with object-oriented programming and think over your design. Your read and write functions create an unnecessary dependency on a file or on std::cout. Rather than handing over a file name and opening the file inside the function, you should pass streams. With the C++ IO facilities, the function that I created does not care whether it reads from a file or a std::istringstream, or whether it writes to std::cout or a file stream.

All will be handled via the (overloaded) extractor and inserter operators.
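
To illustrate the point about streams, here is a minimal sketch of a reader that takes a std::istream& instead of a file name. The names readTable and Table are made up for this illustration, and the splitting deliberately reuses the std::getline approach from the question so that only the interface changes.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Illustrative alias, not part of the answer's code
using Table = std::vector<std::vector<std::string>>;

// Reads delimiter-separated rows from ANY input stream:
// a std::ifstream, a std::istringstream or std::cin all work.
Table readTable(std::istream& in, char delimiter = ';') {
    Table table;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> row;
        std::istringstream fields(line);
        std::string item;
        // Same splitting as in the question; only the source is generic now
        while (std::getline(fields, item, delimiter)) {
            row.push_back(item);
        }
        table.push_back(std::move(row));
    }
    return table;
}
```

A caller can then pass an opened std::ifstream for real data, or a std::istringstream holding a few lines when testing, without touching the function.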

Because I wanted the code to be a little more flexible, I made my struct a template. That lets me pass in the selected columns and reuse the same struct for other column combinations.

If you want fixed selected columns, you can delete the line with template and replace std::vector<size_t> selectedFields{ {Colums...} }; with std::vector<size_t> selectedFields{ {1,2} };

Later we add a using alias for the template to make it easier to handle and understand:

// Define data type for selected columns age and weight
using AgeAndWeight = SelectedColumns<1, 2>;

OK, let's first see the source code and then try to understand.

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <fstream>
#include <initializer_list>
#include <iterator>
#include <algorithm>

std::regex re{ ";" };

// Proxy for reading and splitting a line, extracting certain fields, and some simple output
template<size_t ... Colums>
struct SelectedColumns {
    std::vector<std::string> data{};
    std::vector<size_t> selectedFields{ {Colums...} };

    // Overload the extractor operator
    friend std::istream& operator >> (std::istream& is, SelectedColumns& sl) {

        // Read a complete line and check, if it could be read
        if (std::string line{}; std::getline(is, line)) {

            // Now split the line into tokens
            std::vector tokens(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {});

            // Clear old data
            sl.data.clear();

            // So, and now copy the selected columns into our data vector
            for (const size_t& column : sl.selectedFields) 
                if (column < tokens.size()) sl.data.push_back(tokens[column]);
        }
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream & os, const SelectedColumns & sl) {
        std::copy(sl.data.begin(), sl.data.end(), std::ostream_iterator<std::string>(os, "\t"));
        return os;
    }
};

// Define data type for selected columns age and weight
using AgeAndWeight = SelectedColumns<1U, 2U>;

const std::string fileName{ "./test.csv" };

int main() {

    // Open the csv file and check, if it is open
    if (std::ifstream csvFileStream{ fileName }; csvFileStream) {

        // Read complete csv file and extract age and weight columns        
        std::vector sc(std::istream_iterator<AgeAndWeight>(csvFileStream), {});

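        // Now all data is available in the vector sc. Modify something:
        // set the age in row 3 (the "Bar" line of test.csv), if that row exists
        if (sc.size() > 3 && !sc[3].data.empty()) sc[3].data[0] = "77";

        // Show some debug output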
        std::copy(sc.begin(), sc.end(), std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));

        // By the way, you could also write the 2 lines above in one line.
        //std::copy(std::istream_iterator<AgeAndWeight>(csvFileStream), {}, std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));

    }
    else std::cerr << "\n*** Error: Could not open source file\n\n";
    return 0;
}

One major task here is to split a line with CSV Data into its tokens. Let us have a look at this.

Splitting a string into tokens:

What do people expect from a function when they read the name getline?

Most people would say: hm, I guess it will read a complete line from somewhere. And guess what, that was the basic intention for this function: read a line from a stream and put it into a string.

But std::getline has some additional functionality: it accepts an optional delimiter character that replaces the newline.

And this led to a major misuse of this function for splitting up std::strings into tokens.

Splitting strings into tokens is a very old task. Very early C already had the function strtok, which still exists, even in C++, as std::strtok. Please see this std::strtok example:

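std::vector<std::string> data{};
// Note: std::strtok overwrites the delimiters inside line with '\0', so line must be
// a non-const std::string (non-const data() requires C++17) and is modified by this loop
for (char* token = std::strtok(line.data(), ","); token != nullptr; token = std::strtok(nullptr, ","))
    data.push_back(token);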

Simple, right?

But because of the additional functionality of std::getline, it has been heavily misused for tokenizing strings. If you look at the top question/answer on how to parse a CSV file (please see here), you will see what I mean.

People use std::getline to read a text line, a string, from the original stream, then stuff it into a std::istringstream and use std::getline with a delimiter again to parse the string into tokens. Weird.

But for many years now we have had a dedicated facility for tokenizing strings, explicitly designed for that purpose. It is the

std::sregex_token_iterator

And since we have such a dedicated facility, we should simply use it.

This thing is an iterator for iterating over a string, hence the name starts with an s. The first two arguments define the range of input to operate on, followed by a std::regex describing what should or should not be matched in the input string. The last parameter selects the matching strategy. The end iterator is simply default constructed.

0 --> give me the stuff that I defined in the regex (optional, since 0 is the default)
-1 --> give me what is NOT matched by the regex, that is, the text between the matches.
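
The difference between the two modes can be seen with a small helper. This is only a sketch, and tokenize is a hypothetical name used for the illustration. It collects whatever the iterator yields for the given last parameter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the tokens that std::sregex_token_iterator produces for `submatch`.
// `pattern` must be an lvalue that outlives the call (the iterator keeps a pointer to it).
std::vector<std::string> tokenize(const std::string& text, const std::regex& pattern, int submatch) {
    std::sregex_token_iterator first(text.begin(), text.end(), pattern, submatch);
    std::sregex_token_iterator last;  // default constructed end iterator
    // Each sub-match is converted to a std::string while `text` is still alive
    return std::vector<std::string>(first, last);
}
```

With the regex ";" and the input "Bla;32;1.2", the mode -1 yields the three fields Bla, 32 and 1.2, while the mode 0 yields the two matched delimiters.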

We can use this iterator for storing the tokens in a std::vector . The std::vector has a range constructor, which takes 2 iterators as parameter, and copies the data between the first iterator and 2nd iterator to the std::vector. The statement

std::vector tokens(std::sregex_token_iterator(s.begin(), s.end(), re, -1), {});

defines a variable “tokens” as a std::vector and uses the so called range-constructor of the std::vector. Please note: I am using C++17 and can define the std::vector without template argument. The compiler can deduce the argument from the given function parameters. This feature is called CTAD ("class template argument deduction").

Additionally, you can see that I do not use the "end()"-iterator explicitly.

This iterator will be constructed from the empty brace-enclosed default initializer with the correct type, because it will be deduced to be the same as the type of the first argument due to the std::vector constructor requiring that.
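
One detail is worth spelling out: with CTAD the element type is taken from the iterator, so the vector holds std::ssub_match objects rather than std::string. A sub-match only refers to characters inside the original string. The following sketch (splitLine is an illustrative name, and the local regex mirrors the global re from the code above) checks the deduced type and copies the tokens into real strings.

```cpp
#include <regex>
#include <string>
#include <type_traits>
#include <vector>

std::vector<std::string> splitLine(const std::string& line) {
    static const std::regex delimiter{ ";" };

    // CTAD form as used in the answer: the element type is deduced from the iterator
    std::vector tokens(std::sregex_token_iterator(line.begin(), line.end(), delimiter, -1), {});
    static_assert(std::is_same_v<decltype(tokens)::value_type, std::ssub_match>,
                  "CTAD yields sub-matches, not strings");

    // Sub-matches point into `line`, so copy them out before `line` goes away
    std::vector<std::string> result;
    result.reserve(tokens.size());
    for (const std::ssub_match& token : tokens) {
        result.push_back(token.str());
    }
    return result;
}
```

If you prefer to skip the intermediate step, writing std::vector<std::string> explicitly with the same two iterator arguments converts each sub-match to a string during construction.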

You can read any number of tokens in a line and put them into the std::vector.

But you can do even more. You can validate your input. If you use 0 as the last parameter, you can define a std::regex that describes what a valid token looks like, and you get only the valid tokens.
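
As a rough sketch of that idea, the following helper keeps only the pieces of a line that look like non-negative decimal numbers. The name extractNumbers and the pattern are assumptions made for this example. Note that the pattern is searched anywhere in the line, so a field such as Room12 would still contribute 12, and the column positions of the results are lost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every substring of `line` that matches the number pattern (mode 0 = the matches)
std::vector<std::string> extractNumbers(const std::string& line) {
    static const std::regex number{ R"(\d+(\.\d+)?)" };
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), number, 0),
        std::sregex_token_iterator());
}
```

For the line Bar;None;3.8;2.4;True from test.csv this returns 3.8 and 2.4, because None and True do not match the pattern.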

Overall, using the dedicated facility is superior to the misused std::getline, and people should simply use it.

Some people complain about the overhead, and they are right, but how many of them are processing big data? And even then, the approach would probably be to use std::string::find and std::string::substr, or std::string_view, or whatever.
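
For completeness, a minimal sketch of that lower-overhead alternative, assuming C++17 and a single-character delimiter. splitView is a hypothetical name. It does not copy any characters, so the returned views are only valid as long as the buffer they point into.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `line` at every `delimiter` using find and substr on a std::string_view.
// The views refer to the caller's buffer, which must outlive the result.
std::vector<std::string_view> splitView(std::string_view line, char delimiter = ';') {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = line.find(delimiter, start);
        if (end == std::string_view::npos) {
            fields.push_back(line.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

Unlike the regex version, this sketch reports a trailing empty field when the line ends with the delimiter, so the two are not drop-in replacements for each other.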

So, now to further topics.

In the extractor, we first read a complete line from the source stream and check whether that worked, or whether we hit end of file or any other error.

Then we tokenize the string we just read, as described above.

And then we copy only the selected columns from the tokens into our resulting data. This is done in a simple for loop. Here we also check the boundaries, because somebody could specify invalid selected columns, or a line could have fewer tokens than expected.

So the body of the extractor is very simple. Just 5 lines of code.
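
The selection step with its bounds check can also be looked at in isolation. The following is only a sketch, with selectColumns as an illustrative name that works on an already split row.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies the requested columns of one row.
// Indices that the row does not have are silently skipped.
std::vector<std::string> selectColumns(const std::vector<std::string>& row,
                                       const std::vector<std::size_t>& columns) {
    std::vector<std::string> selected;
    for (const std::size_t column : columns) {
        if (column < row.size()) {
            selected.push_back(row[column]);
        }
    }
    return selected;
}
```

Asking for columns 1, 2 and 9 of the row Bla;32;1.2;4.3;True gives 32 and 1.2. Skipping silently keeps the loop simple, but it also means a short line produces a shorter result without any warning.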

Then, again: you should start using object-oriented features in C++. In C++ you can put data and the methods that operate on that data into one object, so that the outside world does not need to care about the object's internals. For example, your readCSV and printCSV functions should be part of a struct (or class).

As the next step, we will not use your “read” and “print” functions. We will use the dedicated functions for stream IO, the extractor operator >> and the inserter operator <<, and we will overload them for our struct.

In function main we open the source file and check whether the open was successful. By the way, all input and output operations should be checked for success.

Then we use the next iterator, the std::istream_iterator, together with our “AgeAndWeight” type and the input file stream. Here, too, we use CTAD and the default-constructed end iterator. The std::istream_iterator repeatedly calls the AgeAndWeight extractor operator until all lines of the source file are read.

For output, we use the std::ostream_iterator. It calls the inserter operator for “AgeAndWeight” until all data is written.

solution1
1 2020-04-04 14:04:18
