Extracting certain columns from a CSV file in C++

Question

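I would like to know how to extract or skip certain columns, for example age and weight, from a CSV file in C++.

~~Does it make more sense to extract the desired information after loading the entire CSV file (if memory is not a problem)?~~

EDIT: If possible, I would like to have a reading, a printing, and a modifying part.

If possible, I want to use only the STL. The content of my test CSV file looks as follows: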

*test.csv*

name;age;weight;height;test
Bla;32;1.2;4.3;True
Foo;43;2.2;5.3;False
Bar;None;3.8;2.4;True
Ufo;32;1.5;5.4;True

I use the following C++ program to load the test.csv file; it prints the content of the file on the screen:

#include <iostream>
#include <vector>
#include <string>
#include <iomanip>
#include <fstream>
#include <sstream>

void readCSV(std::vector<std::vector<std::string> > &data, std::string filename);
void printCSV(const std::vector<std::vector<std::string>> &data);

int main(int argc, char** argv) {
    std::string file_path = "./test.csv";
    std::vector<std::vector<std::string> > data;
    readCSV(data, file_path);
    printCSV(data);
    return 0;
}

void readCSV(std::vector<std::vector<std::string> > &data, std::string filename) {
    char delimiter = ';';
    std::string line;
    std::string item;
    std::ifstream file(filename);
    while (std::getline(file, line)) {
        std::vector<std::string> row;
        std::stringstream string_stream(line);
        while (std::getline(string_stream, item, delimiter)) {
            row.push_back(item);
        }
        data.push_back(row);
    }
    file.close();
}

void printCSV(const std::vector<std::vector<std::string> > &data) {
    for (std::vector<std::string> row: data) {
        for (std::string item: row) {
            std::cout << item << ' ';
        }
        std::cout << std::endl;
    }
}

Any help you can provide would be greatly appreciated.

Answer 1

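Basically, I have already answered this question in a similar thread. Nevertheless, I will show a ready-to-use solution here, with a different approach and some explanations.

One hint: you should become more familiar with object-oriented programming, and think about your design. In your read and print functions you create an unnecessary dependency on a file or on std::cout. So, instead of handing over a file name and opening the file inside the function, you should work with streams. With the C++ IO facilities used in the function I created, it does not matter whether we read from a file or a std::istringstream, or write to std::cout or a file stream.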
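To illustrate the hint about depending on streams instead of file names, here is a minimal sketch. The names `Table`, `readCsv`, and `printCsv` are hypothetical; the splitting logic is taken unchanged from the question's code, and only the interface differs.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using Table = std::vector<std::vector<std::string>>;

// Reads delimiter-separated rows from any input stream
// (std::ifstream, std::istringstream, std::cin, ...).
Table readCsv(std::istream& is, char delimiter = ';') {
    Table table;
    std::string line;
    while (std::getline(is, line)) {
        std::vector<std::string> row;
        std::istringstream lineStream(line);
        std::string item;
        while (std::getline(lineStream, item, delimiter)) {
            row.push_back(item);
        }
        table.push_back(row);
    }
    return table;
}

// Writes the table to any output stream (std::cout, std::ofstream, ...).
void printCsv(std::ostream& os, const Table& table) {
    for (const auto& row : table) {
        for (const auto& item : row) {
            os << item << ' ';
        }
        os << '\n';
    }
}
```

With this interface the caller decides where the data comes from: an opened `std::ifstream` for the real file, or a `std::istringstream` holding a few lines for a quick test.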

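All of this is handled via the (overloaded) extractor and inserter operators.

Because I wanted the code to be a bit more flexible, I made my struct a template, so that the selected columns can be passed in and the same struct can be reused for other column combinations.

If you want to fix the selected columns, you can delete the line with template and replace std::vector<size_t> selectedFields{ {Colums...} }; with std::vector<size_t> selectedFields{ {1,2} };

Later, we add a using alias for the template, for easier handling and understanding: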

// Define data type for selected columns age and weight
using AgeAndWeight = SelectedColumns<1, 2>;

OK, let us first look at the source code and then try to understand it.

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <fstream>
#include <initializer_list>
#include <iterator>
#include <algorithm>

std::regex re{ ";" };

// Proxy for reading and splitting a line, extracting certain fields, and some simple output
template<size_t ... Colums>
struct SelectedColumns {
    std::vector<std::string> data{};
    std::vector<size_t> selectedFields{ {Colums...} };

    // Overload the extractor operator
    friend std::istream& operator >> (std::istream& is, SelectedColumns& sl) {

        // Read a complete line and check, if it could be read
        if (std::string line{}; std::getline(is, line)) {

            // Now split the line into tokens
            std::vector tokens(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {});

            // Clear old data
            sl.data.clear();

            // So, and now copy the selected columns into our data vector
            for (const size_t& column : sl.selectedFields) 
                if (column < tokens.size()) sl.data.push_back(tokens[column]);
        }
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream & os, const SelectedColumns & sl) {
        std::copy(sl.data.begin(), sl.data.end(), std::ostream_iterator<std::string>(os, "\t"));
        return os;
    }
};

// Define data type for selected columns age and weight
using AgeAndWeight = SelectedColumns<1U, 2U>;

const std::string fileName{ "./test.csv" };

int main() {

    // Open the csv file and check, if it is open
    if (std::ifstream csvFileStream{ fileName }; csvFileStream) {

        // Read complete csv file and extract age and weight columns        
        std::vector sc(std::istream_iterator<AgeAndWeight>(csvFileStream), {});

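        // Now all data is available in the vector sc. Do something with it,
        // for example overwrite the age in the fourth line, if that line exists
        if (sc.size() > 3 && !sc[3].data.empty()) sc[3].data[0] = "77";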

        // Show some debug output
        std::copy(sc.begin(), sc.end(), std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));

        // By the way, you could also write the 2 lines above in one line.
        //std::copy(std::istream_iterator<AgeAndWeight>(csvFileStream), {}, std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));

    }
    else std::cerr << "\n*** Error: Could not open source file\n\n";
    return 0;
}
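For comparison, the fixed-column variant mentioned before the source code could look roughly like the following sketch. The struct name `AgeAndWeightFixed` is hypothetical, the regex is moved into the operator as a local static, and the tokens are stored as std::string; otherwise it follows the template version above.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Non-template variant: the selected columns are fixed to 1 (age) and 2 (weight).
struct AgeAndWeightFixed {
    std::vector<std::string> data{};
    std::vector<std::size_t> selectedFields{ 1, 2 };

    friend std::istream& operator>>(std::istream& is, AgeAndWeightFixed& sl) {
        static const std::regex separator{ ";" };

        // Read a complete line and continue only if that worked
        if (std::string line{}; std::getline(is, line)) {
            // Split the line into tokens (the text between the separators)
            std::vector<std::string> tokens(
                std::sregex_token_iterator(line.begin(), line.end(), separator, -1), {});

            // Keep only the selected columns that exist in this line
            sl.data.clear();
            for (const std::size_t column : sl.selectedFields)
                if (column < tokens.size()) sl.data.push_back(tokens[column]);
        }
        return is;
    }
};
```

The trade-off is the one described above: this version is slightly easier to read, but a different column combination needs a second struct instead of a second alias.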

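One major task here is to split the line with CSV data into its tokens. Let us have a look at that.

Splitting a string into tokens:

What do people expect from a function when they read its name:

getline?

Most people will say: well, I guess it reads a complete line from somewhere. And guess what, that is the basic intention of this function: read a line from a stream and put it into a string.

But, as its documentation shows, std::getline has some additional functionality: it can also take a delimiter character.

And this has led to massive misuse of this function for splitting std::strings into tokens.

Splitting strings into tokens is a very old task. Very early C already had the function strtok, which still exists, even in C++, as std::strtok. See the std::strtok example: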

std::vector<std::string> data{};
for (char* token = std::strtok(const_cast<char *>(line.data()), ","); token != nullptr; token = std::strtok(nullptr, ",")) 
    data.push_back(token);

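Simple, right?

But because of the additional functionality of std::getline, it has been heavily misused for tokenizing strings. If you look at the top questions and answers on how to parse a CSV file, you will see what I mean.

People use std::getline to read a text line, a string, from the original stream, then stuff it into a std::istringstream and use std::getline with a delimiter again to parse the string into tokens. Weird.

But for many years now we have had a dedicated, special function for tokenizing strings, designed especially for that purpose. It is the

std::sregex_token_iterator

And since we have such a dedicated function, we should simply use it.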

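This thing is an iterator, for iterating over a string; hence the name starts with an s. The begin part defines the input range on which we operate, the end part is default constructed, and then there is a std::regex for what should or should not be matched in the input string. The type of matching strategy is given by the last parameter.

0 --> give me the parts that match the regex (this is the default, so the parameter is optional)
-1 --> give me the parts that do NOT match the regex, that is, the text between the matches.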
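The difference between the two values is easiest to see side by side. The following is a small sketch; the helper name `tokenize` is hypothetical and simply forwards the last parameter to std::sregex_token_iterator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects what std::sregex_token_iterator yields for the given submatch selector:
//   0  -> every match of the regex itself
//  -1  -> the text between the matches
std::vector<std::string> tokenize(const std::string& text, const std::regex& re, int submatch) {
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re, submatch),
        std::sregex_token_iterator());
}
```

With the regex `;` and the input `Bla;32;1.2`, the selector -1 yields the three fields `Bla`, `32`, and `1.2`, while the selector 0 yields the two separators.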

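We can use this iterator to store the tokens in a std::vector. The std::vector has a range constructor that takes two iterators as parameters and copies the data between the first and the second iterator into the std::vector. The statement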

std::vector tokens(std::sregex_token_iterator(s.begin(), s.end(), re, -1), {});

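defines a variable "tokens" as a std::vector and uses the so-called range constructor of std::vector. Please note: I am using C++17 and can define the std::vector without template arguments. The compiler can deduce the argument from the given function parameters. This feature is called CTAD ("class template argument deduction").

Additionally, you can see that I do not use the "end()" iterator explicitly.

This iterator is constructed from the empty brace-enclosed initializer with the correct type, because it is deduced to be the same type as the first argument, as the std::vector constructor requires.

You can read any number of tokens in one line and put them into a std::vector.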
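To make the CTAD remark concrete, the sketch below puts the spelled-out form next to the deduced one. The function names are hypothetical. One detail worth knowing: with CTAD the element type is deduced from the iterator, so the vector holds std::ssub_match objects that refer to the original string, not std::string copies.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <type_traits>
#include <vector>

// Explicit form: element type and end iterator are written out.
std::vector<std::string> splitExplicit(const std::string& line) {
    static const std::regex separator{ ";" };
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), separator, -1),
        std::sregex_token_iterator());
}

// C++17 form: CTAD deduces the element type, {} becomes the end iterator.
std::size_t countTokensCtad(const std::string& line) {
    static const std::regex separator{ ";" };
    std::vector tokens(std::sregex_token_iterator(line.begin(), line.end(), separator, -1), {});

    // The deduced element type is the iterator's value type, a sub-match.
    static_assert(std::is_same_v<decltype(tokens)::value_type, std::ssub_match>,
                  "CTAD deduces std::ssub_match here, not std::string");
    return tokens.size();
}
```

Because the sub-matches only point into `line`, they must not outlive it; converting them to std::string, as the extractor does when it pushes them into `data`, makes independent copies.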

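But you can do even more: you can validate your input. If you use 0 as the last parameter, you define a std::regex that also validates your input, and you get only the valid tokens.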
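As a sketch of that validation idea, assume we only want numeric fields. The pattern below (an integer or a simple decimal number) and the function name `extractNumbers` are illustrative assumptions, not part of the original solution.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns only those parts of the line that match the validation pattern.
std::vector<std::string> extractNumbers(const std::string& line) {
    // Assumed pattern: digits, optionally followed by a dot and more digits
    static const std::regex number{ R"(\d+(\.\d+)?)" };

    // Selector 0: yield the matches themselves instead of the text between them
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), number, 0),
        std::sregex_token_iterator());
}
```

For the line `Bar;None;3.8;2.4;True` this yields `3.8` and `2.4`. Note that the column positions are lost with this approach, and that such a simple pattern would also pick digits out of the middle of a longer field.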

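Overall, using the dedicated facility is superior to the misused std::getline, and people should simply use it.

Some people complain about the function overhead, and they are right, but how many of them are working with big data? And even then, the approach would probably be to use std::string::find and std::string::substr, or std::string_view, or something similar.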
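For readers who do care about the overhead, a find-based split over std::string_view might look like the following sketch. The function name `splitView` is hypothetical, and no performance claim is made here; it only shows the shape of the alternative.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits a line at every delimiter without copying any characters.
// The returned views refer to the caller's buffer and must not outlive it.
std::vector<std::string_view> splitView(std::string_view line, char delimiter = ';') {
    std::vector<std::string_view> tokens;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = line.find(delimiter, start);
        if (end == std::string_view::npos) {
            tokens.push_back(line.substr(start));   // last field
            break;
        }
        tokens.push_back(line.substr(start, end - start));
        start = end + 1;                            // continue after the delimiter
    }
    return tokens;
}
```

This version keeps empty fields, so `a;;b` produces three tokens with an empty one in the middle.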

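So, now on to further topics.

In the extractor, we first read a complete line from the source stream and check whether that worked, or whether we hit end of file or any other error.

Then we tokenize the just-read string as described above.

Then we copy only the selected columns from the tokens into our resulting data. This is done in a simple for loop. Here we also check the bounds, because somebody could have specified invalid selected columns, or a line could have fewer tokens than expected.

So the body of the extractor is rather simple: just 5 lines of code.

Then, again:

You should start using the object-oriented features of C++. In C++, you can put data and the methods that operate on this data into one object. The reason is that the outside world should not care about the internals of an object. For example, your readCSV and printCSV functions should be part of a struct (or class).

As a next step, we will not use your "read" and "print" functions. We will use the dedicated functions for stream IO: the extractor operator >> and the inserter operator <<. We will overload these standard IO operators for the struct.

In function main we open the source file and check whether the open was successful. By the way, you should check every input/output operation for success.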

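Then we use the next iterator, the std::istream_iterator, together with our "AgeAndWeight" type and the input file stream. Here, too, we use CTAD and the default-constructed end iterator. The std::istream_iterator repeatedly calls the AgeAndWeight extractor operator until all lines of the source file have been read.

For the output, we use the std::ostream_iterator. It calls the inserter operator for "AgeAndWeight" until all data has been written.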
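Because everything goes through streams, the read, modify, and print steps can be exercised without any file. The sketch below repeats a condensed copy of the struct so that it compiles on its own; the function `copyWithEdit` and its parameters are hypothetical additions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Condensed copy of the struct from the solution above
template <std::size_t... Columns>
struct SelectedColumns {
    std::vector<std::string> data{};
    std::vector<std::size_t> selectedFields{ Columns... };

    friend std::istream& operator>>(std::istream& is, SelectedColumns& sc) {
        static const std::regex separator{ ";" };
        if (std::string line{}; std::getline(is, line)) {
            const std::vector<std::string> tokens(
                std::sregex_token_iterator(line.begin(), line.end(), separator, -1), {});
            sc.data.clear();
            for (const std::size_t column : sc.selectedFields)
                if (column < tokens.size()) sc.data.push_back(tokens[column]);
        }
        return is;
    }
    friend std::ostream& operator<<(std::ostream& os, const SelectedColumns& sc) {
        std::copy(sc.data.begin(), sc.data.end(), std::ostream_iterator<std::string>(os, "\t"));
        return os;
    }
};

using AgeAndWeight = SelectedColumns<1, 2>;

// Reads all rows from `in`, replaces the first selected field of row `rowIndex`
// (if that row exists and is not empty), and writes all rows to `out`.
void copyWithEdit(std::istream& in, std::ostream& out,
                  std::size_t rowIndex, const std::string& newValue) {
    std::vector<AgeAndWeight> rows(std::istream_iterator<AgeAndWeight>{ in }, {});
    if (rowIndex < rows.size() && !rows[rowIndex].data.empty())
        rows[rowIndex].data[0] = newValue;
    std::copy(rows.begin(), rows.end(), std::ostream_iterator<AgeAndWeight>(out, "\n"));
}
```

Calling `copyWithEdit(file, std::cout, 3, "77")` with an opened `std::ifstream` would mirror what main does above; passing a `std::istringstream` and a `std::ostringstream` instead allows the same logic to be tested in memory.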

Solution 1
1 2020-04-04 14:04:18