
C++ file conversion: pipe delimited to comma delimited

I am trying to figure out how to turn an input file in pipe-delimited form into comma-delimited form. I have to open the file, read it into an array, convert it to comma-delimited form in an output CSV file, and then close all files. I have been told that the easiest way to do this is within Excel, but I am not quite sure how.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
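    ifstream inFile;
    ofstream outFile;      // closed at the end; opening it is part of the missing conversion step
    string inFileName;     // name of the input file, entered by the user
    string myArray[5];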

    cout << "Enter the input filename:";
    cin >> inFileName;

    inFile.open(inFileName);
    if(inFile.is_open())
    std::cout<<"File Opened"<<std::endl;

    // read file line by line into array
    cout<<"Read";

    for(int i = 0; i < 5; ++i)
    {
       getline(inFile, myArray[i]);   // read a whole line; >> would stop at the first space
    }

    // File conversion 

    // close input file
    inFile.close();

    // close output file
    outFile.close();
...

What I need to convert is:

Miles per hour|6,445|being the "second" team |5.54|9.98|6,555.00    
"Ending" game| left at "beginning"|Elizabeth, New Jersey|25.25|6.78|987.01   
|End at night, or during the day|"Let's go"|65,978.21|0.00|123.45    
Left-base night|10/07/1900|||4.07|777.23       
"Let's start it"|Start Baseball Game|Starting the new game to win  

What the output should look like in comma-delimited form:

Miles per hour,"6,445","being the ""second"" team member",5.54,9.98,"6,555.00",    
"""Ending"" game","left at ""beginning""","Denver, Colorado",25.25,6.78,987.01,      
,"End at night, during the day","""Let's go""","65,978.21",0.00,123.45,       
Left-base night, 10/07/1900,,,4.07,777.23,               
"""Let's start it""", Start Baseball Game, Starting the new game to win,         

I will show you a complete solution and explain it. But let's first have a look at it:

#include <iostream>
#include <vector>
#include <fstream>
#include <regex>
#include <string>
#include <algorithm>

// In this example I omit the manual input of the filenames; that exercise is left to somebody else.
// Fixed filenames are used here.
const std::string inputFileName("r:\\input.txt");
const std::string outputFileName("r:\\output.txt");

// The delimiter for the source csv file
std::regex re{ R"(\|)" };

std::string addQuotes(const std::string& s) {
    // Escape embedded quotes: replace every quote character (") with two of them ("")
    std::string result = std::regex_replace(s, std::regex(R"(")"), R"("")");

    // If there is any quote (") or comma in the field, then quote the complete field
    if (std::any_of(result.begin(), result.end(), [](const char c) { return ((c == '\"') || (c == ',')); })) {
        result = "\"" + result + "\"";
    }
    return result;
}


// Write all rows to the given output stream
void printData(const std::vector<std::vector<std::string>>& v, std::ostream& os) {
    // Go through all rows
    std::for_each(v.begin(), v.end(), [&os](const std::vector<std::string>& vs) {
        // Define delimiter
        std::string delimiter{ "" };
        // Show the delimited strings
        for (const std::string& s : vs) {
            os << delimiter << s;
            delimiter = ",";
        }
        os << "\n";
    });
}

int main() {


    // We open the output file first, because if it cannot be opened there is no point in doing the rest
    // Open output file and check if it could be opened
    if (std::ofstream outputFileStream(outputFileName); outputFileStream) {

        // Open the input file and check, if it could be opened
        if (std::ifstream inputFileStream(inputFileName); inputFileStream) {

            // This variable stores all lines from the CSV file, each split into its columns
            std::vector<std::vector<std::string>> data{};

            // Now read all lines of the CSV file and split it into tokens
            for (std::string line{}; std::getline(inputFileStream, line); ) {

                // Split line into tokens and add to our resulting data vector
                data.emplace_back(std::vector<std::string>(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}));
            }
            std::for_each(data.begin(), data.end(), [](std::vector<std::string>& vs) {
                std::transform(vs.begin(), vs.end(), vs.begin(), addQuotes);
            });

            // Output, to file
            printData(data, outputFileStream);

            // And to the screen
            printData(data, std::cout);
        }
        else {
            std::cerr << "\n*** Error: could not open input file '" << inputFileName << "'\n";
        }

    }
    else {
        std::cerr << "\n*** Error: could not open output file '" << outputFileName << "'\n";
    }
    return 0;
}

So, let's have a look. We have three functions:

  • main: reads the CSV file, splits it into tokens, converts them, and writes the result
  • addQuotes: adds quotes where necessary
  • printData: prints the converted data to an output stream

Let's start with main. main first opens the output file and then the input file.

The input file contains a kind of structured data, in a format usually called CSV (comma-separated values). Here, however, the delimiter is not a comma but a pipe symbol.

The result is typically stored in a 2D vector: the first dimension holds the rows and the second holds the columns.

So, what do we need to do next? First we need to read all complete text lines from the source stream. This can be done easily with a one-liner:

for (std::string line{}; std::getline(inputFileStream, line); ) {

As you can see, the for statement has a declaration/initialization part, then a condition, and then an expression that is evaluated at the end of each iteration. This is well known.

We first define a variable "line" of type std::string and use the default initializer to create an empty string. Then we use std::getline to read a complete line from the stream into that variable. std::getline returns a reference to the stream, and the stream has an overloaded bool conversion operator that yields false after a failure (including an attempt to read past the end of the file). So the for loop needs no additional end-of-file check. We also leave the last part of the for statement empty, because reading a line advances the file position automatically.

This gives us a very simple for loop for reading a complete file line by line.

Please note: defining the variable "line" in the for statement scopes it to the loop, meaning it is only visible inside the loop. This is generally a good way to avoid polluting the enclosing scope.
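The loop can be tried out without any file. The following is a minimal sketch, assuming a hypothetical helper named readLines that is not part of the answer; it accepts any std::istream, so a std::istringstream can stand in for the input file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer): collect every line of a stream.
std::vector<std::string> readLines(std::istream& in) {
    std::vector<std::string> lines;
    // 'line' exists only inside the loop. The loop ends as soon as std::getline
    // leaves the stream in a failed state, which normally happens at end of input.
    for (std::string line{}; std::getline(in, line); ) {
        lines.push_back(line);
    }
    return lines;
}
```

Called with a std::istringstream holding "a|b", an empty line, and "c|d", it should return those three lines in order, including the empty one.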

OK, now the next line:

data.emplace_back(std::vector<std::string>(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}));

Uh oh, what is that?

OK, let's go step by step. First, we obviously want to add something to our 2-dimensional data vector. We use std::vector's member function emplace_back. We could also have used push_back, but emplace_back can construct the new element in place inside the vector instead of copying it in. (As written, the argument is a temporary vector, which is moved into place rather than copied.)
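The difference between the two member functions can be sketched as follows. The function buildRows and the alias Row are made up for this illustration and do not appear in the answer.

```cpp
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical demo function (not from the answer): three ways to append a row.
std::vector<Row> buildRows() {
    std::vector<Row> rows;
    const Row existing{ "a", "b" };

    rows.push_back(existing);          // copies 'existing' into the vector
    rows.push_back(Row{ "c", "d" });   // the temporary row is moved, not copied
    // emplace_back forwards its arguments to a Row constructor, here the
    // range constructor, so the new row is built directly inside 'rows'.
    rows.emplace_back(existing.begin(), existing.end());

    return rows;
}
```

All three calls leave a complete row in the vector; they differ only in how many temporary objects are created along the way.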

And what do we want to add? A complete row, that is, a vector of columns: in our case a std::vector<std::string>. Because we want to build this vector directly from the tokens of the line, we use the vector's range constructor. The range constructor takes two iterators, a begin and an end iterator, as parameters, and copies all values in that range into the vector.
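The range constructor works with any pair of input iterators, not only with regex iterators. As a small sketch under that assumption, the hypothetical helper wordsOf below fills a vector from std::istream_iterator, whose default-constructed value also serves as the end iterator.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): collect whitespace-separated words
// by handing a begin and an end iterator to the vector's range constructor.
std::vector<std::string> wordsOf(const std::string& text) {
    std::istringstream in(text);
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>{});
}
```

For the text "Start Baseball Game" this should produce a vector of three words.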

So, we expect a begin and an end iterator. And what do we see here:

  • The begin iterator is: std::sregex_token_iterator(line.begin(), line.end(), re, -1)
  • And the end iterator is simply {}

But what is this thing, the sregex_token_iterator?

This is an iterator that iterates over patterns in a line. The pattern is given by a regex. The C++ regex library is very powerful, so it takes a while to learn, and I cannot cover it here. For our purpose, the basic functionality is this: you describe a pattern in a kind of meta-language, and std::sregex_token_iterator looks for that pattern in the line. In our case the pattern is very simple: the pipe symbol. Because | has a special meaning in a regex (alternation), it is escaped and written as \|. The last argument, -1, tells the iterator to return not the matches themselves but the text between the matches, that is, the fields between the pipes.

Now to the {} as the end iterator. {} performs default construction, and a default-constructed std::sregex_token_iterator is the end-of-sequence iterator. So, exactly what we need.
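To see the splitting on its own, here is a minimal sketch that wraps the same iterator pair in a hypothetical helper called splitOnPipe (the name and the function are assumptions made for illustration; the answer does the split inline).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): split one line at every '|'.
std::vector<std::string> splitOnPipe(const std::string& line) {
    static const std::regex pipe{ R"(\|)" };
    // Submatch index -1 selects the text between the matches,
    // not the matched delimiters themselves.
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), pipe, -1),
        std::sregex_token_iterator{});
}
```

Empty fields at the start or in the middle of a line come back as empty strings, so "|a" gives two tokens and "a||b" gives three. A line that ends in the delimiter, such as "a|", gives only one token, because the iterator does not emit an empty trailing remainder.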


After we have read all data, we transform the individual strings into the required output. This is done with std::transform and the function addQuotes. The strategy is to first double every quote character, so that each " becomes "".

Next, if the string contains any comma or quote, we additionally enclose the whole string in quotes.
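The same two rules can also be written without a regex. The following sketch uses a hypothetical function quoteField; it is an alternative illustration of the rules, not the addQuotes function from the answer.

```cpp
#include <string>

// Hypothetical regex-free variant (an illustration, not the answer's addQuotes).
std::string quoteField(const std::string& field) {
    std::string escaped;
    bool needsQuotes = false;
    for (const char c : field) {
        if (c == '"') {
            escaped += '"';       // emit an extra quote so that " becomes ""
            needsQuotes = true;
        } else if (c == ',') {
            needsQuotes = true;   // a comma forces the field to be quoted
        }
        escaped += c;
    }
    // Wrap the field in quotes only if it contained a quote or a comma.
    return needsQuotes ? '"' + escaped + '"' : escaped;
}
```

With this sketch, 5.54 stays unchanged, 6,445 becomes "6,445", and "Let's go" becomes """Let's go""", which matches the expected output shown in the question.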

Last but not least, a simple output function prints the converted data to a file and to the screen.
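The delimiter handling inside printData is a small idiom worth seeing in isolation: the delimiter starts out empty and is set to a comma after the first field, so no separator is written before the first field or after the last. A minimal sketch for a single row, assuming a hypothetical helper writeRow:

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): write one row as a comma-separated line.
void writeRow(const std::vector<std::string>& row, std::ostream& os) {
    std::string delimiter;   // empty before the first field
    for (const std::string& field : row) {
        os << delimiter << field;
        delimiter = ",";     // every later field is preceded by a comma
    }
    os << '\n';
}
```

Because the function takes a std::ostream&, the same call works for std::cout, a std::ofstream, or a std::ostringstream used in a test.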
