Converting .txt Spark Output to .csv

Currently, I am getting the output of a Spark job as a .txt file. I am trying to convert it to .csv.

.txt output (Dataset<String>)

John MIT Bachelor ComputerScience Mike UB Master ComputerScience

.csv output

NAME, UNIV, DEGREE, COURSE
John,MIT,Bachelor,ComputerScience
Mike,UB,Master,ComputerScience

I tried to collect it into a List, but I am not sure how to convert it to .csv and add the header.

This is a simple approach that converts the txt output into a data structure that can easily be written to a csv file.

The basic idea is to use the number of headers (which is the number of columns) to parse entries from the one-line txt output: every group of that many space-separated values becomes one line of the csv file.

Have a look at the code comments. Every "TODO 4 U" means work for you, mostly because I cannot really guess what you need to do at those positions in the code (like how to get the headers).

This is just a main method that does its work in a straightforward way. You may want to understand what it does and apply changes that make the code meet your requirements. Input and output are just Strings that you have to create, receive or process yourself.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TxtToCsv {

    public static void main(String[] args) {

        // TODO 4 U: get the values for the header somehow
        String headerLine = "NAME, UNIV, DEGREE, COURSE";

        // TODO 4 U: read the txt output
        String txtOutput = "John MIT Bachelor ComputerScience Mike UB Master ComputerScience";

        /*
         * then split the header line
         * (or do anything similar, I don't know where your header comes from)
         */
        String[] headers = headerLine.split(", ");

        // store the number of headers, which is the number of columns
        int amountOfColumns = headers.length;

        // split txt output data by space
        String[] data = txtOutput.split(" ");

        /*
         * declare a data structure that stores lists of Strings,
         * each one representing a line of the csv file;
         * a TreeMap keeps the lines sorted by line number
         */
        Map<Integer, List<String>> linesForCsv = new TreeMap<>();

        // create a list of Strings containing the headers and put it into the data structure
        List<String> columnHeaders = Arrays.asList(headers);
        linesForCsv.put(0, columnHeaders);

        // declare a line counter for the csv file
        int lineCounter = 0;
        // go through the txt output data in order to get the lines for the csv file
        for (int i = 0; i < data.length; i++) {
            // check if there is a new line to be created
            if (i % amountOfColumns == 0) {
                /*
                 * every time the number of headers is reached,
                 * create a new list for a new line in the csv file
                 */
                lineCounter++; // increment the line counter (even at 0 because the header row is inserted at 0)
                linesForCsv.put(lineCounter, new ArrayList<>()); // create a new line-list
            }
            // store the data in the current line-list
            linesForCsv.get(lineCounter).add(data[i]);
        }

        // print the lines stored in the map
        // TODO 4 U: write this to a csv file instead of just printing it to the console
        linesForCsv.forEach((lineNumber, line) ->
                System.out.println("Line " + lineNumber + ": " + String.join(",", line)));
    }
}
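
The last "TODO 4 U" (writing to a csv file) could be handled roughly as in the following sketch. It is only one possible way, not part of the original answer: the class name TxtToCsvFile, the helper toCsvLines and the file paths passed as arguments are made up for illustration, and it assumes Java 11 or newer (for Files.readString and Path.of) as well as values that contain neither spaces nor commas. It also rejects input whose number of values is not a multiple of the number of columns, which the main method above would silently accept.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TxtToCsvFile {

    // Hypothetical helper: builds the csv lines (header first) from the
    // header line and the one-line txt output.
    static List<String> toCsvLines(String headerLine, String txtOutput) {
        // split the header on commas, ignoring surrounding whitespace
        String[] headers = headerLine.trim().split("\\s*,\\s*");
        int columns = headers.length;

        // split the data on any run of whitespace (spaces, tabs, line breaks)
        String trimmed = txtOutput.trim();
        String[] data = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");

        // assumption: incomplete rows are an error rather than something to pad
        if (data.length % columns != 0) {
            throw new IllegalArgumentException(
                    "Expected a multiple of " + columns + " values but got " + data.length);
        }

        List<String> lines = new ArrayList<>();
        lines.add(String.join(",", headers));
        // every group of 'columns' values becomes one csv line
        for (int i = 0; i < data.length; i += columns) {
            lines.add(String.join(",", Arrays.copyOfRange(data, i, i + columns)));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: TxtToCsvFile <input.txt> <output.csv>");
            return;
        }
        // the paths are placeholders supplied by the caller
        Path input = Path.of(args[0]);
        Path output = Path.of(args[1]);

        String txtOutput = Files.readString(input, StandardCharsets.UTF_8);
        List<String> lines = toCsvLines("NAME, UNIV, DEGREE, COURSE", txtOutput);

        // Files.write puts each element of the list on its own line
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}
```

If a value can itself contain a comma or a space, this simple splitting and joining is not enough; in that case the values need quoting, or a csv library should do the writing.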
