Read a Big File (over 60 GB) and Write a New File

I have a file that is 60 GB in size and contains 200,000,000 rows. Its contents look like this:

source.txt

0.0 4.6 6.3 3.8 5.0 0.0 -3.8 -5.9 1.5 14.2 0.0 1.0 6.9 5.8 6.1 0.0 5.4 -7.1 0.9 6.8 0.0 -1.8 2.6 0.0 -11.5 -0.0 
0.0 13.4 -1.8 5.2 2.4 0.0 -7.1 -12.5 -2.8 11.8 0.0 2.0 5.5 3.5 8.2 0.0 9.2 -18.2 -3.4 1.7 0.0 -16.1 3.2 0.0 9.7 -0.1 
0.0 12.2 -2.0 7.2 0.1 0.0 -9.1 -11.8 -2.5 8.8 0.0 1.1 4.6 3.8 8.0 0.0 8.3 -18.5 -5.0 0.6 0.0 -14.3 2.8 0.0 10.6 -0.0 
0.0 10.6 -0.6 8.3 -2.2 0.0 -9.4 -8.4 -1.5 5.3 0.0 1.9 3.5 3.6 7.1 0.0 7.6 -16.5 -5.7 0.6 0.0 -9.5 1.9 0.0 7.8 0.0 

I want to read the file in order and write a new file with a sequence number prepended to each line. The output should look like this:

destination.txt

 1: 0.0 4.6 6.3 3.8 5.0 0.0 -3.8 -5.9 1.5 14.2 0.0 1.0 6.9 5.8 6.1 0.0 5.4 -7.1 0.9 6.8 0.0 -1.8 2.6 0.0 -11.5 -0.0 
 2: 0.0 13.4 -1.8 5.2 2.4 0.0 -7.1 -12.5 -2.8 11.8 0.0 2.0 5.5 3.5 8.2 0.0 9.2 -18.2 -3.4 1.7 0.0 -16.1 3.2 0.0 9.7 -0.1 
 3: 0.0 12.2 -2.0 7.2 0.1 0.0 -9.1 -11.8 -2.5 8.8 0.0 1.1 4.6 3.8 8.0 0.0 8.3 -18.5 -5.0 0.6 0.0 -14.3 2.8 0.0 10.6 -0.0 
 4: 0.0 10.6 -0.6 8.3 -2.2 0.0 -9.4 -8.4 -1.5 5.3 0.0 1.9 3.5 3.6 7.1 0.0 7.6 -16.5 -5.7 0.6 0.0 -9.5 1.9 0.0 7.8 0.0 

I can do this in Java as follows:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.atomic.AtomicLong;

    String filePath = "/filepath";
    Path source = Paths.get(filePath + "/source.txt");
    Path dest = Paths.get(filePath + "/dest.txt");
    AtomicLong seq = new AtomicLong(0);

    // try-with-resources flushes and closes both files, even if a write fails
    try (BufferedReader bufferedReader = Files.newBufferedReader(source);
         BufferedWriter bufferedWriter = Files.newBufferedWriter(dest)) {
        bufferedReader.lines().forEach(txt -> {
            try {
                bufferedWriter.append(seq.incrementAndGet() + ": " + txt);
                bufferedWriter.newLine();
            } catch (IOException e) {
                // a lambda cannot throw the checked IOException directly
                throw new UncheckedIOException(e);
            }
        });
    }

but I'm wondering whether this could be done with a distributed framework such as Spark, Storm, or Hadoop. I assume a big-data framework would make it faster.

There is an approach in Spark that may help:

  1. Create an RDD from the source text file
  2. Use a combination of zipWithIndex, sortBy, and map (see the sketch below)

Check https://stackoverflow.com/a/26081548/290036 for a zipWithIndex example.
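For illustration, here is a minimal sketch of that approach as a standalone Spark job in Java. The class name NumberLines and the output path /filepath/destination are assumptions, not from the question, and saveAsTextFile produces a directory of part-files rather than a single destination.txt:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NumberLines {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("NumberLines");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // one RDD element per line; partition order follows file order
                JavaRDD<String> lines = sc.textFile("/filepath/source.txt");

                // zipWithIndex assigns a 0-based index ordered by partition and
                // by position within each partition, which for a text file matches
                // the original line order, so no explicit sortBy is needed here
                JavaRDD<String> numbered = lines.zipWithIndex()
                        .map(pair -> (pair._2() + 1) + ": " + pair._1());

                // writes part-00000, part-00001, ... under /filepath/destination
                numbered.saveAsTextFile("/filepath/destination");
            }
        }
    }

If a single output file is required, the part-files can be concatenated afterwards (for example with hadoop fs -getmerge). Keep in mind that this job is pure sequential I/O, so the plain Java version may well be competitive unless the file lives on a distributed filesystem.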
