Read a Big File (over 60 GB) and Write a New File

I have a file that is 60 GB in size and contains 200,000,000 rows. Its contents look like this:

source.txt

0.0 4.6 6.3 3.8 5.0 0.0 -3.8 -5.9 1.5 14.2 0.0 1.0 6.9 5.8 6.1 0.0 5.4 -7.1 0.9 6.8 0.0 -1.8 2.6 0.0 -11.5 -0.0 
0.0 13.4 -1.8 5.2 2.4 0.0 -7.1 -12.5 -2.8 11.8 0.0 2.0 5.5 3.5 8.2 0.0 9.2 -18.2 -3.4 1.7 0.0 -16.1 3.2 0.0 9.7 -0.1 
0.0 12.2 -2.0 7.2 0.1 0.0 -9.1 -11.8 -2.5 8.8 0.0 1.1 4.6 3.8 8.0 0.0 8.3 -18.5 -5.0 0.6 0.0 -14.3 2.8 0.0 10.6 -0.0 
0.0 10.6 -0.6 8.3 -2.2 0.0 -9.4 -8.4 -1.5 5.3 0.0 1.9 3.5 3.6 7.1 0.0 7.6 -16.5 -5.7 0.6 0.0 -9.5 1.9 0.0 7.8 0.0 

I want to read the file in order and write a new file with a sequence number prepended to each line. The output should look like this:

destination.txt

 1: 0.0 4.6 6.3 3.8 5.0 0.0 -3.8 -5.9 1.5 14.2 0.0 1.0 6.9 5.8 6.1 0.0 5.4 -7.1 0.9 6.8 0.0 -1.8 2.6 0.0 -11.5 -0.0 
 2: 0.0 13.4 -1.8 5.2 2.4 0.0 -7.1 -12.5 -2.8 11.8 0.0 2.0 5.5 3.5 8.2 0.0 9.2 -18.2 -3.4 1.7 0.0 -16.1 3.2 0.0 9.7 -0.1 
 3: 0.0 12.2 -2.0 7.2 0.1 0.0 -9.1 -11.8 -2.5 8.8 0.0 1.1 4.6 3.8 8.0 0.0 8.3 -18.5 -5.0 0.6 0.0 -14.3 2.8 0.0 10.6 -0.0 
 4: 0.0 10.6 -0.6 8.3 -2.2 0.0 -9.4 -8.4 -1.5 5.3 0.0 1.9 3.5 3.6 7.1 0.0 7.6 -16.5 -5.7 0.6 0.0 -9.5 1.9 0.0 7.8 0.0 

I can do this in Java as follows:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.atomic.AtomicLong;

    String filePath = "/filepath";
    Path source = Paths.get(filePath + "/source.txt");
    Path dest = Paths.get(filePath + "/dest.txt");
    AtomicLong seq = new AtomicLong(0);

    // try-with-resources flushes and closes both files, even if a write fails
    try (BufferedReader bufferedReader = Files.newBufferedReader(source);
         BufferedWriter bufferedWriter = Files.newBufferedWriter(dest)) {
        bufferedReader.lines().forEach(txt -> {
            try {
                bufferedWriter.append(seq.incrementAndGet() + ": " + txt);
                bufferedWriter.newLine();
            } catch (IOException e) {
                // a lambda cannot throw the checked IOException directly
                throw new UncheckedIOException(e);
            }
        });
    }

but I'm wondering whether this could be done with a distributed framework such as Spark, Storm, or Hadoop. I assume a big-data framework would make it faster.

There is an approach in Spark that may help:

  1. Create an RDD from the source text file
  2. Use a combination of zipWithIndex, sortBy, and map (see the sketch below)

Check https://stackoverflow.com/a/26081548/290036 for a zipWithIndex example.
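For illustration, here is a minimal sketch of that approach as a standalone Spark job in Java. The class name NumberLines and the output path /filepath/destination are assumptions, not from the question, and saveAsTextFile produces a directory of part-files rather than a single destination.txt:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NumberLines {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("NumberLines");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // one RDD element per line; partition order follows file order
                JavaRDD<String> lines = sc.textFile("/filepath/source.txt");

                // zipWithIndex assigns a 0-based index ordered by partition and
                // by position within each partition, which for a text file matches
                // the original line order, so no explicit sortBy is needed here
                JavaRDD<String> numbered = lines.zipWithIndex()
                        .map(pair -> (pair._2() + 1) + ": " + pair._1());

                // writes part-00000, part-00001, ... under /filepath/destination
                numbered.saveAsTextFile("/filepath/destination");
            }
        }
    }

If a single output file is required, the part-files can be concatenated afterwards (for example with hadoop fs -getmerge). Keep in mind that this job is pure sequential I/O, so the plain Java version may well be competitive unless the file lives on a distributed filesystem.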
