Reading a large text file line by line with Java Spark

Question

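I am trying to read a large text file (2 to 3 GB). I need to read the file line by line and convert each line into a JSON object. I have tried reading it with .collect() and .toLocalIterator(). collect() works for small files but not for large ones. As I understand it, .toLocalIterator() pulls the data scattered across the cluster onto a single node. According to the documentation, toLocalIterator() is ineffective for large RDDs because it runs into memory problems. Is there an efficient way to read a large text file on a multi-node cluster?

Below is the method containing my various attempts at reading the file and converting each line to JSON.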

public static void jsonConversion() {
    JavaRDD<String> lines = sc.textFile(path);
    String newrows = lines.first(); //<--- This reads the first line of the text file


    // Reading through with
    // tolocaliterator--------------------------------------------
     Iterator<String> newstuff = lines.toLocalIterator();
     System.out.println("line 1 " + newstuff.next());
     System.out.println("line 2 " + newstuff.next());

    // Inserting lines in a list.
    // Note: .collect() is appropriate for small files
    // only.-------------------------
    List<String> rows = lines.collect();

    // Sets loop limit based on the number on lines in text file.
    int count = (int) lines.count();
    System.out.println("Number of lines are " + count);

    // Using google's library to create a Json builder.
    GsonBuilder gsonBuilder = new GsonBuilder();
    Gson gson = new GsonBuilder().setLenient().create();

    // Created an array list to insert json objects.
    ArrayList<String> jsonList = new ArrayList<>();

    // Converting each line of the text file into a Json formatted string and
    // inserting into the array list 'jsonList'
    for (int i = 0; i <= count - 1; i++) {
        String JSONObject = gson.toJson(rows.get(i));
        Gson prettyGson = new GsonBuilder().setPrettyPrinting().create();
        String prettyJson = prettyGson.toJson(rows.get(i));
        jsonList.add(prettyJson);
    }

    // For printing out the all the json objects
    int lineNumber = 1;
    for (int i = 0; i <= count - 1; i++) {
        System.out.println("line " + lineNumber + "-->" + jsonList.get(i));
        lineNumber++;
    }

}

Below is the list of libraries I am using:

//Spark Libraries
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

//Java Libraries
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;

//Json Builder Libraries
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

Answer 1

You could try using the map function on the RDD instead of collecting all the results.

JavaRDD<String> lines = sc.textFile(path);
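JavaRDD<String> jsonList = lines.map(line -> <<all your json transformations>>);

This way, the transformation of your data is distributed across the cluster. See the Spark documentation for more about the map function.

Converting the data to a list or an array forces it to be collected on a single node. If you want the computation to be distributed in Spark, you need to keep working with an RDD, a DataFrame, or a Dataset.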
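As an illustration of what could stand in for the placeholder inside map, here is a minimal sketch of a per-line conversion written as an ordinary static method. It is plain Java with no Spark or Gson dependency. The class name LineToJson and the "line" field name are hypothetical choices made for this example, not something from the question.

```java
class LineToJson {

    // Escapes a raw text line so it can be embedded in a JSON string literal.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Hypothetical mapping: wraps one line in a JSON object with a single "line" field.
    public static String toJson(String line) {
        return "{\"line\":\"" + escape(line) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("hello \"spark\""));
    }
}
```

In a Spark job, a method like this could be applied with lines.map(LineToJson::toJson) and the result written out with saveAsTextFile(outputDir), where outputDir is a hypothetical output directory. Each executor then converts and writes its own partitions, so no list of all lines is ever built on the driver.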

Answer 2

JavaRDD<String> lines = sc.textFile(path);

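// split returns String[], so the resulting RDD holds arrays of fields
JavaRDD<String[]> jsonList = lines.map(line -> line.split("/"));

Alternatively, you can write a block of your own logic inside map:

JavaRDD<String> jsonList = lines.map(line -> {
    // the replace arguments are placeholders; substitute your own transformation
    String newline = line.replace("", "");
    return newline;
});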

// Convert the JavaRDD to a DataFrame

See: Converting a JavaRDD to a DataFrame in Spark Java

// In the Java API, write() is a method call, not a field
dfTobeSaved.write().format("json").save("/root/data.json");
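If the per-line conversion needs a helper that is costly to create (the question builds a new Gson inside its loop), one common option is mapPartitions. In the Spark 2.x and later Java API, it passes the function an Iterator over the lines of one partition and expects an Iterator of results back. The sketch below shows that iterator-to-iterator shape in plain Java, transforming each element only when it is requested. The class name LazyLines and the method name mapLazily are hypothetical names for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

class LazyLines {

    // Wraps an iterator so each element is transformed only when next() is called.
    // No intermediate list is built, so memory use does not grow with the input.
    public static <T, R> Iterator<R> mapLazily(Iterator<T> source, Function<T, R> fn) {
        return new Iterator<R>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public R next() {
                return fn.apply(source.next());
            }
        };
    }

    public static void main(String[] args) {
        // Stand-in for the lines of one partition.
        Iterator<String> lines = Arrays.asList("a/b", "c/d/e").iterator();

        // Count the "/"-separated fields of each line, one line at a time.
        Iterator<Integer> fieldCounts = mapLazily(lines, line -> line.split("/").length);
        while (fieldCounts.hasNext()) {
            System.out.println(fieldCounts.next());
        }
    }
}
```

Inside mapPartitions, one could create a single Gson instance at the top of the function and return mapLazily(it, line -> gson.toJson(line)), so the helper is built once per partition rather than once per line. This is an untested assumption about how the pieces would fit together, not code from either answer.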
