
Convert a List of Map in Java to Dataset in spark

I have a list of Map in Java, essentially representing rows.

List<Map<String, Object>> dataList = new ArrayList<>();
Map<String, Object> row1 = new HashMap<>();
row1.put("fund", "f1");
row1.put("broker", "b1");
row1.put("qty", 100);

Map<String, Object> row2 = new HashMap<>();
row2.put("fund", "f2");
row2.put("broker", "b2");
row2.put("qty", 200);

dataList.add(row1);
dataList.add(row2);

I'm trying to create a Spark DataFrame from it.

I've tried converting it into a JavaRDD<Map<String, Object>> using

JavaRDD<Map<String,Object>> rows = sc.parallelize(dataList);

But I'm not sure how to go from here to Dataset<Row>. I've seen Scala examples, but none in Java.

I also tried converting the list to a JSON string and reading the JSON string.

String jsonStr = mapper.writeValueAsString(dataList);

But it seems like I would have to write it to a file and then read it using

Dataset<Row> df = spark.read().json(pathToFile);

I would prefer to do it in memory if possible, rather than writing to a file and reading from there.

Here is what I have so far:

SparkConf sparkConf = new SparkConf().setAppName("SparkTest").setMaster("local[*]")
        .set("spark.sql.shuffle.partitions", "1");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();

List<Map<String, Object>> dataList = new ArrayList<>();
Map<String, Object> row1 = new HashMap<>();
row1.put("fund", "f1");
row1.put("broker", "b1");
row1.put("qty", 100);

Map<String, Object> row2 = new HashMap<>();
row2.put("fund", "f2");
row2.put("broker", "b2");
row2.put("qty", 200);

dataList.add(row1);
dataList.add(row2);

ObjectMapper mapper = new ObjectMapper();
    
String jsonStr = mapper.writeValueAsString(dataList);
JavaRDD<Map<String,Object>> rows = sc.parallelize(dataList);
Dataset<Row> data = sparkSession.createDataFrame(rows, Map.class);
data.show();

You do not need to use RDDs at all. What you need to do is extract the desired schema from your list of maps, transform your list of maps into a list of rows, and then use spark.createDataFrame.

In Java, that's a bit painful, particularly when creating the Row objects, but here is how it could go:

// Use the keys of the first map as the column names.
List<String> cols = new ArrayList<>(dataList.get(0).keySet());

// Turn each Map into a Row: collect the values in column order,
// convert the Java list to a Scala Seq and build the Row from it.
List<Row> rows = dataList
    .stream()
    .map(row -> cols.stream().map(c -> (Object) row.get(c).toString()))
    .map(row -> row.collect(Collectors.toList()))
    .map(row -> JavaConverters.asScalaBufferConverter(row).asScala().toSeq())
    .map(Row$.MODULE$::fromSeq)
    .collect(Collectors.toList());

// Build a schema with one nullable string column per key.
StructType schema = new StructType(
    cols.stream()
        .map(c -> new StructField(c, DataTypes.StringType, true, new Metadata()))
        .collect(Collectors.toList())
        .toArray(new StructField[0])
);
Dataset<Row> result = spark.createDataFrame(rows, schema);
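
One caveat of this approach is that every value is stringified. A small follow-up sketch (not part of the original answer; "qty" is the column name from the question) casts the numeric column back afterwards:

// Optional follow-up: cast the stringified qty column back to an integer.
Dataset<Row> typed = result.withColumn("qty",
        org.apache.spark.sql.functions.col("qty").cast(DataTypes.IntegerType));
typed.printSchema();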
Another way is to define a simple bean class and let Spark derive the schema from it:

public class MyRow implements Serializable {

  private String fund;
  private String broker;
  private int qty;

  public MyRow(String fund, String broker, int qty) {
    super();
    this.fund = fund;
    this.broker = broker;
    this.qty = qty;
  }

  public String getFund() {
    return fund;
  }

  public void setFund(String fund) {
    this.fund = fund;
  }


  public String getBroker() {
    return broker;
  }

  public void setBroker(String broker) {
    this.broker = broker;
  }

  public int getQty() {
    return qty;
  }

  public void setQty(int qty) {
    this.qty = qty;
  }

}

Now create an ArrayList. Each item in this list will act as a row in the final DataFrame.

MyRow r1 = new MyRow("f1", "b1", 100);
MyRow r2 = new MyRow("f2", "b2", 200);
List<MyRow> dataList = new ArrayList<>();
dataList.add(r1);
dataList.add(r2);

Now we have to convert this List into a Dataset<Row>:

Dataset<Row> ds = spark.createDataFrame(dataList, MyRow.class);
ds.show();
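
If a typed Dataset is preferred over Dataset<Row>, the same bean also works with a bean encoder. A minimal sketch, assuming the MyRow class and dataList above:

// Typed alternative: encode the List<MyRow> directly with a bean encoder.
Dataset<MyRow> typedDs = spark.createDataset(dataList, Encoders.bean(MyRow.class));
typedDs.show();
// And back to an untyped Dataset<Row> whenever needed:
Dataset<Row> asRows = typedDs.toDF();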

The Spark documentation already points out how to load an in-memory JSON string.

Here is the example from https://spark.apache.org/docs/latest/sql-data-sources-json.html:

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
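
Applied to the dataList from the question, the same idea works without touching the file system. A hedged sketch (mapper is the Jackson ObjectMapper already used in the question; writeValueAsString throws a checked JsonProcessingException that needs to be declared or handled):

// Serialize each row map as its own JSON object string, then read them in memory.
List<String> jsonRows = new ArrayList<>();
for (Map<String, Object> row : dataList) {
    jsonRows.add(mapper.writeValueAsString(row));
}
Dataset<Row> df = spark.read().json(spark.createDataset(jsonRows, Encoders.STRING()));
df.show();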

Hope this helps.

import org.apache.spark.api.java.function.Function;

// `list` (the List<Map<String, Object>> from the question), `sc` and `spark`
// are assumed to be fields initialised elsewhere in the class.
private static JavaRDD<Map<String, Object>> rows;

// Sort each map by key so the value order matches the field order built below.
private static final Function<Map<String, Object>, Row> f =
        strObjMap -> RowFactory.create(new TreeMap<String, Object>(strObjMap).values().toArray(new Object[0]));

public void test() {
    rows = sc.parallelize(list);
    JavaRDD<Row> rowRDD = rows.map(f);

    // Derive the schema from the first map, iterating its keys in sorted order.
    Map<String, Object> headMap = list.get(0);
    TreeMap<String, Object> headerMap = new TreeMap<>(headMap);
    List<StructField> fields = new ArrayList<>();
    StructField field;
    for (String key : headerMap.keySet()) {
        System.out.println("key:::" + key);
        Object value = list.get(0).get(key);
        if (value instanceof Integer) {
            field = DataTypes.createStructField(key, DataTypes.IntegerType, true);
        } else if (value instanceof Double) {
            field = DataTypes.createStructField(key, DataTypes.DoubleType, true);
        } else if (value instanceof Date || value instanceof java.util.Date) {
            field = DataTypes.createStructField(key, DataTypes.DateType, true);
        } else {
            field = DataTypes.createStructField(key, DataTypes.StringType, true);
        }
        fields.add(field);
    }
    StructType struct = DataTypes.createStructType(fields);
    Dataset<Row> data = this.spark.createDataFrame(rowRDD, struct);
}
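
To inspect the result, a quick check can be added at the end of test(), right after createDataFrame (a small addition, not part of the original answer):

data.printSchema();   // with the question's dataList, qty should be inferred as an integer,
                      // fund and broker as strings, columns appearing in sorted key order
data.show();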
