
How can I convert a list of maps List<Map<String, Object>> myList to a Spark DataFrame in Java?

I have a list of maps like this:

List<Map<String, Object>> myList = new ArrayList<>();

Map<String, Object> mp1 = new HashMap<>();
mp1.put("id", 1);
mp1.put("name", "John");
myList.add(mp1);

Map<String, Object> mp2 = new HashMap<>();
mp2.put("id", 2);
mp2.put("name", "Carte");
myList.add(mp2);

The key-value pairs we put into each map are not fixed; any set of keys can appear (a dynamic schema).
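Because the keys vary per map, the first step is always to derive the column list from the data itself, i.e. take the union of all keys. A minimal plain-Java sketch (no Spark needed; the class name ColumnUnion is illustrative):

```java
import java.util.*;

public class ColumnUnion {
    // union of keys across all maps, in first-seen order
    static List<String> columnsOf(List<Map<String, Object>> data) {
        Set<String> cols = new LinkedHashSet<>();
        for (Map<String, Object> m : data) {
            cols.addAll(m.keySet());
        }
        return new ArrayList<>(cols);
    }

    public static void main(String[] args) {
        Map<String, Object> mp1 = new HashMap<>();
        mp1.put("id", 1);
        mp1.put("name", "John");

        Map<String, Object> mp2 = new HashMap<>();
        mp2.put("id", 2);
        mp2.put("city", "Paris"); // a key mp1 does not have

        System.out.println(columnsOf(Arrays.asList(mp1, mp2)));
    }
}
```

A LinkedHashSet keeps a stable, first-seen column order; a plain HashSet would also work but the column order of the resulting DataFrame would then depend on hash order.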

I want to convert it into a Spark DataFrame (Dataset<Row>):

+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Carte|
+---+-----+

How can this be achieved?

Note: as I said, the key-value pairs are dynamic, so I cannot create a Java bean in advance and use the syntax below:

Dataset<Row> ds = spark.createDataFrame(myList, MyClass.class);

You can build the rows and the schema from the list of maps, then use the spark.createDataFrame(List<Row> rows, StructType schema) overload to build your DataFrame:

import java.util.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

...

public static Dataset<Row> buildDataframe(List<Map<String, Object>> listOfMaps, SparkSession spark) {
  // collect the union of all keys; LinkedHashSet keeps a stable column order
  Set<String> columnSet = new LinkedHashSet<>();
  for (Map<String, Object> elem : listOfMaps) {
    columnSet.addAll(elem.keySet());
  }
  List<String> columns = new ArrayList<>(columnSet);

  // build rows: a map missing a key yields null for that column
  List<Row> rows = new ArrayList<>();
  for (Map<String, Object> elem : listOfMaps) {
    List<Object> row = new ArrayList<>();
    for (String key : columns) {
      row.add(elem.get(key));
    }
    rows.add(RowFactory.create(row.toArray()));
  }

  // build schema: infer each column's type from its first non-null value
  List<StructField> fields = new ArrayList<>();
  for (String column : columns) {
    fields.add(new StructField(column, getDataType(column, listOfMaps), true, Metadata.empty()));
  }
  StructType schema = new StructType(fields.toArray(new StructField[0]));

  // build the DataFrame from rows and schema
  return spark.createDataFrame(rows, schema);
}

// infer a column's Spark type from its first non-null value;
// a column that is null in every row falls back to NullType
public static DataType getDataType(String column, List<Map<String, Object>> data) {
  for (Map<String, Object> elem : data) {
    if (elem.get(column) != null) {
      return getDataType(elem.get(column));
    }
  }
  return DataTypes.NullType;
}

public static DataType getDataType(Object value) {
  if (value instanceof Integer) {
    return DataTypes.IntegerType;
  } else if (value instanceof Long) {
    return DataTypes.LongType;
  } else if (value instanceof Double) {
    return DataTypes.DoubleType;
  } else if (value instanceof Boolean) {
    return DataTypes.BooleanType;
  } else if (value instanceof String) {
    return DataTypes.StringType;
    // TODO add the remaining Spark types (Timestamp, Decimal, etc.)
  } else {
    throw new IllegalArgumentException("unknown type for value " + value);
  }
}
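For the two-row example from the question, the end result of this approach looks like the sketch below. The schema and rows are hard-coded here only so the snippet is self-contained; in practice buildDataframe derives both from the maps. The class name Demo and the local[*] master are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

public class Demo {
    public static void main(String[] args) {
        // local SparkSession for testing
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("maps-to-df")
            .getOrCreate();

        // what buildDataframe infers for the example maps
        StructType schema = new StructType(new StructField[] {
            new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("name", DataTypes.StringType, true, Metadata.empty())
        });

        List<Row> rows = Arrays.asList(
            RowFactory.create(1, "John"),
            RowFactory.create(2, "Carte"));

        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        df.printSchema();
        spark.stop();
    }
}
```

This requires the spark-sql dependency on the classpath; df.show() prints the two-row table from the question.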
