How can I convert a list of maps List<Map<String, Object>> myList to a Spark Dataframe in Java?
I have a list of maps like this:
List<Map<String, Object>> myList = new ArrayList<>();

Map<String, Object> mp1 = new HashMap<>();
mp1.put("id", 1);
mp1.put("name", "John");
myList.add(mp1);

Map<String, Object> mp2 = new HashMap<>();
mp2.put("id", 2);
mp2.put("name", "Carte");
myList.add(mp2);
The key-value pairs we put into the maps are not fixed; they can be any dynamic key-value pairs (a dynamic schema).
I want to convert it into a Spark dataframe (Dataset<Row>):
+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Carte|
+---+-----+
How can this be achieved?
Note: As I said, the key-value pairs are dynamic, so I cannot create a Java bean in advance and use the syntax below.
Dataset<Row> ds = spark.createDataFrame(myList, MyClass.class);
You can build rows and a schema from the list of maps, then use spark.createDataFrame(rows: java.util.List[Row], schema: StructType) to build your dataframe:
import java.util.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
...
public static Dataset<Row> buildDataframe(List<Map<String, Object>> listOfMaps, SparkSession spark) {
    // extract the column name list (union of all keys, in insertion order)
    Set<String> columnSet = new LinkedHashSet<>();
    for (Map<String, Object> elem : listOfMaps) {
        columnSet.addAll(elem.keySet());
    }
    List<String> columns = new ArrayList<>(columnSet);

    // build rows, aligning each map's values to the column order
    // (keys missing from a map become null)
    List<Row> rows = new ArrayList<>();
    for (Map<String, Object> elem : listOfMaps) {
        List<Object> row = new ArrayList<>();
        for (String key : columns) {
            row.add(elem.get(key));
        }
        rows.add(RowFactory.create(row.toArray()));
    }

    // build the schema, inferring each column's type from its first non-null value
    List<StructField> fields = new ArrayList<>();
    for (String column : columns) {
        fields.add(new StructField(column, getDataType(column, listOfMaps), true, Metadata.empty()));
    }
    StructType schema = new StructType(fields.toArray(new StructField[0]));

    // build the dataframe from rows and schema
    return spark.createDataFrame(rows, schema);
}
public static DataType getDataType(String column, List<Map<String, Object>> data) {
    for (Map<String, Object> elem : data) {
        if (elem.get(column) != null) {
            return getDataType(elem.get(column));
        }
    }
    return DataTypes.NullType;
}
public static DataType getDataType(Object value) {
    if (value.getClass() == Integer.class) {
        return DataTypes.IntegerType;
    } else if (value.getClass() == String.class) {
        return DataTypes.StringType;
    // TODO add all other spark types (Long, Timestamp, etc...)
    } else {
        throw new IllegalArgumentException("unknown type for value " + value);
    }
}
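Since the schema is dynamic, the heart of this approach is actually independent of Spark: take the union of all keys as the column list, then align each map's values to that order, filling missing keys with null. A minimal Spark-free sketch of just that step (the class and method names here are illustrative, not part of any Spark API):

```java
import java.util.*;

public class DynamicColumns {

    // Union of keys across all maps, preserving first-seen order.
    static List<String> columns(List<Map<String, Object>> maps) {
        Set<String> cols = new LinkedHashSet<>();
        for (Map<String, Object> m : maps) {
            cols.addAll(m.keySet());
        }
        return new ArrayList<>(cols);
    }

    // Align one map's values to the column order; missing keys yield null.
    static Object[] rowValues(Map<String, Object> m, List<String> cols) {
        Object[] vals = new Object[cols.size()];
        for (int i = 0; i < cols.size(); i++) {
            vals[i] = m.get(cols.get(i));
        }
        return vals;
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("id", 1);
        a.put("name", "John");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("id", 2);
        b.put("age", 30); // different keys: a dynamic schema
        List<Map<String, Object>> data = Arrays.asList(a, b);

        List<String> cols = columns(data);
        System.out.println(cols);                                // [id, name, age]
        System.out.println(Arrays.toString(rowValues(b, cols))); // [2, null, 30]
    }
}
```

These aligned `Object[]` arrays are exactly what gets wrapped in a Row per map above, so the dataframe shows null in columns a given map does not contain.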