How to create a dataframe using Spark Java
I need to create a data frame in my test. I tried the code below:
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");
Dataset<Row> df = spark.createDataFrame(nums, structType);
The expected result is:
+------+------+
|A |B |
+------+------+
|value1|value2|
+------+------+
But it is not accepted. How do I initiate a data frame/Dataset?
This is the cleaner way of doing things.
Step 1: Create a bean class for your custom class. Make sure it has public getters, setters, and an all-args constructor, and that the class implements Serializable:
public class StringWrapper implements Serializable {
private String key;
private String value;
public StringWrapper(String key, String value) {
this.key = key;
this.value = value;
}
public String getKey() {
return key;
}
public void setKey(String key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}
Step 2: Generate data
List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));
Step 3: Convert it to an RDD
JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);
Step 4: Convert it to a dataset
sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);
Step 5: See results
+------+------+
|key |value |
+------+------+
|value1|value2|
+------+------+
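Putting the five steps above together, a minimal self-contained sketch might look like the following. The app name and `local[*]` master are placeholder choices for a local test, and note that `SparkSession.createDataFrame` can also take the bean list directly, so the explicit `JavaRDD` step is optional:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BeanDataFrameExample {

    // Step 1: bean class with public getters/setters, an all-args
    // constructor, and Serializable
    public static class StringWrapper implements Serializable {
        private String key;
        private String value;

        public StringWrapper(String key, String value) {
            this.key = key;
            this.value = value;
        }

        public String getKey() { return key; }
        public void setKey(String key) { this.key = key; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    public static Dataset<Row> buildDataFrame(SparkSession spark) {
        // Step 2: generate data
        List<StringWrapper> nums = new ArrayList<>();
        nums.add(new StringWrapper("value1", "value2"));
        // Steps 3-4 combined: the schema is inferred from the bean class,
        // so the list can be passed without building a JavaRDD first
        return spark.createDataFrame(nums, StringWrapper.class);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bean-dataframe-example") // placeholder name
                .master("local[*]")                // local testing only
                .getOrCreate();
        try {
            // Step 5: see results
            buildDataFrame(spark).show(false);
        } finally {
            spark.stop();
        }
    }
}
```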
For Spark 3.0 and before, SparkSession instances don't have a method to create a dataframe from a list of arbitrary objects and a StructType. However, there is a method that can build a dataframe from a list of rows and a StructType. So to make your code work, you have to change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that using RowFactory:
// imports
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// +------+------+
If you want to add more rows to your dataframe, just add other rows:
// code
...
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
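Since the question also asks about Datasets: if you want a typed Dataset rather than a Dataset&lt;Row&gt;, one alternative sketch (not from either answer above) is SparkSession.createDataset with a bean encoder. Note that Encoders.bean needs a public no-arg constructor on the bean, which the StringWrapper in the first answer does not have, so one is added here:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetExample {

    public static class StringWrapper implements Serializable {
        private String key;
        private String value;

        // no-arg constructor required by Encoders.bean for deserialization
        public StringWrapper() {}

        public StringWrapper(String key, String value) {
            this.key = key;
            this.value = value;
        }

        public String getKey() { return key; }
        public void setKey(String key) { this.key = key; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    public static Dataset<StringWrapper> buildDataset(SparkSession spark) {
        // createDataset keeps the element type, giving Dataset<StringWrapper>
        return spark.createDataset(
                Arrays.asList(new StringWrapper("value1", "value2")),
                Encoders.bean(StringWrapper.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-dataset-example") // placeholder name
                .master("local[*]")               // local testing only
                .getOrCreate();
        try {
            Dataset<StringWrapper> ds = buildDataset(spark);
            ds.show(false);        // same key/value table as before
            ds.toDF().show(false); // convert to Dataset<Row> if needed
        } finally {
            spark.stop();
        }
    }
}
```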