
How to create a dataframe using Spark Java

I need to create a data frame in my test. I tried the code below:

StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");

Dataset<Row> df = spark.createDataFrame(nums, structType);

The expected result is:

 +------+------+
 |A     |B     |
 +------+------+
 |value1|value2|
 +------+------+

But it is not accepted. How do I initialize a data frame/Dataset?

So this is the cleaner way of doing things.

Step 1: Create a bean class for your custom type. Make sure it has public getters, setters, and an all-args constructor, and that the class implements Serializable.

public class StringWrapper implements Serializable {
  private String key;
  private String value;

  public StringWrapper(String key, String value) {
    this.key = key;
    this.value = value;
  }

  public String getKey() {
    return key;
  }

  public void setKey(String key) {
    this.key = key;
  }

  public String getValue() {
    return value;
  }

  public void setValue(String value) {
    this.value = value;
  }
}

Step 2: Generate data

List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));

Step 3: Convert it to an RDD

JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);

Step 4: Convert it to a dataset

sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);

Step 5: See results

+------+------+
|key   |value |
+------+------+
|value1|value2|
+------+------+
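
For reference, the snippets above assume that a SparkSession and a JavaSparkContext already exist. Here is a minimal sketch of how the pieces could be wired together, assuming a local-mode session used only for testing (the class name BeanExample is made up for illustration; StringWrapper is the bean from Step 1):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class BeanExample {
  public static void main(String[] args) {
    // Assumption: local mode, only for a quick test
    SparkSession sparkSession = SparkSession.builder()
        .appName("bean-example")
        .master("local[*]")
        .getOrCreate();

    // Wrap the underlying SparkContext to get a JavaSparkContext
    JavaSparkContext javaSparkContext =
        new JavaSparkContext(sparkSession.sparkContext());

    // Step 2: generate data
    List<StringWrapper> nums = new ArrayList<>();
    nums.add(new StringWrapper("value1", "value2"));

    // Step 3: convert it to an RDD
    JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);

    // Step 4: convert it to a dataset and show it
    sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);

    sparkSession.stop();
  }
}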

For Spark 3.0 and before, SparkSession instances don't have a method to create a dataframe from a list of arbitrary objects and a StructType.

However, there is a method that can build a dataframe from a list of rows and a StructType. So to make your code work, you have to change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that using RowFactory:

// imports
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// +------+------+

If you want to add more rows to your dataframe, just add more Row objects to the list:

// code
...

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
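
As in the first answer, the code above assumes a SparkSession named spark already exists. A minimal self-contained sketch, assuming a local-mode session and a made-up class name RowFactoryExample:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowFactoryExample {
  public static void main(String[] args) {
    // Assumption: local mode, only for a quick test
    SparkSession spark = SparkSession.builder()
        .appName("rowfactory-example")
        .master("local[*]")
        .getOrCreate();

    // Schema: two non-nullable string columns A and B
    StructType structType = new StructType();
    structType = structType.add("A", DataTypes.StringType, false);
    structType = structType.add("B", DataTypes.StringType, false);

    // Rows built with RowFactory
    List<Row> nums = new ArrayList<>();
    nums.add(RowFactory.create("value1", "value2"));
    nums.add(RowFactory.create("value3", "value4"));

    Dataset<Row> df = spark.createDataFrame(nums, structType);
    df.show(false);

    spark.stop();
  }
}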
