Apache Spark, createDataFrame example in Java using List<?> as first argument

Can someone give an example of a Java implementation of the public DataFrame createDataFrame(java.util.List<?> data, java.lang.Class<?> beanClass) function, as mentioned in the Spark JavaDoc?

I have a list of JSON strings that I am passing as the first argument, so I am passing String.class as the second argument, but it gives an error:

java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be cast to org.apache.spark.sql.types.StructType

I am not sure why, hence I am looking for an example.
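
For reference, here is a minimal sketch of the kind of call that produces this error; the JSON strings and the sqlContext variable are assumptions for illustration, not code from the question:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.DataFrame;

List<String> data = Arrays.asList(
    "{\"k\": \"k0\", \"something\": \"sth0\"}",
    "{\"k\": \"k1\", \"something\": \"sth1\"}");

// String.class is not a JavaBean with getters/setters, so Spark infers
// StringType for it rather than a StructType schema, and the call fails.
DataFrame df = sqlContext.createDataFrame(data, String.class);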

The problem is your use of the bean class.

From the JavaBeans Wikipedia article:

JavaBeans are classes that encapsulate many objects into a single object (the bean). They are serializable, have a zero-argument constructor, and allow access to properties using getter and setter methods. The name "Bean" was given to encompass this standard, which aims to create reusable software components for Java.

To make this clearer, here is an example of using a Java bean in Spark:

Suppose we use this Bean class:

import java.io.Serializable;

public class Bean implements Serializable {
    private static final long serialVersionUID = 1L;

    private String k;
    private String something;

    public String getK() {return k;}
    public String getSomething() {return something;}

    public void setK(String k) {this.k = k;}
    public void setSomething(String something) {this.something = something;}
}

And we create b0 and b1, two instances of Bean:

Bean b0 = new Bean();
b0.setK("k0");
b0.setSomething("sth0");
Bean b1 = new Bean();
b1.setK("k1");
b1.setSomething("sth1");

Then we add the beans (b0 and b1 here) into a List<Bean> called data:

List<Bean> data = new ArrayList<Bean>();
data.add(b0);
data.add(b1);

Now we can create a DataFrame using the List<Bean> and the Bean class:

DataFrame df = sqlContext.createDataFrame(data, Bean.class);

If we call df.show(), here is the output:

+---+---------+
|  k|something|
+---+---------+
| k0|     sth0|
| k1|     sth1|
+---+---------+
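
For completeness, here is one way the sqlContext used above could be set up in a Spark 1.x application; the application name and master URL are placeholder assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

// Placeholder configuration for a local Spark 1.x run.
SparkConf conf = new SparkConf().setAppName("bean-example").setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(jsc);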

THE BETTER WAY TO CREATE A DATAFRAME FROM JSON STRINGS

In Spark, you can create a DataFrame directly from a List of JSON strings:

DataFrame df = sqlContext.read().json(jsc.parallelize(data));

where jsc is an instance of JavaSparkContext.
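
A slightly fuller sketch of this approach, assuming data is the List<String> of JSON documents from the question and jsc and sqlContext are set up as in the sketch above:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;

JavaRDD<String> jsonRdd = jsc.parallelize(data);
DataFrame df = sqlContext.read().json(jsonRdd);

// Spark infers the schema from the JSON fields, e.g. k and something here.
df.printSchema();
df.show();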

I also invite you to look at the Spark source code, which has many examples, especially in the unit tests; there you can find all the references to createDataFrame in Java.
