
Spark PSV file to data frame conversion error

The Spark version I am using is 2.0+. All I am trying to do is read a pipe (|) separated values file into a DataFrame and then run SQL-like queries on it. I have tried a comma-delimited file as well. I am interacting with Spark through the spark-shell. I downloaded the spark-csv jar and ran spark-shell with the --packages option to import it into my session, and it was imported successfully.
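
For reference, a typical invocation looks like the following (the exact package coordinates are an assumption; spark-csv_2.11:1.5.0 was the commonly used release for Scala 2.11):

spark-shell --packages com.databricks:spark-csv_2.11:1.5.0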

import spark.implicits._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._

val session = SparkSession.builder().appName("test").master("local").getOrCreate()
val df = session.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("testdata.txt")

WARN Hive: Failed to access metastore. This class should not accessed in runtime.
apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hi
 at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
 at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
 at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
 at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:171)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
 at java.lang.reflect.Constructor.newInstance(Unknown Source)
 at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
 at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
 at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
 at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
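
Note that since Spark 2.0 the csv data source is built into Spark itself, so the external spark-csv package is not required; also, the reader's delimiter defaults to a comma, so it must be set explicitly for a pipe-separated file. A minimal sketch against the session created above (the view name "testdata" is arbitrary):

// Spark 2.0+ built-in csv source; the delimiter defaults to ","
// and must be set explicitly for pipe-separated data.
val df = session.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "DROPMALFORMED")
  .csv("testdata.txt")

// Register the DataFrame as a temporary view so SQL can be run against it.
df.createOrReplaceTempView("testdata")
session.sql("SELECT * FROM testdata").show()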

You can load the PSV file directly into an RDD, split each line as required, and then apply a schema to it. Here is a Java example.

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class RDDtoDF_Update {
    public static void main(final String[] args) throws Exception {

        SparkSession spark = SparkSession
                .builder()
                .appName("RDDtoDF_Updated")
                .master("local[2]")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        StructType schema = DataTypes
                .createStructType(new StructField[] {
                        DataTypes.createStructField("eid", DataTypes.IntegerType, false),
                        DataTypes.createStructField("eName", DataTypes.StringType, false),
                        DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
                        DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
                        DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
                        DataTypes.createStructField("eGen", DataTypes.StringType,true)});


        String filepath = "F:/Hadoop/Data/EMPData.txt";
        // Read the file as plain text, split each line on the pipe delimiter
        // (escaped, since String.split takes a regex), and build Rows that
        // match the schema above.
        JavaRDD<Row> empRDD = spark.read()
                .textFile(filepath)
                .javaRDD()
                .map(line -> line.split("\\|"))
                .map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(), Integer.parseInt(r[2]),
                        Integer.parseInt(r[3]), Integer.parseInt(r[4]), r[5].trim()));


        // Apply the schema to the RDD of Rows and run a sample aggregation.
        Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
        empDF.groupBy("eDept").max("eSal").show();

        spark.stop();
    }
}
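
Alternatively, the same schema can be applied without going through an RDD at all, by passing it to the built-in csv reader. A sketch in Scala, assuming the same pipe-delimited file and column layout as the Java example above:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("eid", IntegerType, nullable = false),
  StructField("eName", StringType, nullable = false),
  StructField("eAge", IntegerType, nullable = true),
  StructField("eDept", IntegerType, nullable = true),
  StructField("eSal", IntegerType, nullable = true),
  StructField("eGen", StringType, nullable = true)))

// spark here is the SparkSession (e.g. the one the spark-shell provides).
val empDF = spark.read
  .schema(schema)
  .option("delimiter", "|")
  .csv("F:/Hadoop/Data/EMPData.txt")

empDF.groupBy("eDept").max("eSal").show()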

Thanks.
