[英]Kafka topic object into spark data frame conversion and writing into HDFS
[英]spark psv file to data frame conversion error
我正在使用的Spark版本是2.0+,我要做的只是将管道(|)分隔的值文件读入Dataframe,然后像查询一样运行SQL。 我也尝试了逗号分隔文件。 我正在使用spark-shell与spark进行交互,我已经下载了spark-csv jar并使用--packages选项运行spark-shell将其导入到会话中。 已成功导入。
import spark.implicits._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
val session =
SparkSession.builder().appName("test").master("local").getOrCreate()
val df = session.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("testdata.txt");
WARN Hive: Failed to access metastore. This class should not accessed in runtime.
apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hi
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
您可以将psv文件直接加载到RDD中,然后根据需要将其拆分,然后可以在其上应用架构。 这是java示例。
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
public class RDDtoDF_Update {
public static void main(final String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("RDDtoDF_Updated")
.master("local[2]")
.config("spark.some.config.option", "some-value")
.getOrCreate();
StructType schema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("eid", DataTypes.IntegerType, false),
DataTypes.createStructField("eName", DataTypes.StringType, false),
DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
DataTypes.createStructField("eGen", DataTypes.StringType,true)});
String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
.textFile(filepath)
.javaRDD()
.map(line -> line.split("\t"))
.map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(),Integer.parseInt(r[2]),
Integer.parseInt(r[3]),Integer.parseInt(r[4]),r[5].trim() ));
Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("edept").max("esal").show();
谢谢。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.