
How to split a field in an RDD of Row type in Spark using Scala

I have a JSON file on HDFS, which I read like this:

var data = sqlContext.read.json("/.....")

This is its schema:

 |-- @timestamp: string (nullable = true)
 |-- beat: struct (nullable = true)
 |    |-- hostname: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- env: string (nullable = true)
 |    |-- env2: string (nullable = true)
 |    |-- env3: struct (nullable = true)
 |    |    |-- format: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- version: double (nullable = true)
 |-- input_type: string (nullable = true)
 |-- text: string (nullable = true)

I want to split the field text. I tried this:

var myRDD = data.select("text").rdd

var split_myRDD = myRDD.map(ligne => ligne.split("|"))

It does not work. I get this error: value split is not a member of org.apache.spark.sql.Row

Can someone tell me what is wrong?

You don't need to convert to an RDD for that. You can use the split function directly on the DataFrame. The code will look like this (df is your DataFrame, called data in the question):

import org.apache.spark.sql.functions.{col, split}

df.select("text")
  .withColumn("text_split", split(col("text"), "\\|"))

You can also use an RDD if there is some special need. In that case, pass "\\|" to String.split as well: split takes a regular expression, so the pipe has to be escaped. I hope it helps.
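To see why the escaping matters, here is a minimal plain-Scala sketch with no Spark involved. The object name PipeSplitDemo and the sample strings are made up for illustration; the calls themselves are the standard String.split, java.util.regex.Pattern.quote, and Scala's Char overload of split.

```scala
import java.util.regex.Pattern

object PipeSplitDemo {
  // Hypothetical sample line, standing in for one value of the text field.
  val line = "a|b|c"

  // Unescaped: "|" is regex alternation between two empty patterns,
  // so it matches the empty string and the line is cut between characters.
  val naive: Array[String] = line.split("|")

  // Escaped: the regex \| matches a literal pipe.
  val escaped: Array[String] = line.split("\\|")

  // Pattern.quote builds a literal-matching regex for you.
  val quoted: Array[String] = line.split(Pattern.quote("|"))

  // Scala's Char overload treats the separator literally, no regex involved.
  val byChar: Array[String] = line.split('|')

  // A negative limit keeps trailing empty fields, which split drops by default.
  val keepEmpty: Array[String] = "a|b||".split("\\|", -1)

  def main(args: Array[String]): Unit = {
    println(naive.mkString("[", ", ", "]"))     // more than three pieces
    println(escaped.mkString("[", ", ", "]"))   // [a, b, c]
    println(keepEmpty.mkString("[", ", ", "]")) // [a, b, , ]
  }
}
```

If the separator is a single character, the Char overload is probably the least error-prone choice in RDD code, since there is nothing to escape.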

When you read your JSON, the resulting object is a DataFrame. When you convert a DataFrame to an RDD, you get an RDD[Row]. The Row class describes one row of your DataFrame and has the same schema as the DataFrame. That is why split is not available on it: you first have to take the string field out of the Row. To do that:

myRDD
  // split takes a regex, so the pipe must be escaped
  .map(row => row.getString(row.fieldIndex("text")).split("\\|"))
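The schema in the question marks text as nullable, so calling split directly on the extracted value could throw a NullPointerException on rows where the field is missing. Below is a rough sketch of a null-safe variant. It assumes Spark 2.0+ (SparkSession rather than the sqlContext used in the question) and runs on a local master with made-up sample rows instead of the JSON file on HDFS; the names SplitTextRdd and splitText are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SplitTextRdd {
  // Null-safe split on a literal pipe; a null field yields an empty Seq.
  def splitText(text: String): Seq[String] =
    Option(text).map(_.split("\\|").toSeq).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark 2.0+).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-text")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, including a null, in place of the HDFS file.
    val data = Seq("a|b|c", null, "x|y").toDF("text")

    // Row.getAs looks the field up by name and returns null for a null value.
    val parts = data.select("text").rdd
      .map(row => splitText(row.getAs[String]("text")))

    parts.collect().foreach(p => println(p.mkString("[", ", ", "]")))
    spark.stop()
  }
}
```

Returning an empty Seq for null rows is just one possible choice; filtering those rows out beforehand, or keeping an Option, may fit better depending on what happens to the pieces afterwards.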
