
Split and filter the column of data frame in Spark

I am working with Apache Spark. I have the following txt file:

05:49:56.604899 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ack 129899328, win 58, options [nop,nop,TS val 432623 ecr 432619], length 10136
05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02, ethertype IPv4 (0x0800), length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.], ack 10136, win 153, options [nop,nop,TS val 432623 ecr 432623], length 0
05:49:56.604900 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 4410: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [P.], seq 10136:14480, ack 1, win 58, options [nop,nop,TS val 432623 ecr 432619], length 4344

Now I would like to extract the IP addresses and the timestamp from each line. For example, the output should look like this:

05:49:56.604899 10.0.0.2 54880 10.0.0.3 5001
05:49:56.604908 10.0.0.3 5001 10.0.0.2 54880
05:49:56.604900 10.0.0.2 54880 10.0.0.3 5001
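
For reference, here is a Spark-free sketch of how a single line of this form can be parsed with a plain regex. It only assumes the field layout visible in the three sample lines above, and the object name ParseLine is illustrative:

import scala.util.matching.Regex

object ParseLine {
  // timestamp ... srcIP.srcPort > dstIP.dstPort: ...
  val LinePattern: Regex =
    """(\S+) .*?(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):.*""".r

  def parse(line: String): Option[(String, String, String, String, String)] =
    line match {
      case LinePattern(ts, srcIp, srcPort, dstIp, dstPort) =>
        Some((ts, srcIp, srcPort, dstIp, dstPort))
      case _ => None // lines that do not match the sample layout are skipped
    }

  def main(args: Array[String]): Unit = {
    val sample = "05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02, ethertype IPv4 (0x0800), length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.], ack 10136, win 153, options [nop,nop,TS val 432623 ecr 432623], length 0"
    println(parse(sample)) // Some((05:49:56.604908,10.0.0.3,5001,10.0.0.2,54880))
  }
}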

Here is the code I used:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ML_Test {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("saeed_test").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val sqlContext = new SQLContext(sc)

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .schema(customSchema)
      .load("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")

    val selectedData = df.select("column0", "column1", "column2")
    selectedData.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/Users/saeedtkh/Desktop/sharedsaeed/tempoutput.txt")
  }
}

However, this is all I could extract (I also could not apply a split function at this point):

column0                                                  column1              column2
05:29:59.546965 00:00:00:00:00:01 > 00:00:00:00:00:03  ethertype IPv4 (0x0800) 05:29:59.546965 00:00:00:00:00:01 > 00:00:00:00:00:05
05:29:59.546986 00:00:00:00:00:01 > 00:00:00:00:00:03  ethertype IPv4 (0x0800)  length 66: 10.0.0.1.5001 > 10.0.0.3.43906: Flags [.]
05:29:59.546986 00:00:00:00:00:01 > 00:00:00:00:00:03  ethertype IPv4 (0x0800)  length 66: 10.0.0.1.5001 > 10.0.0.3.43906: Flags [.]
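
That output is consistent with what a CSV reader does to these lines: each line is split on commas, and since the schema declares only three columns, only the first three comma-separated tokens survive, so the addresses stay buried inside column0 and column2. A quick Spark-free illustration of that split:

val line =
  "05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02, ethertype IPv4 (0x0800), length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.], ack 10136, win 153, options [nop,nop,TS val 432623 ecr 432623], length 0"

line.split(",").take(3).map(_.trim).foreach(println)
// 05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02
// ethertype IPv4 (0x0800)
// length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.]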

Can you help me modify this code to produce the result above?

Update1: When I tried to run answer number one, some of the symbols could not be resolved in my IDE: (screenshot of the unresolved symbols omitted)

I added the following import and the problem was solved:

import org.apache.spark.sql.Row

Update2: Following answer number one, when I ran the code in my IDE I got an empty folder as the result (process finished with exit code 1). The errors are:

17/05/24 09:45:52 ERROR Utils: Aborting task
java.lang.ArrayIndexOutOfBoundsException: 2
    at ML_Test$$anonfun$2.apply(ML_Test.scala:28)
    at ML_Test$$anonfun$2.apply(ML_Test.scala:25)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/05/24 09:45:52 ERROR DefaultWriterContainer: Task attempt attempt_201705240945_0000_m_000001_0 aborted.
17/05/24 09:45:52 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
    at ML_Test$$anonfun$2.apply(ML_Test.scala:28)
    at ML_Test$$anonfun$2.apply(ML_Test.scala:25)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more
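
The ArrayIndexOutOfBoundsException: 2 means that, for at least one input line, splitting on > produced fewer than three parts (for example an empty or truncated line), and the code at ML_Test.scala:28 indexes array(2) directly. One way to avoid this is to keep the Try ... getOrElse guards from the answer below; another is to filter out short lines before building the Rows. A minimal sketch of the latter, reusing rdd, customSchema and sqlContext from the answer below (safeRowRdd and safeDf are illustrative names):

import org.apache.spark.sql.Row

// Drop lines that do not have two ">" separators (and hence fewer than
// three parts), so parts(0), parts(1) and parts(2) are always safe to read.
val safeRowRdd = rdd
  .map(_.split(">"))
  .filter(parts => parts.length >= 3 && parts(1).trim.split(" ").length >= 7)
  .map { parts =>
    val time = parts(0).trim.split(" ")(0)                  // "05:49:56.604899"
    val src  = parts(1).trim.split(" ")(6)                  // "10.0.0.2.54880"
    val dst  = parts(2).trim.split(" ")(0).replace(":", "") // "10.0.0.3.5001"
    Row.fromSeq(Seq(time, src, dst))
  }

val safeDf = sqlContext.createDataFrame(safeRowRdd, customSchema)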

The following can be your solution:

import scala.util.Try
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val customSchema = StructType(Array(
  StructField("column0", StringType, true),
  StructField("column1", StringType, true),
  StructField("column2", StringType, true)))

val rdd = sc.textFile("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
val rowRdd = rdd.map(line => line.split(">")).map(array => {
  // time stamp, source ip.port and destination ip.port; Try ... getOrElse ""
  // leaves a field empty when a line does not have the expected shape
  val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
  val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
  val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
  Row.fromSeq(Seq(first, second, third))
})

val dataFrame = sqlContext.createDataFrame(rowRdd, customSchema)

val selectedData = dataFrame.select("column0", "column1", "column2")
selectedData.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/Users/saeedtkh/Desktop/sharedsaeed/tempoutput.txt")
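
To see where the indices 0, 6 and 0 come from, here is how the first sample line decomposes after split(">") (values written out by hand from the sample above):

// array(0) = "05:49:56.604899 00:00:00:00:00:02 "
// array(1) = " 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 "
// array(2) = " 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ..."
//
// array(0).trim.split(" ")(0)                  -> "05:49:56.604899"  (column0, time stamp)
// array(1).trim.split(" ")(6)                  -> "10.0.0.2.54880"   (column1, source ip.port)
// array(2).trim.split(" ")(0).replace(":", "") -> "10.0.0.3.5001"    (column2, destination ip.port)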

I think this is what you needed. It may not be the most efficient solution, but it works.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.split
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark =
    SparkSession.builder().master("local").appName("test").getOrCreate()

  import spark.implicits._

  val customSchema = StructType(
    Array(StructField("column0", StringType, true),
          StructField("column1", StringType, true),
          StructField("column2", StringType, true)))

  val data = spark.read
    .schema(schema = customSchema)
    .csv("tempoutput.txt")

  // Both addresses live in column2 of the raw CSV, which is why column2 is
  // split for both of the output columns column1 and column2.
  data
    .withColumn("column0", split($"column0", " "))
    .withColumn("column1", split($"column2", " "))
    .withColumn("column2", split($"column2", " "))
    .select(
      $"column0".getItem(0).as("column0"),   // time stamp
      $"column1".getItem(3).as("column1"),   // source ip.port
      $"column2".getItem(5).as("column2")    // destination ip.port
    )
    .show()
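
If the result should be saved rather than displayed, the same chain (without the final .show()) can be bound to a value and written out with the built-in CSV writer of Spark 2.x; the value name extracted and the output path below are illustrative:

  import org.apache.spark.sql.SaveMode

  // Same transformation as above, without .show(), bound to a value.
  val extracted = data
    .withColumn("column0", split($"column0", " "))
    .withColumn("column1", split($"column2", " "))
    .withColumn("column2", split($"column2", " "))
    .select(
      $"column0".getItem(0).as("column0"),
      $"column1".getItem(3).as("column1"),
      $"column2".getItem(5).as("column2")
    )

  extracted.write
    .mode(SaveMode.Overwrite)
    .option("header", "true")
    .csv("/Users/saeedtkh/Desktop/sharedsaeed/tempoutput")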
