
How do I filter noisy data in a column in an Apache Spark Maven project?

I'm working with Apache Spark in a Java Maven project. I have subreddit comments like those shown below:

+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
|archived|        author|author_flair_css_class|author_flair_text|                body|controversiality|created_utc|distinguished|downs|edited|gilded|     id| link_id|      name| parent_id|retrieved_on|score|score_hidden| subreddit|subreddit_id|ups|
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
|    true|       bostich|                  null|             null|                test|               0| 1192450635|         null|    0| false|     0|c0299an|t3_5yba3|t1_c0299an|  t3_5yba3|  1427426409|    1|       false|reddit.com|        t5_6|  1|
|    true|igiveyoumylife|                  null|             null|  much smoother. ...|               0| 1192450639|         null|    0| false|     0|c0299ao|t3_5yba3|t1_c0299ao|  t3_5yba3|  1427426409|    2|       false|reddit.com|        t5_6|  2|
|    true|          Arve|                  null|             null|Can we please dep...|               0| 1192450643|         null|    0| false|     0|c0299ap|t3_5yba3|t1_c0299ap|t1_c02999p|  1427426409|    0|       false|reddit.com|        t5_6|  0|
|    true|     [deleted]|                  null|             null|           [deleted]|               0| 1192450646|         null|    0| false|     0|c0299aq|t3_5yba3|t1_c0299aq|  t3_5yba3|  1427426409|    1|       false|reddit.com|        t5_6|  1|
|    true|     gigaquack|                  null|             null|Oh, I see. Fancy ...|               0| 1192450646|         null|    0| false|     0|c0299ar|t3_5yba3|t1_c0299ar|t1_c0299ah|  1427426409|    3|       false|reddit.com|        t5_6|  3|
|    true|       Percept|                  null|             null|         testing ...|               0| 1192450656|         null|    0| false|     0|c0299as|t3_5yba3|t1_c0299as|  t3_5yba3|  1427426409|    1|       false|reddit.com|        t5_6|  1|

I parse the data and show only the body column. I want to clean (filter out) [deleted] comments and comments written in a non-Latin alphabet from the body column. How can I do that? (Note: data size = 32 GB)

body:[Deleted]
body:How can I do that?

The following code snippet is written in Scala, but you can try to adapt it for Java.


Use the Dataset.filter(..) method as follows:

import org.apache.spark.sql.{DataFrame, SparkSession}

val filteredData: DataFrame = dirtyData.
  filter(dirtyData("body") =!= "[Deleted]" && dirtyData("body").rlike("[\\x00-\\x7F]"))

Explanation

  • dirtyData("body") =!= "[Deleted]" drops all rows where the body column has the value [Deleted] (note that the sample data shows [deleted] in lower case, so you may want to handle upper and lower case too). See Column.=!=
  • dirtyData("body").rlike("[\\x00-\\x7F]") drops all rows where body does NOT contain at least one ASCII character (I haven't researched this part much, but you can look for a better regex). See Column.rlike(..)

