How do I filter noisy data in a column in an Apache Spark Maven project?
I'm working on an Apache Spark Java Maven project. I have subreddit comments like the ones in this figure:
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
|archived| author|author_flair_css_class|author_flair_text| body|controversiality|created_utc|distinguished|downs|edited|gilded| id| link_id| name| parent_id|retrieved_on|score|score_hidden| subreddit|subreddit_id|ups|
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
| true| bostich| null| null| test| 0| 1192450635| null| 0| false| 0|c0299an|t3_5yba3|t1_c0299an| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true|igiveyoumylife| null| null|much smoother.
...| 0| 1192450639| null| 0| false| 0|c0299ao|t3_5yba3|t1_c0299ao| t3_5yba3| 1427426409| 2| false|reddit.com| t5_6| 2|
| true| Arve| null| null|Can we please dep...| 0| 1192450643| null| 0| false| 0|c0299ap|t3_5yba3|t1_c0299ap|t1_c02999p| 1427426409| 0| false|reddit.com| t5_6| 0|
| true| [deleted]| null| null| [deleted]| 0| 1192450646| null| 0| false| 0|c0299aq|t3_5yba3|t1_c0299aq| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true| gigaquack| null| null|Oh, I see. Fancy ...| 0| 1192450646| null| 0| false| 0|c0299ar|t3_5yba3|t1_c0299ar|t1_c0299ah| 1427426409| 3| false|reddit.com| t5_6| 3|
| true| Percept| null| null| testing ...| 0| 1192450656| null| 0| false| 0|c0299as|t3_5yba3|t1_c0299as| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
I parse the data and show only the body column. I want to clean (filter out) [deleted] comments and comments with non-Latin-alphabet text in the body column. How can I do that? (Note: data size = 32 GB)
body:[Deleted]
body:How can I do that?
The following code snippet is written in Scala, but you can adapt it for Java. Use the Dataset.filter(..) method as follows:
import org.apache.spark.sql.{DataFrame, SparkSession}

// keep rows whose body is not "[Deleted]" and contains an ASCII character
val filteredData: DataFrame = dirtyData
  .filter(dirtyData("body") =!= "[Deleted]" && dirtyData("body").rlike("[\\x00-\\x7F]"))
Explanation

dirtyData("body") =!= "[Deleted]" drops all rows where the column body has the value [Deleted] (you might want to handle upper and lower case too). See Column.=!=
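The case-handling caveat above can be sketched in plain Java, mirroring what a case-insensitive Spark filter (e.g. lower(col("body")) =!= "[deleted]") would keep. The class and method names here are hypothetical, chosen just for this illustration:

```java
public class DeletedFilterDemo {
    // Mirrors a case-insensitive "body is not [deleted]" predicate
    static boolean keep(String body) {
        return !body.toLowerCase().equals("[deleted]");
    }

    public static void main(String[] args) {
        System.out.println(keep("[Deleted]")); // false: dropped despite the capital D
        System.out.println(keep("[deleted]")); // false: dropped
        System.out.println(keep("test"));      // true: kept
    }
}
```

With the original exact-match filter, only "[Deleted]" would be dropped and "[deleted]" rows (as shown in the sample data) would survive, so the case-insensitive variant is likely what you want.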
dirtyData("body").rlike("[\\x00-\\x7F]") drops all rows where body does NOT contain an ASCII character (I haven't researched this part much, but you can look for a better regex). See Column.rlike(..)
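Since rlike uses Java's regex engine, its behavior can be checked with plain java.util.regex, no Spark needed. One caveat worth knowing: rlike does an unanchored search, so "[\\x00-\\x7F]" keeps any row containing at least one ASCII character, and mixed-script comments survive. An anchored pattern like "[\\x00-\\x7F]+" with a full match (my assumption about what "non-Latin" should mean, not part of the original answer) keeps only fully-ASCII bodies:

```java
import java.util.regex.Pattern;

public class RlikeDemo {
    // Spark's rlike performs an unanchored search, like Matcher.find()
    static boolean rlike(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String latin = "much smoother.";
        String mixed = "\u0442\u0435\u0441\u0442 test"; // Cyrillic plus one ASCII word
        String nonLatin = "\u0442\u0435\u0441\u0442";   // no ASCII characters at all

        // The answer's regex: any row with at least one ASCII char passes
        System.out.println(rlike(latin, "[\\x00-\\x7F]"));    // true
        System.out.println(rlike(mixed, "[\\x00-\\x7F]"));    // true: mixed rows survive
        System.out.println(rlike(nonLatin, "[\\x00-\\x7F]")); // false

        // A full-match variant: only entirely-ASCII bodies pass
        System.out.println(latin.matches("[\\x00-\\x7F]+"));  // true
        System.out.println(mixed.matches("[\\x00-\\x7F]+"));  // false
    }
}
```

In Spark the stricter version would be rlike("^[\\x00-\\x7F]+$"), since rlike itself never anchors the pattern for you.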