I'm working on an Apache Spark Java Maven project. I have subreddit comments like those shown in this figure:
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
|archived| author|author_flair_css_class|author_flair_text| body|controversiality|created_utc|distinguished|downs|edited|gilded| id| link_id| name| parent_id|retrieved_on|score|score_hidden| subreddit|subreddit_id|ups|
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
| true| bostich| null| null| test| 0| 1192450635| null| 0| false| 0|c0299an|t3_5yba3|t1_c0299an| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true|igiveyoumylife| null| null|much smoother.
...| 0| 1192450639| null| 0| false| 0|c0299ao|t3_5yba3|t1_c0299ao| t3_5yba3| 1427426409| 2| false|reddit.com| t5_6| 2|
| true| Arve| null| null|Can we please dep...| 0| 1192450643| null| 0| false| 0|c0299ap|t3_5yba3|t1_c0299ap|t1_c02999p| 1427426409| 0| false|reddit.com| t5_6| 0|
| true| [deleted]| null| null| [deleted]| 0| 1192450646| null| 0| false| 0|c0299aq|t3_5yba3|t1_c0299aq| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true| gigaquack| null| null|Oh, I see. Fancy ...| 0| 1192450646| null| 0| false| 0|c0299ar|t3_5yba3|t1_c0299ar|t1_c0299ah| 1427426409| 3| false|reddit.com| t5_6| 3|
| true| Percept| null| null| testing ...| 0| 1192450656| null| 0| false| 0|c0299as|t3_5yba3|t1_c0299as| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
I parse the data and show only the body column. I want to clean (filter out) the [deleted] comments and the non-Latin-alphabet comments in the body column. How can I do that? (Note: data size = 32 GB.)

For example, body values look like:
body: [Deleted]
body: How can I do that?
The following code snippet is written in Scala, but you can adapt it for Java. Use the Dataset.filter(..) method as follows:
import org.apache.spark.sql.DataFrame

val filteredData: DataFrame = dirtyData
  .filter(dirtyData("body") =!= "[deleted]" && dirtyData("body").rlike("[\\x00-\\x7F]"))
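Since the question is about a Java project, here is a hypothetical Java adaptation of the same filter (an untested sketch: the class and method names are made up for illustration, and it assumes dirtyData is an existing Dataset&lt;Row&gt; with the schema shown above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;

public class CleanBody {
    // Drops rows whose body is "[deleted]" (any letter case) and rows whose
    // body contains no ASCII character at all.
    public static Dataset<Row> clean(Dataset<Row> dirtyData) {
        return dirtyData.filter(
                lower(col("body")).notEqual("[deleted]")
                        .and(col("body").rlike("[\\x00-\\x7F]")));
    }
}
```

In the Java API the Scala operator =!= becomes Column.notEqual(..); wrapping the column in lower(..) handles the upper/lower-case variants of [deleted] in one comparison.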
Explanation

dirtyData("body") =!= "[deleted]" drops all rows where the body column has the value [deleted] (you might want to handle upper and lower case too). See Column.=!=

dirtyData("body").rlike("[\\x00-\\x7F]") drops all rows where body does NOT contain at least one ASCII character (I haven't researched this part much; note that a single ASCII character, even a space, is enough to pass, so you may want a stricter regex such as "^[\\x00-\\x7F]*$", which requires the whole body to be ASCII). See Column.rlike(..)
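To see what that regex actually matches without spinning up Spark, here is a small plain-Java sketch using java.util.regex (class and method names are made up for illustration). It shows that the unanchored pattern [\x00-\x7F] passes any body containing a single ASCII character, while the anchored variant ^[\x00-\x7F]*$ only passes bodies made entirely of ASCII:

```java
import java.util.regex.Pattern;

public class BodyFilterDemo {
    // Unanchored, as used with rlike above: looks for ANY ASCII character.
    static final Pattern CONTAINS_ASCII = Pattern.compile("[\\x00-\\x7F]");
    // Anchored: the WHOLE body must consist of ASCII characters.
    static final Pattern ALL_ASCII = Pattern.compile("^[\\x00-\\x7F]*$");

    // Mimics the full filter: not "[deleted]" (case-insensitive) and all-ASCII.
    static boolean keep(String body) {
        return !body.equalsIgnoreCase("[deleted]")
                && ALL_ASCII.matcher(body).matches();
    }

    public static void main(String[] args) {
        String mixed = "Привет, reddit";  // mostly Cyrillic, some ASCII

        System.out.println(CONTAINS_ASCII.matcher(mixed).find());  // true
        System.out.println(ALL_ASCII.matcher(mixed).matches());    // false

        System.out.println(keep("test"));       // true
        System.out.println(keep("[Deleted]"));  // false
        System.out.println(keep(mixed));        // false
    }
}
```

The unanchored pattern would keep the mixed Cyrillic/ASCII row, which is probably not what you want; the anchored variant drops it.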