
How to filter out rows from spark dataframe containing unreadable characters

I am reading a parquet file containing some fields like device ID, imei, etc. This parquet file was written by reading a sequence file made of cascading.tuple.Tuple(s).

Some rows contain unreadable characters which I want to ditch completely.

Here is how I am reading the file:

val sparkSession = SparkSession.builder()
  .master(sparkMaster)
  .appName(sparkAppName)
  .config("spark.driver.memory", "32g")
  .getOrCreate()

sparkSession.sparkContext.hadoopConfiguration.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization") 

val df=sparkSession.read.parquet("hdfs://**.46.**.2*2:8020/test/oldData.parquet")

df.printSchema()

val filteredDF = df.select($"$DEVICE_ID", $"$DEVICE_ID_NEW", $"$IMEI", $"$WIFI_MAC_ADDRESS", $"$BLUETOOTH_MAC_ADDRESS", $"$TIMESTAMP")
  .filter($"$TIMESTAMP" > 1388534400 && $"$TIMESTAMP" < 1483228800)

filteredDF.show(100)

import org.apache.spark.sql.functions.{udf,col,regexp_replace,trim}

val len = udf { (colVal: String) => colVal.length }

val new1DF = filteredDF.select(trim(col("deviceId")).as("deviceId")) // alias, so col("deviceId") still resolves below

new1DF.show(100)

val newDF = new1DF.filter(len(col("deviceId")) < 20)

newDF.show(100)
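Note in passing: the schema marks deviceId as nullable, and a plain UDF like len above throws a NullPointerException on null input. A minimal null-safe sketch in plain Scala (Spark's built-in org.apache.spark.sql.functions.length is also null-safe and avoids the UDF entirely):

```scala
// Hypothetical null-safe length, mirroring what the `len` UDF computes.
val safeLen: String => Int = s => Option(s).map(_.length).getOrElse(0)

println(safeLen("0577bc8a29754939")) // 16
println(safeLen(null))               // 0 instead of a NullPointerException
```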

Even after applying the length filter (< 20) on deviceId, I still get rows with very long device IDs consisting mostly of whitespace and unreadable characters.

Can someone point out some leads that may help me filter out such rows?

I have also tried to filter out device IDs containing special characters, using this:

df.filter($"$DEVICE_ID" rlike "/[^\�]/g")

I got an empty dataframe.
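A likely reason: rlike takes a Java regex, where JavaScript-style /.../g delimiters are not syntax at all — the slashes and the trailing g become literal characters, so almost nothing matches. A standalone sketch (plain Scala, no Spark) of the substring, Matcher.find-style semantics that rlike applies:

```scala
import java.util.regex.Pattern

// rlike uses Java regex with substring (Matcher.find) semantics.
// In Java regex there are no /.../g delimiters; "/" and "g" are literals.
def rlikeSim(s: String, re: String): Boolean =
  Pattern.compile(re).matcher(s).find()

println(rlikeSim("abc", "/[^x]/g"))    // false: no literal "/…/g" substring
println(rlikeSim("a/b/gc", "/[^x]/g")) // true: the literal shape is present
```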

SCHEMA:

root
 |-- deviceId: string (nullable = true)
 |-- deviceIdNew: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- wifiMacAddress: string (nullable = true)
 |-- bluetoothMacAddress: string (nullable = true)
 |-- timestamp: long (nullable = true)

ROWS WITH UNREADABLE CHARACTERS:

+--------------------+
|      trim(deviceId)|
+--------------------+
|                    |
|+~C���...|
|���
    Cv�...|
|���
    Cv�...|
|             �#Inten|
|                �$
                   �|
|                    |
|                    |
|                    |
|                    |
|    0353445a712d877b|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    08bdae9e37b48080|


    val filteredDF = df.select("deviceId")
                       .filter(len(col("deviceId")) < 17)
                       .filter($"$DEVICE_ID" rlike "^([A-Z]|[0-9]|[a-z])+$")

solved the issue.

What I was not using earlier were the regex anchors ^ (start of string) and $ (end of string). They ensure that only rows whose deviceId consists entirely of alphanumeric characters get through the filter.
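A quick standalone illustration of why the anchors matter under rlike's substring-matching semantics (plain Scala; the dirty value is a stand-in for the garbled rows shown above):

```scala
import java.util.regex.Pattern

// Simulates rlike: Java regex with substring (Matcher.find) semantics.
def rlikeSim(s: String, re: String): Boolean =
  Pattern.compile(re).matcher(s).find()

val clean = "0577bc8a29754939"
val dirty = "\u0000#Inten" // stand-in for a garbled deviceId

println(rlikeSim(clean, "^[A-Za-z0-9]+$")) // true
println(rlikeSim(dirty, "^[A-Za-z0-9]+$")) // false: anchors reject it
println(rlikeSim(dirty, "[A-Za-z0-9]+"))   // true: unanchored still finds "Inten"
```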

An online regex tester really helped me to generate and test the desired regular expression.

You can filter by regular expression. For example, you might use regexp_replace to replace all the unreadable characters (i.e. everything except alphanumeric, printable, or whatever you decide) with some value (e.g. a 21-character constant or even the empty string) and then filter according to that.
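That replace-then-compare idea can be sketched in plain Scala (in Spark it would be regexp_replace(col("deviceId"), "[^A-Za-z0-9]", "") plus a filter comparing the cleaned column with the original; the character class is an assumption about what counts as readable):

```scala
// Keep only rows whose value is unchanged after stripping non-alphanumerics.
val rows = Seq("0577bc8a29754939", "\u0000#Inten", "   ", "08bdae9e37b48080")
val readable = rows.filter(r => r.replaceAll("[^A-Za-z0-9]", "") == r)

println(readable) // List(0577bc8a29754939, 08bdae9e37b48080)
```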
