
How to filter out rows from spark dataframe containing unreadable characters

I am reading a parquet file containing some fields like device ID, imei, etc. This parquet file was written by reading a sequence file made of cascading.tuple.Tuple(s).

Some rows contain unreadable characters which I want to ditch completely.

Here is how I am reading the file:

val sparkSession = SparkSession.builder()
  .master(sparkMaster)
  .appName(sparkAppName)
  .config("spark.driver.memory", "32g")
  .getOrCreate()

sparkSession.sparkContext.hadoopConfiguration.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization") 

val df=sparkSession.read.parquet("hdfs://**.46.**.2*2:8020/test/oldData.parquet")

df.printSchema()

val filteredDF = df.select($"$DEVICE_ID", $"$DEVICE_ID_NEW", $"$IMEI", $"$WIFI_MAC_ADDRESS", $"$BLUETOOTH_MAC_ADDRESS", $"$TIMESTAMP")
  .filter($"$TIMESTAMP" > 1388534400 && $"$TIMESTAMP" < 1483228800)

filteredDF.show(100)

import org.apache.spark.sql.functions.{udf,col,regexp_replace,trim}

val len = udf { (colVal: String) => colVal.length }

val new1DF = filteredDF.select(trim(col("deviceId")).as("deviceId")) // alias, so col("deviceId") still resolves below

new1DF.show(100)

val newDF = new1DF.filter(len(col("deviceId")) < 20)

newDF.show(100)
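Note in passing: the schema marks deviceId as nullable, and a plain UDF like len above throws a NullPointerException on null input. A minimal null-safe sketch in plain Scala (Spark's built-in org.apache.spark.sql.functions.length is also null-safe and avoids the UDF entirely):

```scala
// Hypothetical null-safe length, mirroring what the `len` UDF computes.
val safeLen: String => Int = s => Option(s).map(_.length).getOrElse(0)

println(safeLen("0577bc8a29754939")) // 16
println(safeLen(null))               // 0 instead of a NullPointerException
```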

Even after applying the length filter (< 20) on deviceId, I still get rows with very long device IDs consisting mostly of whitespace and unreadable characters.

Can someone point out some leads that may help me filter out such rows?

I have also tried to filter out device IDs containing special characters, using this:

df.filter($"$DEVICE_ID" rlike "/[^\�]/g")

I got an empty dataframe.
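A likely reason: rlike takes a Java regex, where JavaScript-style /.../g delimiters are not syntax at all — the slashes and the trailing g become literal characters, so almost nothing matches. A standalone sketch (plain Scala, no Spark) of the substring, Matcher.find-style semantics that rlike applies:

```scala
import java.util.regex.Pattern

// rlike uses Java regex with substring (Matcher.find) semantics.
// In Java regex there are no /.../g delimiters; "/" and "g" are literals.
def rlikeSim(s: String, re: String): Boolean =
  Pattern.compile(re).matcher(s).find()

println(rlikeSim("abc", "/[^x]/g"))    // false: no literal "/…/g" substring
println(rlikeSim("a/b/gc", "/[^x]/g")) // true: the literal shape is present
```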

SCHEMA:

root
 |-- deviceId: string (nullable = true)
 |-- deviceIdNew: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- wifiMacAddress: string (nullable = true)
 |-- bluetoothMacAddress: string (nullable = true)
 |-- timestamp: long (nullable = true)

ROWS WITH UNREADABLE CHARACTERS:

+--------------------+
|      trim(deviceId)|
+--------------------+
|                    |
|+~C���...|
|���
    Cv�...|
|���
    Cv�...|
|             �#Inten|
|                �$
                   �|
|                    |
|                    |
|                    |
|                    |
|    0353445a712d877b|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    08bdae9e37b48080|


    val filteredDF = df.select("deviceId")
                       .filter(len(col("deviceId")) < 17)
                       .filter($"$DEVICE_ID" rlike "^([A-Z]|[0-9]|[a-z])+$")

solved the issue.

What I was not using earlier were the regex anchors ^ (start of string) and $ (end of string). They ensure that only rows whose deviceId consists entirely of alphanumeric characters get through the filter.
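A quick standalone illustration of why the anchors matter under rlike's substring-matching semantics (plain Scala; the dirty value is a stand-in for the garbled rows shown above):

```scala
import java.util.regex.Pattern

// Simulates rlike: Java regex with substring (Matcher.find) semantics.
def rlikeSim(s: String, re: String): Boolean =
  Pattern.compile(re).matcher(s).find()

val clean = "0577bc8a29754939"
val dirty = "\u0000#Inten" // stand-in for a garbled deviceId

println(rlikeSim(clean, "^[A-Za-z0-9]+$")) // true
println(rlikeSim(dirty, "^[A-Za-z0-9]+$")) // false: anchors reject it
println(rlikeSim(dirty, "[A-Za-z0-9]+"))   // true: unanchored still finds "Inten"
```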

An online regex tester really helped me to generate and test the desired regular expression.

You can filter by regular expression. For example, you might use regexp_replace to replace all the unreadable characters (i.e. everything except alphanumeric, printable, or whatever you decide) with some value (e.g. a 21-character constant or even the empty string) and then filter according to that.
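That replace-then-compare idea can be sketched in plain Scala (in Spark it would be regexp_replace(col("deviceId"), "[^A-Za-z0-9]", "") plus a filter comparing the cleaned column with the original; the character class is an assumption about what counts as readable):

```scala
// Keep only rows whose value is unchanged after stripping non-alphanumerics.
val rows = Seq("0577bc8a29754939", "\u0000#Inten", "   ", "08bdae9e37b48080")
val readable = rows.filter(r => r.replaceAll("[^A-Za-z0-9]", "") == r)

println(readable) // List(0577bc8a29754939, 08bdae9e37b48080)
```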
