
How to filter out rows from spark dataframe containing unreadable characters

I am reading a Parquet file containing fields such as device ID, IMEI, etc. This Parquet file was written by reading a sequence file made of cascading.tuple.Tuple(s).

Some rows contain unreadable characters, which I want to ditch completely.

Here is how I am reading the file:

val sparkSession = SparkSession.builder()
  .master(sparkMaster)
  .appName(sparkAppName)
  .config("spark.driver.memory", "32g")
  .getOrCreate()

sparkSession.sparkContext.hadoopConfiguration.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization") 

val df=sparkSession.read.parquet("hdfs://**.46.**.2*2:8020/test/oldData.parquet")

df.printSchema()

val filteredDF = df.select($"$DEVICE_ID", $"$DEVICE_ID_NEW", $"$IMEI", $"$WIFI_MAC_ADDRESS", $"$BLUETOOTH_MAC_ADDRESS", $"$TIMESTAMP")
  .filter($"$TIMESTAMP" > 1388534400 && $"$TIMESTAMP" < 1483228800)

filteredDF.show(100)
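
For context, the $"$DEVICE_ID" style above assumes the column names are held in Scala constants and that the implicits enabling the $ column interpolator are in scope; a minimal sketch of those assumed definitions (the constant values are hypothetical, chosen to match the schema shown below):

import sparkSession.implicits._ // enables the $"colName" column syntax

// Assumed constants holding the real column names (hypothetical values,
// matching the schema printed below):
val DEVICE_ID             = "deviceId"
val DEVICE_ID_NEW         = "deviceIdNew"
val IMEI                  = "imei"
val WIFI_MAC_ADDRESS      = "wifiMacAddress"
val BLUETOOTH_MAC_ADDRESS = "bluetoothMacAddress"
val TIMESTAMP             = "timestamp"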

import org.apache.spark.sql.functions.{udf,col,regexp_replace,trim}

val len = udf { colVal: String => colVal.size }

val new1DF = filteredDF.select(trim(col("deviceId")).alias("deviceId")) // alias so the column can be referenced as "deviceId" below

new1DF.show(100)

val newDF = new1DF.filter(len(col("deviceId")) < 20)

newDF.show(100)
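
As an aside, Spark ships a built-in length function, so the UDF above is not strictly needed; a minimal equivalent sketch (variable name illustrative):

import org.apache.spark.sql.functions.{col, length, trim}

// Same trim + length filter using Spark's built-in length() instead of a UDF;
// built-in functions stay inside Catalyst and avoid serializing a Scala closure.
val newDFBuiltin = filteredDF
  .select(trim(col("deviceId")).alias("deviceId"))
  .filter(length(col("deviceId")) < 20)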

Even after applying a filter on device IDs whose length is less than 20, I still get rows with very long device IDs consisting mostly of whitespace and unreadable characters.

Can someone point out some leads that may help me filter out such rows?

I have also tried to filter out device IDs containing special characters, using this:

df.filter($"$DEVICE_ID" rlike "/[^\�]/g") df.filter($“$ DEVICE_ID”rlike“/ [^ \\ uFFFD] / g”)

I got an empty DataFrame.
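
A likely reason for the empty result: rlike takes a Java regular expression and performs an unanchored find, so the JavaScript-style delimiters and flag in "/[^\\uFFFD]/g" are treated as literal characters that no deviceId contains. A minimal sketch, assuming the goal is to drop rows whose deviceId contains the Unicode replacement character U+FFFD:

// Keep only rows whose deviceId does NOT contain U+FFFD anywhere.
// "\uFFFD" is the replacement character; rlike does a substring match,
// so no anchors or delimiters are needed here.
val withoutGarbage = df.filter(!$"deviceId".rlike("\uFFFD"))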

SCHEMA:

root
 |-- deviceId: string (nullable = true)
 |-- deviceIdNew: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- wifiMacAddress: string (nullable = true)
 |-- bluetoothMacAddress: string (nullable = true)
 |-- timestamp: long (nullable = true)

ROWS WITH UNREADABLE CHARACTERS:

+--------------------+
|      trim(deviceId)|
+--------------------+
|                    |
|+~C���...|
|���
    Cv�...|
|���
    Cv�...|
|             �#Inten|
|                �$
                   �|
|                    |
|                    |
|                    |
|                    |
|    0353445a712d877b|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    08bdae9e37b48080|

UNREADABLE ROW VALUES

    val filteredDF = df.select("deviceId")
                       .filter(len(col("deviceId")) < 17)
                       .filter($"$DEVICE_ID" rlike "^([A-Z]|[0-9]|[a-z])+$")

solved the issue.

What I was not using earlier were the regex anchors ^ (start of match) and $ (end of match). This ensures that only rows whose deviceId value matches the pattern exactly get through the filter.
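
To make the difference concrete, here is a small contrast (a sketch against the trimmed deviceId column):

// Unanchored: passes any row whose deviceId merely CONTAINS an alphanumeric
// run somewhere, so garbled values with embedded junk still slip through.
val loose  = df.filter($"deviceId".rlike("[A-Za-z0-9]+"))

// Anchored: passes only rows whose deviceId is alphanumeric from the very
// start (^) to the very end ($), rejecting anything with stray characters.
val strict = df.filter($"deviceId".rlike("^[A-Za-z0-9]+$"))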

This website really helped me to generate and test the desired regular expression.

You can filter by regular expression. For example, you might use regexp_replace to replace all the unreadable characters (i.e. everything EXCEPT alphanumeric, printable, or whatever you decide counts as readable) with some value (e.g. a 21-character constant or even an empty string) and then filter according to that.
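
One way to realize that suggestion, as a rough sketch rather than the answerer's exact code: strip everything except alphanumerics with regexp_replace and keep only the rows that pass through unchanged.

import org.apache.spark.sql.functions.{col, length, regexp_replace}

// Blank out every character that is not alphanumeric; a fully readable
// deviceId survives the replacement with its length intact.
val stripped   = regexp_replace(col("deviceId"), "[^A-Za-z0-9]", "")
val readableDF = df.filter(length(stripped) === length(col("deviceId")))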
