
How to filter out rows from spark dataframe containing unreadable characters

I am reading a Parquet file containing fields such as device ID, IMEI, etc. This Parquet file was written by reading a sequence file made of cascading.tuple.Tuple(s).

Some rows contain unreadable characters, which I want to ditch completely.

Here is how I am reading the file:

val sparkSession = SparkSession.builder()
  .master(sparkMaster)
  .appName(sparkAppName)
  .config("spark.driver.memory", "32g")
  .getOrCreate()

sparkSession.sparkContext.hadoopConfiguration.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization") 

val df=sparkSession.read.parquet("hdfs://**.46.**.2*2:8020/test/oldData.parquet")

df.printSchema()

val filteredDF = df.select($"$DEVICE_ID", $"$DEVICE_ID_NEW", $"$IMEI", $"$WIFI_MAC_ADDRESS", $"$BLUETOOTH_MAC_ADDRESS", $"$TIMESTAMP")
  .filter($"$TIMESTAMP" > 1388534400 && $"$TIMESTAMP" < 1483228800)

filteredDF.show(100)
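
For context, the $"$DEVICE_ID" style above assumes the column names are held in Scala constants and that the implicits enabling the $ column interpolator are in scope; a minimal sketch of those assumed definitions (the constant values are hypothetical, chosen to match the schema shown below):

import sparkSession.implicits._ // enables the $"colName" column syntax

// Assumed constants holding the real column names (hypothetical values,
// matching the schema printed below):
val DEVICE_ID             = "deviceId"
val DEVICE_ID_NEW         = "deviceIdNew"
val IMEI                  = "imei"
val WIFI_MAC_ADDRESS      = "wifiMacAddress"
val BLUETOOTH_MAC_ADDRESS = "bluetoothMacAddress"
val TIMESTAMP             = "timestamp"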

import org.apache.spark.sql.functions.{udf,col,regexp_replace,trim}

val len = udf { colVal: String => colVal.size }

val new1DF = filteredDF.select(trim(col("deviceId")).alias("deviceId")) // alias so the column can be referenced as "deviceId" below

new1DF.show(100)

val newDF = new1DF.filter(len(col("deviceId")) < 20)

newDF.show(100)
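
As an aside, Spark ships a built-in length function, so the UDF above is not strictly needed; a minimal equivalent sketch (variable name illustrative):

import org.apache.spark.sql.functions.{col, length, trim}

// Same trim + length filter using Spark's built-in length() instead of a UDF;
// built-in functions stay inside Catalyst and avoid serializing a Scala closure.
val newDFBuiltin = filteredDF
  .select(trim(col("deviceId")).alias("deviceId"))
  .filter(length(col("deviceId")) < 20)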

Even after applying a filter on device IDs whose length is less than 20, I still get rows with very long device IDs consisting mostly of whitespace and unreadable characters.

Can someone point out some leads that may help me filter out such rows?

I have also tried to filter out device IDs containing special characters, using this:

df.filter($"$DEVICE_ID" rlike "/[^\�]/g") df.filter($“$ DEVICE_ID”rlike“/ [^ \\ uFFFD] / g”)

I got an empty DataFrame.
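
A likely reason for the empty result: rlike takes a Java regular expression and performs an unanchored find, so the JavaScript-style delimiters and flag in "/[^\\uFFFD]/g" are treated as literal characters that no deviceId contains. A minimal sketch, assuming the goal is to drop rows whose deviceId contains the Unicode replacement character U+FFFD:

// Keep only rows whose deviceId does NOT contain U+FFFD anywhere.
// "\uFFFD" is the replacement character; rlike does a substring match,
// so no anchors or delimiters are needed here.
val withoutGarbage = df.filter(!$"deviceId".rlike("\uFFFD"))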

SCHEMA:

root
 |-- deviceId: string (nullable = true)
 |-- deviceIdNew: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- wifiMacAddress: string (nullable = true)
 |-- bluetoothMacAddress: string (nullable = true)
 |-- timestamp: long (nullable = true)

ROWS WITH UNREADABLE CHARACTERS:

+--------------------+
|      trim(deviceId)|
+--------------------+
|                    |
|+~C���...|
|���
    Cv�...|
|���
    Cv�...|
|             �#Inten|
|                �$
                   �|
|                    |
|                    |
|                    |
|                    |
|    0353445a712d877b|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    08bdae9e37b48080|

UNREADABLE ROW VALUES

    val filteredDF = df.select("deviceId")
                       .filter(len(col("deviceId")) < 17)
                       .filter($"$DEVICE_ID" rlike "^([A-Z]|[0-9]|[a-z])+$")

solved the issue.

What I was not using earlier were the regex anchors ^ (start of match) and $ (end of match). This ensures that only rows whose deviceId value matches the pattern exactly get through the filter.
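
To make the difference concrete, here is a small contrast (a sketch against the trimmed deviceId column):

// Unanchored: passes any row whose deviceId merely CONTAINS an alphanumeric
// run somewhere, so garbled values with embedded junk still slip through.
val loose  = df.filter($"deviceId".rlike("[A-Za-z0-9]+"))

// Anchored: passes only rows whose deviceId is alphanumeric from the very
// start (^) to the very end ($), rejecting anything with stray characters.
val strict = df.filter($"deviceId".rlike("^[A-Za-z0-9]+$"))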

This website really helped me to generate and test the desired regular expression.

You can filter by regular expression. For example, you might use regexp_replace to replace all the unreadable characters (i.e. everything EXCEPT alphanumeric, printable, or whatever you decide counts as readable) with some value (e.g. a 21-character constant or even an empty string) and then filter according to that.
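
One way to realize that suggestion, as a rough sketch rather than the answerer's exact code: strip everything except alphanumerics with regexp_replace and keep only the rows that pass through unchanged.

import org.apache.spark.sql.functions.{col, length, regexp_replace}

// Blank out every character that is not alphanumeric; a fully readable
// deviceId survives the replacement with its length intact.
val stripped   = regexp_replace(col("deviceId"), "[^A-Za-z0-9]", "")
val readableDF = df.filter(length(stripped) === length(col("deviceId")))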
