
Remove non-ASCII and specific characters from a dataframe column using Pyspark

I would like to clean up the data in a dataframe column City. It can have the following values:

Venice® VeniceÆ Venice? Venice

I would like to remove all the non-ASCII characters as well as ? and . How can I achieve it?

You can clean up the strings with a regex by filtering only on letters:

# create dataframe
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice")]

schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()

+---+--------+
|id |name    |
+---+--------+
|1  |Venice® |
|2  |VeniceÆ |
|3  |Venice? |
|4  |Venice  |
+---+--------+

# apply regular expression
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()

+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1| Venice®|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+

PS: I doubt you will see such characters after a correct import into Spark, though. Superscript characters, for example, are ignored.
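Note that `[^a-zA-Z]` also strips digits and spaces, which would mangle a multi-word city such as "New York". If you want to match the question literally (remove only non-ASCII characters plus ? and .), a narrower pattern works; here is a minimal sketch using Python's `re` module to show the pattern itself, with a made-up city list for illustration:

```python
import re

# Remove any non-ASCII character (outside \x00-\x7F) and the literal ? and .
# Assumption: digits, spaces, and other ASCII punctuation should be kept.
pattern = r"[^\x00-\x7F]|[?.]"

cities = ["Venice®", "VeniceÆ", "Venice?", "New York."]
cleaned = [re.sub(pattern, "", c) for c in cities]
print(cleaned)  # ['Venice', 'Venice', 'Venice', 'New York']
```

The same pattern string can be passed to `f.regexp_replace(f.col("name"), r"[^\x00-\x7F]|[?.]", "")`, since Spark's `regexp_replace` accepts Java regex syntax, where `\x00-\x7F` and the character classes behave the same way.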
