Remove non-ASCII and specific characters from a dataframe column using PySpark
I would like to clean up the data in a dataframe column City. It can have the following values:

Venice® VeniceÆ Venice? Venice
I would like to remove all the non-ASCII characters as well as ? and . How can I achieve it?
You can clean up the strings with a regex by keeping only letters:
# create a sample dataframe
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice")]
schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()
+---+--------+
|id |name |
+---+--------+
|1 |Venice®|
|2 |VeniceÆ |
|3 |Venice? |
|4 |Venice |
+---+--------+
# apply the regular expression: drop every character that is not a letter
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()
+---+--------+----------+
| id| name|clean_name|
+---+--------+----------+
| 1|Venice®| Venice|
| 2| VeniceÆ| Venice|
| 3| Venice?| Venice|
| 4| Venice| Venice|
+---+--------+----------+
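Note that the `[^a-zA-Z]` pattern also strips spaces and digits, which would mangle multi-word city names such as "New York". An alternative, if you want to keep that regular content, is to remove only the characters outside printable ASCII plus the unwanted `?` and `.`. As a sketch of that pattern in plain Python (the hypothetical `clean_city` helper is mine, not from the original answer; the same regex can be passed to Spark's `regexp_replace`):

```python
import re

def clean_city(name: str) -> str:
    # strip anything outside printable ASCII (\x20-\x7E), plus "?" and ".",
    # then trim any leftover surrounding whitespace
    return re.sub(r"[^\x20-\x7E]|[?.]", "", name).strip()

print(clean_city("Venice®"))   # Venice
print(clean_city("VeniceÆ"))   # Venice
print(clean_city("Venice?"))   # Venice
print(clean_city("New York.")) # New York  (inner space preserved)
```

In Spark this would become `f.trim(f.regexp_replace(f.col("name"), r"[^\x20-\x7E]|[?.]", ""))`, assuming the same pattern semantics in Java regex.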
PS: But I doubt that you will see such characters after a correct import into Spark. Superscript characters, for example, are ignored.