
PySpark Replace Characters using regex and remove column on Databricks

I am trying to remove a column and special characters from the dataframe shown below.

The code used to create the dataframe is as follows:

dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')

The above produces the following output:

[screenshot of the resulting dataframe]
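For reference, that read step assumes pandas, requests and io.StringIO are already imported; a minimal self-contained sketch (the endpoint URL here is hypothetical) would be:

from io import StringIO

import pandas as pd
import requests

# hypothetical endpoint returning pipe-delimited text
response = requests.get("https://example.com/countries.txt")

# response.text is already decoded, so a stray BOM from the file can survive
# as an odd character at the start of the first column header
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding="utf-8-sig")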

I need help with a regex to remove the "" characters and to delete the first column.

As regards the regex, I have tried the following:

dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', @"[^0-9a-zA-Z_]+"_ ""))

However, I'm getting a syntax error.
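(Note that the @"..." prefix is C# verbatim-string syntax rather than Python, the pattern and replacement need to be separated by a comma, and regexp_replace operates on Spark DataFrames, not pandas ones. A syntactically valid version of that call, assuming the data had first been converted with spark.createDataFrame, would look like the sketch below; it cleans the column's values rather than the column name itself.)

from pyspark.sql.functions import regexp_replace

# assumes sdf = spark.createDataFrame(dt) has been run first
sdf = sdf.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', "[^0-9a-zA-Z_]+", ""))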

Any help much appreciated.

If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:


import re

# pdf is the pandas dataframe (dt in the question)
colname = pdf.columns[0]
# strip everything except letters, digits, underscores and whitespace
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)

And for dropping the index column, you can refer to this Stack Overflow answer.
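Since the linked answer isn't quoted here, a rough sketch of that step, assuming pdf is the pandas dataframe and the unwanted first column was actually read in from the file:

# drop the first column if it is a real (e.g. unnamed) column in the file
pdf = pdf.drop(columns=pdf.columns[0])

# if it is only the pandas default index, it is not a real column and will
# simply not appear when exporting (index=False) or converting to Spark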

You have read in the data as a pandas dataframe. From what I see, you want a Spark dataframe. Convert from pandas to Spark; that will drop the pandas default index column, which in your case you refer to as the first column. You can then rename the columns. Code below:

df = spark.createDataFrame(dt).toDF('COUNTRY', 'COUNTRY NAME')
df.show()
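Putting the two answers together, a rough end-to-end sketch (assuming dt is the pandas dataframe from the question and spark is an active SparkSession):

import re

# strip stray BOM/special characters from every pandas column name first
dt.columns = [re.sub(r"[^0-9a-zA-Z_\s]+", "", c).strip() for c in dt.columns]

# converting to Spark drops the pandas default index (the unwanted first column)
sdf = spark.createDataFrame(dt)
sdf.show()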
