PySpark remove string before a character from all column names
I have some column names in a dataset that contain three underscores (___). Using PySpark, I would like to remove all characters before the underscores, including the underscores themselves, and keep the remaining characters as the column names. I need the code to rename the columns dynamically instead of hard-coding column names. If ___ is at the start or end of a column name, only the ___ should be removed and the remaining characters left as they are.
Example:
Input column names:
sequence_number
department
user___first_name
user___last_name
phone___mobile1
___city
state___
zip_code
Desired output column names:
sequence_number
department
first_name
last_name
mobile1
city
state
zip_code
Try with this:
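import re

def normalize(name):
    """Remove everything up to and including '___' from a column name."""
    # Drop a trailing '___' first; rstrip("___") would strip any run of trailing underscores
    name = re.sub(r'___$', '', name)
    # The greedy match removes everything through the last remaining '___'
    return re.sub(r'^.*___', '', name)

# Normalize column names in the DataFrame
df = df.toDF(*[normalize(c) for c in df.columns])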
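The renaming rule can be checked on plain Python strings before it is applied to a DataFrame, so no Spark session is needed. The sketch below is only an illustration: `strip_prefix` is a hypothetical helper name, and the column list is the one from the question. It also looks for name collisions, on the assumption that duplicate output names are unwanted (for example, `a___id` and `b___id` would both become `id`).

```python
import re
from collections import Counter


def strip_prefix(name: str) -> str:
    """Hypothetical helper: drop one trailing '___', then everything
    up to and including the last remaining '___'."""
    name = re.sub(r"___$", "", name)
    return re.sub(r"^.*___", "", name)


# Column names taken from the question's example
columns = [
    "sequence_number", "department", "user___first_name", "user___last_name",
    "phone___mobile1", "___city", "state___", "zip_code",
]

renamed = [strip_prefix(c) for c in columns]
print(renamed)

# Names that appear more than once after renaming (assumed to be undesirable)
duplicates = [n for n, count in Counter(renamed).items() if count > 1]
print(duplicates)
```

If `duplicates` is empty, the same list comprehension can be passed to `df.toDF(*renamed)` as in the answer above.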