PySpark remove string before a character from all column names

Question

I have some column names in a dataset that have three underscores ___ in the string. Using PySpark, I would like to remove all characters before the underscores including the underscores, and keep the remaining characters as column names. I need the code to dynamically rename column names instead of writing column names in the code. If ___ is at the start or end of the column name, it should only remove ___ and leave remaining characters as it is.

Example:

Input column names:

sequence_number   
department  
user___first_name  
user___last_name  
phone___mobile1
___city  
state___
zip_code

Desired output column names:

sequence_number   
department  
first_name  
last_name  
mobile1
city  
state
zip_code

Answer 1

Try with this:

import re

def normalize(col):
    """removes *___ from beginning or end of column names"""
    col = col.rstrip("___")
    return re.sub(r'^(.*___)(.*)$', r'\2', col)

# nozmalize column names in dataframe
df = df.toDF(*[normalize(c) for c in df.columns])

PySpark remove string before a character from all column names

Question

1 answers

solution1
0 2022-07-28 15:21:57

PySpark remove string before a character from all column names

Question

1 answers

solution1 0 2022-07-28 15:21:57

solution1
0 2022-07-28 15:21:57