简体   繁体   中英

PySpark remove string before a character from all column names

I have some column names in a dataset that have three underscores ___ in the string. Using PySpark, I would like to remove all characters before the underscores including the underscores, and keep the remaining characters as column names. I need the code to dynamically rename column names instead of writing column names in the code. If ___ is at the start or end of the column name, it should only remove ___ and leave remaining characters as it is.

Example:

Input column names:

sequence_number   
department  
user___first_name  
user___last_name  
phone___mobile1
___city  
state___
zip_code

Desired output column names:

sequence_number   
department  
first_name  
last_name  
mobile1
city  
state
zip_code

Try with this:

import re

def normalize(col):
    """removes *___ from beginning or end of column names"""
    col = col.rstrip("___")
    return re.sub(r'^(.*___)(.*)$', r'\2', col)

# nozmalize column names in dataframe
df = df.toDF(*[normalize(c) for c in df.columns])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM