PySpark: remove the string before a separator from all column names

Question

I have some column names in a dataset that contain three underscores (___). Using PySpark, I would like to remove all characters before the underscores, including the underscores themselves, and keep the remaining characters as the column name. The code needs to rename the columns dynamically rather than hard-coding column names. If ___ is at the start or end of a column name, only the ___ should be removed and the remaining characters left as they are.

Example:

Input column names:

sequence_number   
department  
user___first_name  
user___last_name  
phone___mobile1
___city  
state___
zip_code

Desired output column names:

sequence_number   
department  
first_name  
last_name  
mobile1
city  
state
zip_code

Answer 1

Try this:

import re

def normalize(col):
    """Remove everything up to and including the last '___' in a column name."""
    # Drop a single trailing '___' first; rstrip("___") would strip every trailing underscore
    if col.endswith("___"):
        col = col[:-3]
    return re.sub(r'^.*___', '', col)

# normalize column names in dataframe
df = df.toDF(*[normalize(c) for c in df.columns])
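
The renaming rule can also be checked against the question's example names without a Spark session. The following is a minimal sketch, not part of the original answer: `strip_prefix` and `SEP` are hypothetical names, and it uses `str.rpartition` in place of a regular expression. The duplicate-name check is an added assumption, since the question does not say what should happen if two columns end up with the same name.

```python
SEP = "___"  # separator taken from the question


def strip_prefix(name: str) -> str:
    """Drop one trailing SEP, then everything up to and including the last SEP."""
    if name.endswith(SEP):
        name = name[: -len(SEP)]
    # rpartition returns ("", "", name) when SEP is absent, so such names pass through unchanged
    return name.rpartition(SEP)[2]


columns = [
    "sequence_number", "department", "user___first_name", "user___last_name",
    "phone___mobile1", "___city", "state___", "zip_code",
]
renamed = [strip_prefix(c) for c in columns]

# Assumption: colliding names are treated as an error rather than silently kept
if len(set(renamed)) != len(renamed):
    raise ValueError("renaming would produce duplicate column names")

print(renamed)
```

With a DataFrame `df`, the same function would be applied as `df.toDF(*[strip_prefix(c) for c in df.columns])`.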

Solution 1: 2022-07-28 15:21:57
