[英]Replace characters in column names in pyspark data frames
I have a dataframew like below in Pyspark我在 Pyspark 中有一个如下所示的数据框
df = spark.createDataFrame([(2,'john',1,1),
(2,'john',1,2),
(3,'pete',8,3),
(3,'pete',8,4),
(5,'steve',9,5)],
['id','/na/me','val/ue', 'rank/'])
df.show()
+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Now in this data frame I want to replace the column names where /
to under scrore _
.现在在这个数据框中,我想替换
/
到 scrore _
下的列名。 But if the /
comes at the start or end of the column name then remove the /
but don't replace with _
.但是如果
/
出现在列名的开头或结尾,则删除/
但不要替换为_
。
I have done like below我做了如下
for name in df.schema.names:
df = df.withColumnRenamed(name, name.replace('/', '_'))
>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]
>>>df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
How can I achieve my desired result which is below我怎样才能达到我想要的结果,如下所示
+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Try with regular expression
replace(re.sub) in python way.尝试以 python 方式使用
regular expression
替换(re.sub)。
import re
cols=[re.sub(r'(^_|_$)','',f.replace("/","_")) for f in df.columns]
df = spark.createDataFrame([(2,'john',1,1),
(2,'john',1,2),
(3,'pete',8,3),
(3,'pete',8,4),
(5,'steve',9,5)],
['id','/na/me','val/ue', 'rank/'])
df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
#or using for loop on schema.names
for name in df.schema.names:
df = df.withColumnRenamed(name, re.sub(r'(^_|_$)','',name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.