PySpark dataframe: apply function to a row and add row to bottom of dataframe
I have a df that has only one row.
id |id2 |score|score2|
----------------------
0 |1 |4 |2 |
and I want to add a row at the bottom with each value as a fraction of the total, i.e. every number divided by 7:
0/7 |1/7 |4/7 |2/7 |
but the solution I came up with is incredibly slow:
temp = [i/7 for i in df.collect()[0]]
row = sc.parallelize(Row(temp)).toDF()
df.union(row)
This took 21 seconds to run, almost all of it in the last two lines of code. Is there a better way to do this? My other thought was to transpose the table, after which this could easily be done with df.withColumn(). Ideally, I also want to filter out the columns containing 0, but I haven't really looked into that yet.
Check this out and let me know if it helps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext
df = sc.parallelize([
    (0, 1, 4, 2)]).toDF(["id", "id2", "score", "score2"])
df2 = df.select(*[(F.col(column)/7).alias(column) for column in df.columns])
df3 = df.union(df2)
df3.show()
+---+-------------------+------------------+------------------+
| id|                id2|             score|            score2|
+---+-------------------+------------------+------------------+
|0.0|                1.0|               4.0|               2.0|
|0.0|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+---+-------------------+------------------+------------------+
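The divisor 7 in the question is the sum of the row's values (0 + 1 + 4 + 2). As an alternative, here is a minimal sketch that assumes the DataFrame really has a single row and that pulling it to the driver is acceptable. It computes the percent row in plain Python and builds both rows with one `createDataFrame` call, instead of hard-coding 7. The helper names `percent_rows` and `build_df` are hypothetical, not part of PySpark.

```python
def percent_rows(values):
    """Return [original row as floats, row divided by its own sum]."""
    total = sum(values)
    if total == 0:
        raise ValueError("row sums to zero; percentages are undefined")
    counts = tuple(float(v) for v in values)
    percents = tuple(v / total for v in values)
    return [counts, percents]


def build_df(spark, df):
    """Assumes `spark` is an active SparkSession and `df` has exactly one row."""
    first = df.first()                 # fetch the single row to the driver
    rows = percent_rows(list(first))   # plain Python, no Spark job
    # Both rows hold floats, so every column is inferred as double.
    return spark.createDataFrame(rows, schema=df.columns)


# The arithmetic can be checked without a Spark session:
rows = percent_rows([0, 1, 4, 2])
print(rows[0])
print(rows[1])
```

With a live session this would be called as `df3 = build_df(spark, df)`. Whether it is faster than the `union` version depends on the cluster, so it is worth timing both.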
If you want to filter out the columns containing 0, you can use the code below:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols)
df2 = df1.select(*[(F.col(column)/7).alias(column) for column in df1.columns])
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+
|                id2|             score|            score2|
+-------------------+------------------+------------------+
|                1.0|               4.0|               2.0|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+-------------------+------------------+------------------+
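The list comprehension that builds `non_zero_cols` calls `first()` once per column, and each call typically triggers its own Spark job. A possible variation, sketched here on the assumption that the DataFrame has a single row, fetches the row once and filters the column names in plain Python. The helper names `non_zero_columns` and `select_non_zero` are hypothetical. This version tests `!= 0` rather than `> 0`, so it would also keep negative values.

```python
def non_zero_columns(row):
    """Names of the columns whose value is non-zero, in original order."""
    return [name for name, value in row.items() if value != 0]


def select_non_zero(df):
    """Assumes `df` is a single-row PySpark DataFrame."""
    row = df.first().asDict()   # one fetch covers all columns
    return df.select(*non_zero_columns(row))


# The filtering logic can be checked without a Spark session:
cols = non_zero_columns({"id": 0, "id2": 1, "score": 4, "score2": 2})
print(cols)  # ['id2', 'score', 'score2']
```

With a live session, `df1 = select_non_zero(df)` would stand in for the first two lines of the snippet above.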
Please check the code below for a df with a type column:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols, F.lit('count').alias('type') )
df2 = df1.select(*[(F.col(column)/7).alias(column) for column in df1.columns
                   if column != 'type'], F.lit('percent').alias('type'))
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+-------+
|                id2|             score|            score2|   type|
+-------------------+------------------+------------------+-------+
|                1.0|               4.0|               2.0|  count|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|percent|
+-------------------+------------------+------------------+-------+