Replace pyspark column with another dataframe's column / merge pyspark dataframes
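I have two dataframes, listed below, together with the expected output. The dataframes differ in the "college" column, and the second dataframe is one row shorter. Where student_ID and student_NAME match, I want to replace the "college" column in df2 with the "college" column from df1. Does anyone know how to get the expected output?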
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [["1", "Amit", "DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
# creating a dataframe from the lists of data
df1 = spark.createDataFrame(data, columns)
data2 = [["1", "Amit", "jewf"],
         ["4", "sridevi", "wfv"],
         ["2", "Mohit", "efgew"],
         ["3", "rohith", "vwefv"],
         ["1", "sravan", "KLMP"],
         ["5", "gnanesh", "wfvw"]]
# specify column names
columns2 = ['student_ID', 'student_NAME', 'college']
# creating a dataframe from the lists of data
df2 = spark.createDataFrame(data2, columns2)
# expected output:
# [["1", "Amit", "DU"],
# ["4", "sridevi", "LPU"],
# ["2", "Mohit", "DU"],
# ["3", "rohith", "BHU"],
# ["5", "sravan", "IIT"]]
This can be achieved with a workaround: use .withColumnRenamed() to rename df1.college to some other column name, say school_1. Join df1 and df2 on student_ID and store the result in df3. Then select the required columns from df3.