
PySpark: dynamic union of DataFrames with different columns

Consider the arrays shown here. I have 3 sets of arrays:

Array 1:

C1  C2  C3
1   2   3
9   5   6

Array 2:

C2 C3 C4
11 12 13
10 15 16

Array 3:

C1   C4
111  112
110  115

I need the output shown below. The input may contain any subset of C1, ..., C4, but when combining them the values must line up under the correct columns, and any missing value should be zero.

Expected output:

C1  C2  C3  C4
1   2   3   0
9   5   6   0
0   11  12  13
0   10  15  16
111 0   0   112
110 0   0   115

I have written PySpark code, but I have hardcoded the values for the new columns, and it is raw. I need to convert the code below into something like method overloading so that the script runs automatically. I need to use only Python/PySpark, not pandas.

import pyspark
from pyspark.sql.functions import lit

sqlContext = pyspark.SQLContext(pyspark.SparkContext())

df01 = sqlContext.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = sqlContext.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = sqlContext.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

# Hardcoded: add whichever columns are missing with value 0, then reorder
df01_add = df01.withColumn("C4", lit(0)).select("C1", "C2", "C3", "C4")
df02_add = df02.withColumn("C1", lit(0)).select("C1", "C2", "C3", "C4")
df03_add = df03.withColumn("C2", lit(0)).withColumn("C3", lit(0)).select("C1", "C2", "C3", "C4")

df_uni = df01_add.union(df02_add).union(df03_add)
df_uni.show()

Method Overloading Example:

class Student:
    def __init__(self, m1, m2):
        self.m1 = m1
        self.m2 = m2

    def sum(self, c1=None, c2=None, c3=None, c4=None):
        s = 0
        if c1 is not None and c2 is not None and c3 is not None:
            s = c1 + c2 + c3
        elif c1 is not None and c2 is not None:
            s = c1 + c2
        else:
            s = c1
        return s

s1 = Student(0, 0)
print(s1.sum(55, 65, 23))

There are probably plenty of better ways to do it, but maybe the below is useful to anyone in the future.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder\
    .appName("DynamicFrame")\
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11,12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111,112), (110, 115)], ("C1", "C4"))

dataframes = [df01, df02, df03]

# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)

# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
    dfs[new_name] = d
    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted

# Now put it all together with a loop (union)
result = dfs['df0']              # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())    # List of all the dataframes in the dictionary
dfs_to_add.remove('df0')         # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()

Output:

+---+---+---+---+
| C1| C2| C3| C4|
+---+---+---+---+
|  1|  2|  3|  0|
|  9|  5|  6|  0|
|  0| 11| 12| 13|
|  0| 10| 15| 16|
|111|  0|  0|112|
|110|  0|  0|115|
+---+---+---+---+
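
As a side note that is not part of the original answer: on Spark 3.1 and later, unionByName with allowMissingColumns=True can replace the manual padding loop. The missing columns come back as null rather than 0, so a fillna(0) is needed to match the expected output. A minimal sketch, reusing the dataframes list defined above:

from functools import reduce

# Sketch (assumes Spark 3.1+): align columns by name, letting Spark add the
# missing ones as null, then turn those nulls into zeros and sort the columns.
result2 = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes
)
result2 = result2.fillna(0).select(sorted(result2.columns))
result2.show()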


I would try

df = df1.join(df2, ['each', 'shared', 'col'], how='full')
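
Applied to the example data, that suggestion might look like the sketch below. The full_join helper and the reduce fold are assumptions on top of the one-liner above, and fillna(0) is needed because the full outer join produces nulls rather than zeros. It reuses the df01/df02/df03 frames defined earlier.

from functools import reduce

def full_join(left, right):
    # Join on every column the two frames share, keeping unmatched rows from both sides
    shared = [c for c in left.columns if c in right.columns]
    return left.join(right, on=shared, how='full')

joined = reduce(full_join, [df01, df02, df03]).fillna(0)
joined.select("C1", "C2", "C3", "C4").show()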
