
Convert dictionary of columns from different dataframes to a Dataframe : pyspark

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary.

I now have a dictionary like this -

newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A, 
    'avgCostProgramYear': costs.COSTT4_P, 
    'averageNetPricePublic': costs.NPT4_PUB, 
}

{
 'schoolName': Column<b'INSTNM'>,
 'type': Column<b'CONTROL'>,
 'avgCostAcademicYear': Column<b'COSTT4_A'>,
 'avgCostProgramYear': Column<b'COSTT4_P'>,
 'averageNetPricePublic': Column<b'NPT4_PUB'>
}

I want to convert this dictionary to a Pyspark dataframe.

I have tried this method but the output is not what I was expecting -

newDFDict = {
    'schoolName': school.select("INSTNM").collect(),
    'type': school.select("CONTROL").collect(),
    'avgCostAcademicYear': costs.select("COSTT4_A").collect(), 
    'avgCostProgramYear': costs.select("COSTT4_P").collect(), 
    'averageNetPricePublic': costs.select("NPT4_PUB").collect(), 
}

newDF = sc.parallelize([newDFDict]).toDF()
newDF.show()
+---------------------+--------------------+--------------------+--------------------+--------------------+
|averageNetPricePublic| avgCostAcademicYear|  avgCostProgramYear|          schoolName|                type|
+---------------------+--------------------+--------------------+--------------------+--------------------+
| [[NULL], [NULL], ...|[[NULL], [NULL], ...|[[NULL], [NULL], ...|[[Community Colle...|[[1], [1], [1], [...|
+---------------------+--------------------+--------------------+--------------------+--------------------+

Is it even possible? If possible, how?

Is this the right way to do this? If not, how can I achieve this?

Using pandas is not an option as the data is pretty big (2-3 GB) and pandas is just too slow. I am running pyspark on my local machine.

Thanks in advance :)

These are 2 options I'd suggest:

Option 1 (union case to build dictionary):

You said you have >=10 tables (which you want to build the dictionary from) that share common columns (for example, 'schoolName', 'type', 'avgCostAcademicYear', 'avgCostProgramYear' and 'averageNetPricePublic' are common columns). You can then go for union or unionByName to form a single consolidated view of the data.

For example:

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df1

 union

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df2
 ....
union
select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from dfN

This will give you a consolidated view of your dictionary.
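
The same consolidation can be written with the DataFrame API. Below is a minimal sketch, assuming df1 ... dfN already expose the five common columns under the same names (the variable names here are illustrative, not taken from your code):

from functools import reduce

# Columns assumed to be common to every source table
common_cols = ["schoolName", "type", "avgCostAcademicYear",
               "avgCostProgramYear", "averageNetPricePublic"]

# df1 ... dfN are the source dataframes sharing those columns
dfs = [df1, df2, dfN]

# Project each table onto the common columns, then union them all.
# unionByName matches columns by name instead of by position.
consolidated = reduce(lambda a, b: a.unionByName(b),
                      [df.select(*common_cols) for df in dfs])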

Option 2 (in case you have common join columns only):

If you have some common join columns, you can also go for standard joins, no matter how many tables are present.

For a pseudo-SQL example:

select <dictionary columns> from table1, table2, table3, ... tablen where <join on the common columns in all tables (table1 ... tablen)>

Note: missing any join column will lead to a Cartesian product.
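
For your concrete dataframes, a join sketch could look like the following, assuming school and costs share a key column (the column name 'UNITID' below is a hypothetical placeholder for whatever your actual join key is):

# 'UNITID' is a hypothetical shared key; replace it with your real one.
# Joining on a key avoids the Cartesian product mentioned above.
combined = (
    school.join(costs, on="UNITID", how="inner")
          .select(
              school.INSTNM.alias("schoolName"),
              school.CONTROL.alias("type"),
              costs.COSTT4_A.alias("avgCostAcademicYear"),
              costs.COSTT4_P.alias("avgCostProgramYear"),
              costs.NPT4_PUB.alias("averageNetPricePublic"),
          )
)
combined.show()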
