
Convert dictionary of columns from different dataframes to a Dataframe : pyspark

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary.

I now have a dictionary like this -

newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A, 
    'avgCostProgramYear': costs.COSTT4_P, 
    'averageNetPricePublic': costs.NPT4_PUB, 
}

{
 'schoolName': Column<b'INSTNM'>,
 'type': Column<b'CONTROL'>,
 'avgCostAcademicYear': Column<b'COSTT4_A'>,
 'avgCostProgramYear': Column<b'COSTT4_P'>,
 'averageNetPricePublic': Column<b'NPT4_PUB'>
}

I want to convert this dictionary to a Pyspark dataframe.

I have tried this method but the output is not what I was expecting -

newDFDict = {
    'schoolName': school.select("INSTNM").collect(),
    'type': school.select("CONTROL").collect(),
    'avgCostAcademicYear': costs.select("COSTT4_A").collect(), 
    'avgCostProgramYear': costs.select("COSTT4_P").collect(), 
    'averageNetPricePublic': costs.select("NPT4_PUB").collect(), 
}

newDF = sc.parallelize([newDFDict]).toDF()
newDF.show()
+---------------------+--------------------+--------------------+--------------------+--------------------+
|averageNetPricePublic| avgCostAcademicYear|  avgCostProgramYear|          schoolName|                type|
+---------------------+--------------------+--------------------+--------------------+--------------------+
| [[NULL], [NULL], ...|[[NULL], [NULL], ...|[[NULL], [NULL], ...|[[Community Colle...|[[1], [1], [1], [...|
+---------------------+--------------------+--------------------+--------------------+--------------------+

Is it even possible? If possible, how?

Is this the right way to do this? If not, how can I achieve this?

Using pandas is not an option as the data is pretty big (2-3 GB) and pandas is just too slow. I am running pyspark on my local machine.

Thanks in advance :)

These are 2 options I'd suggest:

Option 1 (union case to build dictionary):

You said you have >=10 tables (which you want to build the dictionary from) that share common columns (for example, 'schoolName', 'type', 'avgCostAcademicYear', 'avgCostProgramYear' and 'averageNetPricePublic' are common columns). You can then go for union or unionByName to form a single consolidated view of the data.

For example:

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df1

 union

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df2
 ....
union
select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from dfN

This will give you a consolidated view of your dictionary.
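
The same consolidation can be written with the DataFrame API. Below is a minimal sketch, assuming df1 ... dfN already expose the five common columns under the same names (the variable names here are illustrative, not taken from your code):

from functools import reduce

# Columns assumed to be common to every source table
common_cols = ["schoolName", "type", "avgCostAcademicYear",
               "avgCostProgramYear", "averageNetPricePublic"]

# df1 ... dfN are the source dataframes sharing those columns
dfs = [df1, df2, dfN]

# Project each table onto the common columns, then union them all.
# unionByName matches columns by name instead of by position.
consolidated = reduce(lambda a, b: a.unionByName(b),
                      [df.select(*common_cols) for df in dfs])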

Option 2 (in case you have common join columns only):

If you have some common join columns, you can also go for standard joins, no matter how many tables are present.

For a pseudo-SQL example:

select <dictionary columns> from table1, table2, table3, ... tablen where <join on the common columns in all tables (table1 ... tablen)>

Note: missing any join column will lead to a Cartesian product.
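
For your concrete dataframes, a join sketch could look like the following, assuming school and costs share a key column (the column name 'UNITID' below is a hypothetical placeholder for whatever your actual join key is):

# 'UNITID' is a hypothetical shared key; replace it with your real one.
# Joining on a key avoids the Cartesian product mentioned above.
combined = (
    school.join(costs, on="UNITID", how="inner")
          .select(
              school.INSTNM.alias("schoolName"),
              school.CONTROL.alias("type"),
              costs.COSTT4_A.alias("avgCostAcademicYear"),
              costs.COSTT4_P.alias("avgCostProgramYear"),
              costs.NPT4_PUB.alias("averageNetPricePublic"),
          )
)
combined.show()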
