How to read multiple files into different PySpark DataFrames using spark.read.jdbc
I have code that reads multiple files (>10) into different dataframes in PySpark. However, I would like to simplify this using a for loop and a reference variable or something like that. My code is as follows:
Features_PM = (spark.read
               .jdbc(url=jdbcUrl, table='Features_PM',
                     properties=connectionProperties))

Features_CM = (spark.read
               .jdbc(url=jdbcUrl, table='Features_CM',
                     properties=connectionProperties))
I tried something like this, but it didn't work:
table_list = ['table1', 'table2', 'table3', 'table4']

for table in table_list:
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
Source for the above snippet: https://community.cloudera.com/t5/Support-Questions/read-multiple-table-parallel-using-Spark/td-p/286498
Any help would be appreciated. Thanks.
You can use the following code to achieve your end goal. You will get a dictionary of dataframes where the key is the table name and the value is the corresponding dataframe:
def read_table(opts):
    return spark.read.format("jdbc").options(**opts).load()

table_list = ['table1', 'table2', 'table3', 'table4']

table_df_dict = {table: read_table({"url": "jdbc:postgresql:dbserver",
                                    "dbtable": "schema.{}".format(table),
                                    "user": "username",
                                    "password": "password"})
                 for table in table_list}
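The key difference from the loop in the question is that the loop rebinds a single variable (jdbcDF) on every pass, so only the last table's dataframe survives, while the dict comprehension keeps one entry per table. A minimal sketch of that difference in plain Python, using a stand-in function instead of the real Spark JDBC read (the table names are the same placeholders as above):

```python
# Stand-in for spark.read.format("jdbc")...load(); returns a label
# instead of a real DataFrame so the pattern can be seen in isolation.
def read_table(opts):
    return "df_for_{}".format(opts["dbtable"])

table_list = ['table1', 'table2', 'table3', 'table4']

# Loop version: jdbcDF is overwritten on each iteration.
jdbcDF = None
for table in table_list:
    jdbcDF = read_table({"dbtable": "schema.{}".format(table)})
# After the loop, only the last table's result remains in jdbcDF.

# Dict-comprehension version: every table keeps its own entry.
table_df_dict = {table: read_table({"dbtable": "schema.{}".format(table)})
                 for table in table_list}
```

With the real Spark reader plugged back in, `table_df_dict['table2']` gives you the dataframe for table2, and you can iterate over `table_df_dict.items()` to process every table.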