How to read multiple files into different PySpark DataFrames using spark.read.jdbc
I have code that reads multiple files (>10) into different dataframes in PySpark. However, I would like to simplify this using a for loop and a reference variable or something like that. My code is as follows:
Features_PM = (spark.read
               .jdbc(url=jdbcUrl, table='Features_PM',
                     properties=connectionProperties))

Features_CM = (spark.read
               .jdbc(url=jdbcUrl, table='Features_CM',
                     properties=connectionProperties))
I tried something like this, but it didn't work:
table_list = ['table1', 'table2', 'table3', 'table4']

for table in table_list:
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
Source for the above snippet: https://community.cloudera.com/t5/Support-Questions/read-multiple-table-parallel-using-Spark/td-p/286498
Any help would be appreciated. Thanks.
You can use the following code to achieve your end goal. You will get a dictionary of dataframes where the key is the table name and the value is the corresponding dataframe:
def read_table(opts):
    return spark.read.format("jdbc").options(**opts).load()

table_list = ['table1', 'table2', 'table3', 'table4']

table_df_dict = {table: read_table({"url": "jdbc:postgresql:dbserver",
                                    "dbtable": "schema.{}".format(table),
                                    "user": "username",
                                    "password": "password"})
                 for table in table_list}
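The key difference from the loop in the question is that the loop rebinds a single variable (jdbcDF) on every pass, so only the last table's dataframe survives, while the dict comprehension keeps one entry per table. A minimal sketch of that difference in plain Python, using a stand-in function instead of the real Spark JDBC read (the table names are the same placeholders as above):

```python
# Stand-in for spark.read.format("jdbc")...load(); returns a label
# instead of a real DataFrame so the pattern can be seen in isolation.
def read_table(opts):
    return "df_for_{}".format(opts["dbtable"])

table_list = ['table1', 'table2', 'table3', 'table4']

# Loop version: jdbcDF is overwritten on each iteration.
jdbcDF = None
for table in table_list:
    jdbcDF = read_table({"dbtable": "schema.{}".format(table)})
# After the loop, only the last table's result remains in jdbcDF.

# Dict-comprehension version: every table keeps its own entry.
table_df_dict = {table: read_table({"dbtable": "schema.{}".format(table)})
                 for table in table_list}
```

With the real Spark reader plugged back in, `table_df_dict['table2']` gives you the dataframe for table2, and you can iterate over `table_df_dict.items()` to process every table.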