
How to read Postgres DB tables through an EMR JupyterLab notebook from Amazon WorkSpaces

I'm trying to read a table from a Postgres database, but I'm hitting the error below. Note: I can't reference external files from my local machine, since this is a private workspace.

JDBC URL example:

"url":"jdbc:postgresql://xxxx-xxxxx-postgresql-prod01.cluster-xxxx.xx-xx-1.rds.amazonaws.com:0000/db_xxx_txxx",

The error I'm getting:

An error was encountered:
An error occurred while calling o153.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:38)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:340)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

I've tried the code below.

tables = read_table(
    url=URL,
    table="information_schema.tables",
    driver=DRIVER,
    user=USER,
    password=PASS
)
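For context, `read_table` is presumably a thin wrapper around Spark's JDBC reader. A minimal sketch of what such a wrapper looks like in plain PySpark (the `spark` session and all connection values are assumptions, not from the original post):

```python
def read_table(spark, url, table, driver, user, password):
    """Read one table over JDBC into a DataFrame.

    The 'driver' option names the class Spark tries to load at read
    time -- when the Postgres JAR is not on the classpath, this is
    exactly where java.lang.ClassNotFoundException is raised.
    """
    return (
        spark.read.format("jdbc")
        .option("url", url)          # e.g. jdbc:postgresql://host:5432/db
        .option("dbtable", table)    # e.g. information_schema.tables
        .option("driver", driver)    # e.g. org.postgresql.Driver
        .option("user", user)
        .option("password", password)
        .load()
    )
```

Note that the wrapper can only succeed once the driver JAR is actually on the cluster's classpath, which is what the answer below addresses.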

You need to add the Postgres JDBC driver to the classpath as a dependency first.

First copy the JAR onto the cluster or to S3, then execute the following in the first cell:

%%configure -f
{
  "conf": {
    "spark.jars": "s3://JAR-LOCATION/postgresql.jar"
  }
}

Ref: Postgres JAR with EMR and Jupyter Notebooks

Alternatively, you can configure it when creating the SparkSession:

spark = SparkSession.builder.config('spark.driver.extraClassPath', '/JAR-LOCATION/postgresql.jar').getOrCreate()

Update: Based on your comment, since you can't push a JAR to the cluster or S3, you can pull it as a Maven dependency instead:

%%configure -f
{
  "conf": {"spark.jars.packages": "org.postgresql:postgresql:42.4.3"}
}
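The same Maven coordinate can also be passed when building the session directly. A hedged sketch (the helper function and version number are illustrative; `spark.jars.packages` expects comma-separated `groupId:artifactId:version` coordinates, which Spark resolves from Maven Central at session startup):

```python
def jdbc_packages_conf(coordinates):
    # Build the config entry Spark uses to download driver JARs at
    # startup; multiple coordinates are joined with commas.
    return {"spark.jars.packages": ",".join(coordinates)}

conf = jdbc_packages_conf(["org.postgresql:postgresql:42.4.3"])

# Usage in a PySpark environment (sketch, not executed here):
# spark = (SparkSession.builder
#          .config("spark.jars.packages", conf["spark.jars.packages"])
#          .getOrCreate())
```

Because the driver is fetched from Maven Central, this avoids staging the JAR yourself, but the cluster needs outbound network access to the repository.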
