Could someone tell me how to connect to Spark using the Phoenix Spark connector rather than using Phoenix as a JDBC data source? (It works when used as a JDBC source, but performance is an issue: https://phoenix.apache.org/phoenix_spark.html.)
This is my attempt with the Phoenix driver, but it throws a "Table Not Found" exception.
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df2 = sqlContext.read\
.format("org.apache.phoenix.spark")\
.option("table", sql)\
.option("zkUrl", "<HOSTNAME>:<PORT>")\
.load()
results in
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 139, in load
return self._df(self._jreader.load())
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1625.load.
: org.apache.phoenix.schema.TableNotFoundException: ERROR 1012 (42M03): Table undefined. tableName=sql
at org.apache.phoenix.schema.PMetaDataImpl.getTableRef(PMetaDataImpl.java:244)
at org.apache.phoenix.jdbc.PhoenixConnection.getTable(PhoenixConnection.java:441)
at org.apache.phoenix.util.PhoenixRuntime.getTable(PhoenixRuntime.java:379)
at org.apache.phoenix.util.PhoenixRuntime.generateColumnInfo(PhoenixRuntime.java:405)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:279)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:105)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:57)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
The `sql` variable is pointless: the `table` option is not a SQL statement, so you need to create a whole DataFrame from `TABLE`. You then use Spark's DataFrame API to select columns `COL1` and `COL2` and to filter for `COL3 = 5`.
You can see in the documentation examples that `TABLE1` is created first, then used in the options, and then (in the Scala examples) everything else is DataFrame operations.
In your case, once you load the table correctly (without the `sql` variable), you would have:
df3 = df2.select('COL1', 'COL2').where('COL3 = 5')
Or, if you're looking for how Spark's SQL APIs work outside of Phoenix, see "Running SQL Queries Programmatically" in the Spark documentation: you run raw queries against the DataFrame after it is loaded; you do not pass the query into the construction of the DataFrame.
df = sqlContext.read\
.format("org.apache.phoenix.spark")\
.option("table", "TABLE")\
.option("zkUrl", "<HOSTNAME>:<PORT>")\
.load()
df.registerTempTable("TABLE")  # register the DataFrame so sqlContext.sql can find it by name; on Spark 2.x use df.createOrReplaceTempView("TABLE")
sqlDF = sqlContext.sql("SELECT COL1, COL2 FROM TABLE WHERE COL3 = 5")
sqlDF.show()