Could someone tell me how to connect to Spark using the Phoenix Spark connector rather than using Phoenix as a JDBC data source? (It works when used as a JDBC source, but performance is an issue: https://phoenix.apache.org/phoenix_spark.html.)
This is my attempt with the Phoenix driver, but it throws a "Table Not Found" exception.
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df2 = sqlContext.read\
.format("org.apache.phoenix.spark")\
.option("table", sql)\
.option("zkUrl", "<HOSTNAME>:<PORT>")\
.load()
results in
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 139, in load
return self._df(self._jreader.load())
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1625.load.
: org.apache.phoenix.schema.TableNotFoundException: ERROR 1012 (42M03): Table undefined. tableName=sql
at org.apache.phoenix.schema.PMetaDataImpl.getTableRef(PMetaDataImpl.java:244)
at org.apache.phoenix.jdbc.PhoenixConnection.getTable(PhoenixConnection.java:441)
at org.apache.phoenix.util.PhoenixRuntime.getTable(PhoenixRuntime.java:379)
at org.apache.phoenix.util.PhoenixRuntime.generateColumnInfo(PhoenixRuntime.java:405)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:279)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:105)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:57)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
The `sql` variable is pointless: the `table` option is not a SQL statement, so you need to create a whole DataFrame from `TABLE`. You then use Spark's DataFrame API to select columns `COL1` and `COL2` and to filter for `COL3 = 5`.
You can see in the documentation examples that `TABLE1` is created first, then used in the options, and then (in the Scala examples) everything else is DataFrame operations.
In your case, once you load the table correctly (without the `sql` variable), you would have:
df3 = df2.select('COL1', 'COL2').where('COL3 = 5')
Or, if you're looking for how Spark's SQL APIs work outside of Phoenix, see "Running SQL Queries Programmatically" in the Spark documentation: you run raw queries against the DataFrame after it is loaded; you do not pass the query into the construction of the DataFrame.
df = sqlContext.read\
.format("org.apache.phoenix.spark")\
.option("table", "TABLE")\
.option("zkUrl", "<HOSTNAME>:<PORT>")\
.load()
df.registerTempTable("TABLE")  # register the DataFrame so sqlContext.sql can find it by name; on Spark 2.x use df.createOrReplaceTempView("TABLE")
sqlDF = sqlContext.sql("SELECT COL1, COL2 FROM TABLE WHERE COL3 = 5")
sqlDF.show()