简体   繁体   English

AWS EMR Spark Glue PySpark -

[英]AWS EMR Spark Glue PySpark -

I am trying to read an AWS Glue table into pyspark.我正在尝试将 AWS Glue 表读入 pyspark。 I get a NullPointerException:我得到一个空指针异常:

spark.sql("show tables").show()
+----------------+-----------------+-----------+
|        database|        tableName|isTemporary|
+----------------+-----------------+-----------+
|test_datalake_db|events2_2017_test|      false|
|test_datalake_db|      events2_old|      false|
+----------------+-----------------+-----------+

Next, I tried selecting something from the table:接下来,我尝试从表中选择一些内容:

df = spark.sql("select * from events2_2017_test") df = spark.sql("select * from events2_2017_test")

However, things got messy:然而,事情变得一团糟:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.NullPointerException: Name is null
    at java.lang.Enum.valueOf(Enum.java:236)

Also fails with something like:也失败了,比如:

myDf = spark.table("test_datalake_db.events2_2017_test") myDf = spark.table("test_datalake_db.events2_2017_test")

Here is the table Schema:这是表架构:

events2_2017_test 架构

The answer is you need to define the TableType as EXTERNAL_TABLE when creating the table:答案是您需要在创建表时将 TableType 定义为 EXTERNAL_TABLE:

def create_table():
    response = client.create_table(
        DatabaseName='xxxx-datalake',
        TableInput={
            'Name': 'events2_2017_sample',
            'Description': 'testing the events2 old data in s3',
            'Owner': 'drew',
            'Parameters': {
                "classification": "csv"
            },
            'TableType': 'EXTERNAL_TABLE',
            'StorageDescriptor': {
                'Columns': [
                    {'Name': 'sys_vortex_id', 'Type': 'string'},
                    {'Name': 'sys_app_id', 'Type': 'string'},
                    {'Name': 'sys_pq_id', 'Type': 'string'},
                    {'Name': 'sys_ip_address', 'Type': 'string'},
                    {'Name': 'sys_submitted_at', 'Type': 'string'},
                    {'Name': 'sys_received_at', 'Type': 'string'},
                    {'Name': 'device_id_type', 'Type': 'string'},
                    {'Name': 'device_id', 'Type': 'string'},
                    {'Name': 'timezone', 'Type': 'string'},
                    {'Name': 'online', 'Type': 'string'},
                    {'Name': 'app_version', 'Type': 'string'},
                    {'Name': 'device_days', 'Type': 'string'},
                    {'Name': 'device_sessions', 'Type': 'string'},
                    {'Name': 'event_id', 'Type': 'string'}
                ],
                'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
                'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
                'SerdeInfo': {
                    'Name': 'noidea',
                    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',
                    'Parameters': {
                        'separatorChar': '|'
                    }
                },
                'Location': 's3://xxxx-datalake-us-east-1/prd/101/27a89ba8-774a-4bc3-68ae/laboratory/events2_2017_test/',
                'Compressed': True,
                'StoredAsSubDirectories': False,
            }
        }
    )

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM