Unable to access shared Glue table from Jupyter notebook
I am using a Jupyter notebook to read data from a shared Glue table (governed by Lake Formation), via the awswrangler
library. I can read a test table from the sample database without issues. Note that the local_db
database mentioned below was created locally in the same AWS account where I run this query. The lf_shared_table
table is a resource link to the shared table.
import awswrangler as wr
df = wr.athena.read_sql_query(sql="SELECT * FROM lf_shared_table", database="local_db")
Error:
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/_config.py in wrapper(*args_raw, **kwargs)
437 del args[name]
438 args = {**args, **keywords}
--> 439 return function(**args)
440
441 wrapper.__doc__ = _inject_config_doc(doc=function.__doc__, available_configs=available_configs)
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/athena/_read.py in read_sql_query(sql, database, ctas_approach, categories, chunksize, s3_output, workgroup, encryption, kms_key, keep_files, ctas_database_name, ctas_temp_table_name, ctas_bucketing_info, use_threads, boto3_session, max_cache_seconds, max_cache_query_inspections, max_remote_cache_entries, max_local_cache_entries, data_source, params, s3_additional_kwargs, pyarrow_additional_kwargs)
878 s3_additional_kwargs=s3_additional_kwargs,
879 boto3_session=session,
--> 880 pyarrow_additional_kwargs=pyarrow_additional_kwargs,
881 )
882
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/athena/_read.py in _resolve_query_without_cache(sql, database, data_source, ctas_approach, categories, chunksize, s3_output, workgroup, encryption, kms_key, keep_files, ctas_database_name, ctas_temp_table_name, ctas_bucketing_info, use_threads, s3_additional_kwargs, boto3_session, pyarrow_additional_kwargs)
593 use_threads=use_threads,
594 s3_additional_kwargs=s3_additional_kwargs,
--> 595 boto3_session=boto3_session,
596 )
597
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/athena/_read.py in _resolve_query_without_cache_regular(sql, database, data_source, s3_output, keep_files, chunksize, categories, encryption, workgroup, kms_key, wg_config, use_threads, s3_additional_kwargs, boto3_session)
508 boto3_session=boto3_session,
509 categories=categories,
--> 510 metadata_cache_manager=_cache_manager,
511 )
512 return _fetch_csv_result(
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/athena/_utils.py in _get_query_metadata(query_execution_id, boto3_session, categories, query_execution_payload, metadata_cache_manager)
259 _query_execution_payload: Dict[str, Any] = query_execution_payload
260 else:
--> 261 _query_execution_payload = wait_query(query_execution_id=query_execution_id, boto3_session=boto3_session)
262 cols_types: Dict[str, str] = get_query_columns_types(
263 query_execution_id=query_execution_id, boto3_session=boto3_session
~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/athena/_utils.py in wait_query(query_execution_id, boto3_session)
795 _logger.debug("StateChangeReason: %s", response["Status"].get("StateChangeReason"))
796 if state == "FAILED":
--> 797 raise exceptions.QueryFailed(response["Status"].get("StateChangeReason"))
798 if state == "CANCELLED":
799 raise exceptions.QueryCancelled(response["Status"].get("StateChangeReason"))
QueryFailed: HIVE_METASTORE_ERROR: Required Table Storage Descriptor is not populated. (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
Any suggestions would be appreciated.
I ran into the same error when querying a shared table. We were missing an AWS Lake Formation permission: adding lakeformation:GetDataAccess
to the IAM role (in addition to the Glue and S3 permissions) fixed the problem. The policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Lakeformation",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "lakeformation:GrantPermissions"
            ],
            "Resource": "*"
        }
    ]
}
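The policy above can also be attached to the role programmatically. A minimal sketch using boto3 (the role name and inline policy name below are placeholders, not from the original answer):

```python
import json


def build_lakeformation_policy() -> dict:
    # Reproduces the IAM policy document from the answer above.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Lakeformation",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "lakeformation:GrantPermissions",
                ],
                "Resource": "*",
            }
        ],
    }


def attach_policy(role_name: str) -> None:
    # Attaches the statement as an inline policy on the given role.
    # Requires AWS credentials; boto3 is imported lazily so
    # build_lakeformation_policy() stays usable without it.
    import boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="LakeFormationDataAccess",
        PolicyDocument=json.dumps(build_lakeformation_policy()),
    )
```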
I am not sure why this is necessary, but I believe access to the Glue Data Catalog (the Hive metastore) is mediated by Lake Formation. When the Athena connector cannot access the table metadata, it throws this somewhat misleading error. I spent far too long chasing the problem in the Glue permissions.
More background can be found in the AWS Lake Formation data access documentation.
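To check what Lake Formation has actually granted on the table, you can list its permissions with boto3's Lake Formation ListPermissions API. A minimal sketch (helper names here are hypothetical):

```python
def build_table_resource(database: str, table: str) -> dict:
    # Builds the Resource argument for Lake Formation's ListPermissions API.
    return {"Table": {"DatabaseName": database, "Name": table}}


def list_table_permissions(database: str, table: str) -> list:
    # Returns (principal, permissions) pairs granted on the table.
    # Requires AWS credentials; boto3 is imported lazily so
    # build_table_resource() stays usable without it.
    import boto3

    lf = boto3.client("lakeformation")
    resp = lf.list_permissions(
        Resource=build_table_resource(database, table)
    )
    return [
        (p["Principal"]["DataLakePrincipalIdentifier"], p["Permissions"])
        for p in resp.get("PrincipalResourcePermissions", [])
    ]
```

If the querying role does not appear in the output for the resource link's target table, the HIVE_METASTORE_ERROR above is the likely symptom.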