
Permission error in HDFS when using pure python external library in AWS Glue

I tried to run a customized Python script that imports an external pure-Python library (psycopg2) on AWS Glue, but it failed. I checked the CloudWatch log and found the reason for the failure:

Spark failed the permission check on several folders in HDFS; one of them contains the external Python library I uploaded to S3 (s3://path/to/psycopg2), which requires the execute (-x) permission:

org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=READ_EXECUTE, inode="/user/root/.sparkStaging/application_1507598924170_0002/psycopg2":root:hadoop:drw-r--r--
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4486)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:999)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:634)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

I made sure that the library contains only .py files, as instructed in the AWS documentation.

Does anyone know what went wrong?

Many thanks!

You have a directory that doesn't have execute permission. On a Unix-based OS, directories must have the execute bit set (at least for the owning user) to be usable.
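The mode string at the end of the exception (`drw-r--r--`) shows the problem directly; as an aside (not part of the original answer), a minimal sketch that decodes such a string and reports the missing execute bits:

```python
# Decode a 10-character Unix/HDFS mode string such as the "drw-r--r--"
# from the exception above into its type and permission triads.

def decode_mode(mode: str) -> dict:
    """Split a mode string into file type plus owner/group/other execute flags."""
    assert len(mode) == 10, "expected a 10-character mode string like drw-r--r--"
    file_type, owner, group, other = mode[0], mode[1:4], mode[4:7], mode[7:10]
    return {
        "is_directory": file_type == "d",
        "owner_can_execute": owner[2] == "x",
        "group_can_execute": group[2] == "x",
        "other_can_execute": other[2] == "x",
    }

# The directory from the stack trace: no class has the execute bit,
# so HDFS denies the READ_EXECUTE access that listing it requires.
print(decode_mode("drw-r--r--"))
```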

Run something like

hdfs dfs -chmod +x /user/root/.sparkStaging/application_1507598924170_0002/psycopg2

(the path is in HDFS, so use hdfs dfs -chmod rather than the local chmod command) and try it again.

Glue supports only pure-Python libraries, i.e., libraries without any native bindings.

The package psycopg2 is not pure Python, so it will not work with Glue. From its setup.py:

If you prefer to avoid building psycopg2 from source, please install the PyPI 'psycopg2-binary' package instead.

From the AWS Glue documentation:

You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.
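One way to catch this before uploading a library to Glue is to scan its directory tree for compiled extension files; a minimal sketch (the suffix list and the check are my own, not from the AWS documentation):

```python
import pathlib

# File suffixes that indicate compiled, non-pure-Python components.
NATIVE_SUFFIXES = {".so", ".pyd", ".dylib", ".dll"}

def is_pure_python(package_dir: str) -> bool:
    """Return True if the package directory contains no compiled extensions."""
    root = pathlib.Path(package_dir)
    return not any(
        p.suffix in NATIVE_SUFFIXES for p in root.rglob("*") if p.is_file()
    )

# An installed psycopg2 ships a compiled _psycopg extension (a .so/.pyd
# file), so this check would flag it as unsuitable for Glue.
```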
