
Can NLTK Data be installed in an AWS Redshift environment?

I am trying to create a Python scalar user-defined function (UDF) in an AWS Redshift database. The UDF wraps the following Python code:

CREATE or replace library nltk language plpythonu from 's3://xxx/dev/python-libraries/nltk-3.2.1.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';

CREATE or replace library textblob language plpythonu from 's3://xxx/dev/python-libraries/textblob-0.15.1-py2.py3-none-any.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';

CREATE or replace FUNCTION f_sentiment_polarity (comment varchar(1000)) RETURNS float IMMUTABLE as $$
from textblob import TextBlob
return TextBlob(comment).sentiment.polarity
$$ LANGUAGE plpythonu;

SELECT f_sentiment_polarity('this would be very useful if the corpora were loaded');

f_sentiment_polarity
--------------------
                   0

The SELECT statement returns 0.

When I run the same Python code in a local environment (Python 2.7 on Windows with NLTK v3.2.5), I get 0.39:

Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from textblob import TextBlob
>>> TextBlob('this would be very useful if the corpora were loaded').sentiment.polarity
0.39
>>>

I presume this is because the various NLTK corpora have not been loaded in the AWS Redshift Python environment. Creating another Redshift UDF as follows seems to bear this out:

CREATE or replace FUNCTION f_num_brown_words () RETURNS int IMMUTABLE as $$
from nltk.corpus import brown
return len(brown.words())
$$ LANGUAGE plpythonu;

select f_num_brown_words();

ERROR: XX000: LookupError: 
**********************************************************************
  Resource u'corpora/brown' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - "'/'/nltk_data"
    - '/usr/shar
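
To make the error above easier to read: NLTK looks for a resource such as corpora/brown under each directory in its search list (nltk.data.path) and raises LookupError when none of them contains it. The following is a simplified, standard-library-only model of that behaviour, not NLTK's own code; find_resource and demo are hypothetical names, and the real lookup handles more cases than this sketch does.

```python
import os
import tempfile


def find_resource(resource_name, search_dirs):
    """Return the first existing path for resource_name, or raise LookupError.

    Simplified stand-in for the directory search NLTK performs;
    this is an illustration, not NLTK's implementation.
    """
    for base in search_dirs:
        candidate = os.path.join(base, *resource_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(
        "Resource %r not found. Searched in: %s" % (resource_name, search_dirs)
    )


def demo():
    """Build one empty and one populated data directory and query both."""
    empty_dir = tempfile.mkdtemp()
    data_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(data_dir, "corpora", "brown"))

    # Found: the second directory contains corpora/brown.
    found = find_resource("corpora/brown", [empty_dir, data_dir])

    # Not found: only the empty directory is searched.
    try:
        find_resource("corpora/brown", [empty_dir])
        missing_raises = False
    except LookupError:
        missing_raises = True
    return found, data_dir, missing_raises
```

Under this model, the UDF fails because none of the directories searched on the Redshift node contains corpora/brown.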

Question: Is there a way to load the NLTK corpora in the AWS Redshift Python environment so that my UDF works correctly?

You can load custom libraries onto your cluster; see the official docs for more information.

I followed the instructions, and they worked for me with a different library.
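
Custom libraries are installed with CREATE LIBRARY from a .zip archive, as the question already does for nltk and textblob. A minimal sketch of building such an archive with the package folder at the archive root follows; zip_package, build_demo_archive, and the mypkg names are hypothetical. The sketch only shows the packaging step and does not establish whether NLTK can locate corpora bundled this way inside Redshift.

```python
import os
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so that the package folder sits at the archive root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "mypkg/__init__.py".
                archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def build_demo_archive():
    """Create a throwaway package with one data file and zip it."""
    work_dir = tempfile.mkdtemp()
    package_dir = os.path.join(work_dir, "mypkg")  # hypothetical package name
    os.makedirs(os.path.join(package_dir, "data"))
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write("")
    with open(os.path.join(package_dir, "data", "words.txt"), "w") as handle:
        handle.write("example\n")

    out_dir = tempfile.mkdtemp()
    zip_path = zip_package(package_dir, os.path.join(out_dir, "mypkg.zip"))
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())
```

The resulting file would then be uploaded to S3 and referenced in a CREATE LIBRARY statement like the ones in the question.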
