
How Do You Filter Documents by Size Before Sending to AWS Comprehend via boto3?

I'm currently attempting to use the boto3 library to perform batch sentiment analysis on a collection of documents with AWS' Comprehend service. The service has some limitations on document size (documents cannot exceed 5000 bytes); therefore, I'm attempting to pre-filter documents before calling the boto3 API. See the code snippet below:

...
batch = []
for doc in docs:
    if isinstance(doc, str) and len(doc) > 0 and sys.getsizeof(doc) < 5000:
        batch.append(doc)

data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
...

My assumption was that filtering documents with sys.getsizeof would weed out any strings exceeding the service's 5000-byte limit. However, even with the filtering in place, I'm still receiving the following exception:

botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the BatchDetectSentiment operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 5523 bytes

Is there a more effective way to calculate the size of a document sent to Comprehend, so as to avoid hitting the maximum document size limit?
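The gap between what sys.getsizeof reports and the number of bytes actually transmitted is easy to reproduce; the sample string below is purely illustrative, and the in-memory figure assumes CPython's compact string representation:

```python
import sys

# 2000 CJK characters: CPython stores this string at two bytes per
# character plus a small fixed overhead, so sys.getsizeof stays below
# 5000 bytes -- but UTF-8 encodes each character as three bytes,
# i.e. 6000 bytes on the wire, over Comprehend's limit.
doc = '汉' * 2000

print(sys.getsizeof(doc) < 5000)    # True on CPython 3.3+
print(len(doc.encode('utf-8')))     # 6000
```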

There are two approaches here:

  1. As Daniel mentioned, you can use len(doc.encode('utf-8')) to determine the final size of the string, since it accounts for the encoding rather than just how much memory the Python string object occupies.

  2. You can handle the exception whenever it occurs, like so:

try:
    data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
except self.client.exceptions.TextSizeLimitExceededException:
    print('The batch was too long')
else:
    print(data)
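Putting both approaches together, here is a minimal sketch of a corrected filter; the helper name and constant are assumptions for illustration, not part of the question's code:

```python
MAX_BYTES = 5000  # Comprehend's documented per-document limit

def filter_batch(docs, max_bytes=MAX_BYTES):
    """Keep only non-empty strings whose UTF-8 encoding fits the limit."""
    return [d for d in docs
            if isinstance(d, str) and 0 < len(d.encode('utf-8')) <= max_bytes]

# Oversized, empty, or non-string entries are dropped before the API call.
batch = filter_batch(['hello', '汉' * 2000, '', 42])
print(batch)  # ['hello']
```

The batch_detect_sentiment call can then still be wrapped in the try/except above as a safety net for anything the filter misses.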
