AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.

There are a few custom fields and data points I want to bring over to enrich the data further. One of these is an additional answer / body field that contains key information for searching.

This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed in a String data field within Kendra.

I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.

Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html

Current Setup:

I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.

If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, hence why I am trying to set up the enrichment.

[Screenshot: enrichment configuration on the data source]

I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction step, with no additional filtering, so it runs on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be required, so I have added that as well.

For my lambda, I created the following:

exports.handler = async (event) => {

    // Debug
    console.log(JSON.stringify(event))
    
    // Vars
    const s3Bucket = event.s3Bucket;
    const s3ObjectKey = event.s3ObjectKey;
    const meta = event.metadata;
    
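    // Answer: may be undefined if the attribute is missing from this document
    const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');

    // Remove HTML tags; always return a string so truncate() can rely on .length
    const removeTags = (str) => {
        if (str === null || str === undefined || str === '')
            return '';
        return str.toString().replace(/<[^>]+>/g, '');
    }

    // Truncate to 2,000 characters plus an ellipsis (2,003 total, under the 2,048 limit)
    const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
    const result = truncate(removeTags(answer ? answer.value.stringValue : ''));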
    
    // Response
    const response = {
        "version" : "v0",
        "s3ObjectKey": s3ObjectKey,
        "metadataUpdates": [
            {"name":"sf_answer_preview", "value":{"stringValue":result}}
        ]
    }
    
    // Debug
    console.log(response)

    // Response
    return response
};

Based on the contract for the lambda described here, it appears pretty straightforward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce), then strip the HTML and truncate the value to 2,000 characters.

For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.

When I log the data in the lambda, the pre-extraction event details are as follows:

{
    "s3Bucket": "kendrasfdev",
    "s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
    "metadata": {
        "attributes": [
            {
                "name": "_document_title",
                "value": {
                    "stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
                }
            },
            {
                "name": "sf_answer_preview",
                "value": {
                    "stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available <a href=\"https://cls.asu.edu/exploratory-health-and-life-sciences\" target=\"_blank\">online</a>.  This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the <a href=\"https://cls.asu.edu/exploratory\" target=\"_blank\">Exploratory program</a> description. "
                }
            },
            {
                "name": "_data_source_sync_job_execution_id",
                "value": {
                    "stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
                }
            }
        ]
    }
}
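The strip-and-truncate step can be checked outside of Kendra by replaying an event of the same shape locally. The following is a minimal sketch, not the deployed lambda: the helper names stripTags and truncate, the placeholder object key, and the padded sample value are assumptions made for illustration. Only the event layout (metadata.attributes with name and value.stringValue) comes from the logged event above.

```javascript
// Hypothetical helpers mirroring the logic in the lambda.
const stripTags = (str) => (str == null ? '' : String(str).replace(/<[^>]+>/g, ''));
const truncate = (input, max = 2000) =>
    input.length > max ? `${input.substring(0, max)}...` : input;

// Trimmed stand-in for the logged pre-extraction event.
// The key and the padded value are placeholders, not real data.
const sampleEvent = {
    s3Bucket: 'kendrasfdev',
    s3ObjectKey: 'pre-extraction/example-key',
    metadata: {
        attributes: [
            {
                name: 'sf_answer_preview',
                value: {
                    stringValue:
                        'See the list <a href="https://cls.asu.edu/exploratory" target="_blank">online</a>. ' +
                        'x'.repeat(3000)
                }
            }
        ]
    }
};

// Look up the attribute defensively; it may be absent on some documents.
const attribute = sampleEvent.metadata.attributes.find((o) => o.name === 'sf_answer_preview');
const cleaned = truncate(stripTags(attribute ? attribute.value.stringValue : ''));

console.log(cleaned.length);           // 2003 (2000 characters plus "...")
console.log(cleaned.substring(0, 21)); // "See the list online. "
```

A check like this only confirms that the transformation itself is sound; it says nothing about whether Kendra invokes the lambda before or after it validates the field length.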

The Problem:

When this runs, I still get the same field limit error saying that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I suspect that the response from the lambda is not setting the field value to the new content correctly, so Kendra is still trying to use the data directly from Salesforce and throws the error.

Has anyone set up lambdas for Kendra before who might know what I am doing wrong? Stripping PII before it gets indexed seems like a common use case for this, so I must be slightly off in my setup somewhere.

Any thoughts?

Since you are still passing the rich text as a metadata field of the document, the character limit still applies, so the document fails at the validation step of the API call and never reaches the enrichment step. A workaround is to append those rich text fields to the body of the document so that your lambda can access them there. But if those fields are auto-generated for your documents by your data source, that might not be easy.
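One possible reading of that workaround, as a rough sketch only: leave the long rich text out of the metadata mapping, let it arrive as part of the document body, and have the lambda derive a short preview attribute from the body text. The function name buildPreviewUpdate and the placeholder key are hypothetical. So is the assumption that the lambda already has the body as a string, for example after reading the object at s3Bucket / s3ObjectKey. Only the response keys (version, s3ObjectKey, metadataUpdates) are taken from the question.

```javascript
// Stay below the 2,048 character limit mentioned in the question.
const MAX_PREVIEW = 2000;

// Hypothetical helper: builds a response in the shape used in the question,
// deriving the preview from body text instead of from an oversized metadata field.
function buildPreviewUpdate(s3ObjectKey, bodyText) {
    const plain = (bodyText == null ? '' : String(bodyText))
        .replace(/<[^>]+>/g, '') // drop HTML tags
        .replace(/\s+/g, ' ')    // collapse runs of whitespace
        .trim();
    const preview = plain.length > MAX_PREVIEW
        ? `${plain.substring(0, MAX_PREVIEW)}...`
        : plain;
    return {
        version: 'v0',
        s3ObjectKey,
        metadataUpdates: [
            { name: 'sf_answer_preview', value: { stringValue: preview } }
        ]
    };
}

// Example with a placeholder key and synthetic body text.
const response = buildPreviewUpdate(
    'pre-extraction/example-key',
    '<p>Hello <b>world</b></p>' + ' y'.repeat(2000)
);
console.log(response.metadataUpdates[0].value.stringValue.length); // 2003
```

Whether the Salesforce connector can be configured to place that field in the document body is a separate question that this sketch does not address.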