
How to clear all data from AWS CloudSearch?

I have an AWS CloudSearch instance that I am still developing.

At times, such as when I modify the format of a field, I find myself wanting to wipe out all of the data and regenerate it.

Is there any way to clear out all of the data using the console, or do I have to go about it by programmatic means?

If I do have to use programmatic means (i.e. generate and POST a bunch of "delete" SDF files), is there any good way to query for all documents in a CloudSearch instance?

I guess I could just delete and re-create the instance, but that takes a while and loses all of the indexes/rank expressions/text options/etc.

Using the aws CLI and jq from the command line (tested with bash on Mac):

export CS_DOMAIN=https://yoursearchdomain.yourregion.cloudsearch.amazonaws.com

# Get ids of all existing documents, reformat as
# [{ type: "delete", id: "ID" }, ...] using jq
aws cloudsearchdomain search \
  --endpoint-url=$CS_DOMAIN \
  --size=10000 \
  --query-parser=structured \
  --search-query="matchall" \
  | jq '[.hits.hit[] | {type: "delete", id: .id}]' \
  > delete-all.json

# Delete the documents
aws cloudsearchdomain upload-documents \
  --endpoint-url=$CS_DOMAIN \
  --content-type='application/json' \
  --documents=delete-all.json

For more info on jq, see Reshaping JSON with jq.

Update Feb 22, 2017

Added --size to get the maximum number of documents (10,000) at a time. You may need to repeat this script multiple times. Also, --search-query can take something more specific, if you want to be selective about the documents getting deleted.
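The jq filter above can be tried locally against a canned response before pointing it at a real domain; the sample hits below are made up, but they mimic the shape of an `aws cloudsearchdomain search` response:

```shell
# Simulate a search response and reshape it into a delete batch
# with the same jq filter used against the real domain.
echo '{"hits":{"found":2,"hit":[{"id":"doc1"},{"id":"doc2"}]}}' \
  | jq -c '[.hits.hit[] | {type: "delete", id: .id}]'
# → [{"type":"delete","id":"doc1"},{"type":"delete","id":"doc2"}]
```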

The best answer I've been able to find was somewhat buried in the AWS docs. To wit:

Amazon CloudSearch currently does not provide a mechanism for deleting all of the documents in a domain. However, you can clone the domain configuration to start over with an empty domain. For more information, see Cloning an Existing Domain's Indexing Options.

Source: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/Troubleshooting.html#ts.cleardomain

On my side, I used a local Node.js script like this:

var AWS = require('aws-sdk');

AWS.config.update({
    accessKeyId: '<your AccessKey>', 
    secretAccessKey: '<Your secretAccessKey>',
    region: '<your region>',
    endpoint: '<your CloudSearch endpoint>'
});

var params = {
    query: "(or facet.FIELD:'<one facet value>' facet.FIELD:'<one facet value>')",
    queryParser: 'structured'
};


var cloudsearchdomain = new AWS.CloudSearchDomain();
var fs = require('fs');
cloudsearchdomain.search(params, function(err, data) {
    if (err) {
        console.log("Failed");
        console.log(err);
        return;
    }
    var result = [];
    for (var i = 0; i < data.hits.hit.length; i++) {
        result.push({"type": "delete", "id": data.hits.hit[i].id});
    }
    fs.writeFile("delete.json", JSON.stringify(result), function(err) {
        if (err) { return console.log(err); }
        console.log("The file was saved!");
    });
});

You have to know all the values of at least one facet to be able to request all IDs. In my code, I put just two (in the (or ...) section), but you can have more.

Once that is done, you have a delete.json file to be used with the AWS CLI via this command:

aws cloudsearchdomain upload-documents --documents delete.json --content-type application/json --endpoint-url <your CloudSearch endpoint>

... and that did the job for me!


I've been doing the following, using the Python adapter, boto, to empty CloudSearch. It's not pretty, but it gets the job done. The hard part is keeping the amount you fetch within CloudSearch's 5 MB limitation.

    count = CloudSearchAdaptor.Instance().get_total_documents()
    while count > 0:
        results = CloudSearchAdaptor.Instance().search("lolzcat|-lolzcat", 'simple', 1000)
        for doc in results.docs:
            CloudSearchAdaptor.Instance().delete(doc['id'])

        CloudSearchAdaptor.Instance().commit()
        # add delay here if CloudSearch takes too long to propagate the deletes
        count = CloudSearchAdaptor.Instance().get_total_documents()
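To stay within the 5 MB limit mentioned above, another option is to split the delete operations into size-bounded batches before uploading. A minimal Python sketch; the function name and size accounting are my own, not part of boto:

```python
import json

def chunk_delete_batch(ids, max_bytes=5 * 1024 * 1024):
    """Split delete operations into JSON batches no larger than max_bytes."""
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for doc_id in ids:
        op = {"type": "delete", "id": doc_id}
        op_size = len(json.dumps(op)) + 1  # +1 for a separating comma
        if current and size + op_size > max_bytes:
            chunks.append(current)
            current, size = [], 2
        current.append(op)
        size += op_size
    if current:
        chunks.append(current)
    return chunks

# A tiny limit forces one operation per batch:
batches = chunk_delete_batch(["a", "b", "c"], max_bytes=40)
# → [[{'type': 'delete', 'id': 'a'}], [{'type': 'delete', 'id': 'b'}], [{'type': 'delete', 'id': 'c'}]]
```

Each chunk can then be serialized with json.dumps and uploaded as its own document batch.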

The CloudSearch adapter class looks something like the following:

from boto.cloudsearch2.layer2 import Layer2
from singleton import Singleton

@Singleton
class CloudSearchAdaptor:

    def __init__(self):
        layer2 = Layer2(
            aws_access_key_id='AWS_ACCESS_KEY_ID',
            aws_secret_access_key='AWS_SECRET_ACCESS_KEY',
            region='AWS_REGION'
        )
        self.domain = layer2.lookup('AWS_DOMAIN')
        self.doc_service = self.domain.get_document_service()
        self.search_service = self.domain.get_search_service()

    @staticmethod
    def delete(id):
        instance = CloudSearchAdaptor.Instance()
        try:
            response = instance.doc_service.delete(id)
        except Exception as e:
            print 'Error deleting from CloudSearch'

    @staticmethod
    def search(query, parser='structured', size=1000):
        instance = CloudSearchAdaptor.Instance()
        try:
            results = instance.search_service.search(q=query, parser=parser, size=size)
            return results
        except Exception as e:
            print 'Error searching CloudSearch'

    @staticmethod
    def get_total_documents():
        instance = CloudSearchAdaptor.Instance()
        try:
            results = instance.search_service.search(q='matchall', parser='structured', size=0)
            return results.hits
        except Exception as e:
            print 'Error getting total documents from CloudSearch'

    @staticmethod
    def commit():
        try:
            response = CloudSearchAdaptor.Instance().doc_service.commit()
            CloudSearchAdaptor.Instance().doc_service.clear_sdf()
        except Exception as e:
            print 'Error committing to CloudSearch'

You can manually upload a document batch directly in AWS CloudSearch, via Dashboard > Upload Documents. If you can enumerate all the document IDs you want to delete, you can create a script to generate the document batch, or generate it manually.

The document batch format should look like this:

sample.json

[
    {
        "type": "delete",
        "id": "1"
    },
    {
        "type": "delete",
        "id": "2"
    }
]

To enumerate all document IDs, run a test search:

  • Search: id:* (or any field you are sure is present in all documents)
  • Query Parser: Lucene
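Once you have the IDs, generating the batch file can be scripted; a short Python sketch, where the filename and ID list are placeholders:

```python
import json

def write_delete_batch(ids, path="sample.json"):
    """Write a CloudSearch delete batch file for the given document IDs."""
    batch = [{"type": "delete", "id": str(doc_id)} for doc_id in ids]
    with open(path, "w") as f:
        json.dump(batch, f, indent=4)
    return batch

write_delete_batch(["1", "2"])  # produces a sample.json like the one above
```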

In PHP, I managed to create a script for cleaning all records using the AWS PHP SDK:

clean.php - http://pastebin.com/Lkyk1D1i
config.php - http://pastebin.com/kFkZhxCc

You'll need to configure your keys in config.php and your endpoints in clean.php, download the AWS PHP SDK, and you're good to go!

Note that it will only clean 10,000 documents at most, as Amazon imposes a limit.

I've managed to create a PowerShell script for it. Check my website here: http://www.mpustelak.com/2017/01/aws-cloudsearch-clear-domain-using-powershell/

Script:

$searchUrl = '[SEARCH_URL]'
$documentUrl = '[DOCUMENT_URL]'
$parser = 'Lucene'
$querySize = 500

function Get-DomainHits()
{
    (Search-CSDDocuments -ServiceUrl $searchUrl -Query "*:*" -QueryParser $parser -Size $querySize).Hits;
}

function Get-TotalDocuments()
{
    (Get-DomainHits).Found
}

function Delete-Documents()
{
    (Get-DomainHits).Hit | ForEach-Object -begin { $batch = '[' } -process { $batch += '{"type":"delete","id":"' + $_.id + '"},' } -end { $batch = $batch.Remove($batch.Length - 1, 1); $batch += ']' }

    Try
    {
        Invoke-WebRequest -Uri $documentUrl -Method POST -Body $batch -ContentType 'application/json'
    }
    Catch
    {
        $_.Exception
        $_.Exception.Message
    }
}

$total = Get-TotalDocuments
while($total -ne 0)
{
    Delete-Documents

    $total = Get-TotalDocuments

    Write-Host 'Documents left:'$total
    # Sleep for 1 second to give CS time to delete documents
    sleep 1
}

A Java version to clear all data within a CloudSearch domain:

private static final AmazonCloudSearchDomain cloudSearch = Region
        .getRegion(Regions.fromName(CommonConfiguration.REGION_NAME))
        .createClient(AmazonCloudSearchDomainClient.class, null, null)
        .withEndpoint(CommonConfiguration.SEARCH_DOMAIN_DOCUMENT_ENDPOINT);

public static void main(String[] args) {

    // retrieve all documents from cloud search
    SearchRequest searchRequest = new SearchRequest().withQuery("matchall").withQueryParser(QueryParser.Structured);
    Hits hits = cloudSearch.search(searchRequest).getHits();

    if (hits.getFound() != 0) {
        StringBuffer sb = new StringBuffer();
        sb.append("[");

        int i = 1;
        // construct JSON to delete all
        for (Hit hit : hits.getHit()) {
            sb.append("{\"type\": \"delete\",  \"id\": \"").append(hit.getId()).append("\"}");
            if (i < hits.getHit().size()) {
                sb.append(",");
            }
            i++;
        }

        sb.append("]");

        // send to cloud search
        InputStream documents = IOUtils.toInputStream(sb.toString());
        UploadDocumentsRequest uploadDocumentsRequest = new UploadDocumentsRequest()
                .withContentType("application/json").withDocuments(documents).withContentLength((long) sb.length());
        cloudSearch.uploadDocuments(uploadDocumentsRequest);
    }
}
