Create indexer with data source as a field in JSON document inside Index

I have an index in Azure Search Service whose documents are in JSON format.

Index Schema

{
"name": "product-api",
"defaultScoringProfile": null,
"fields": [
    {
        "name": "upcid",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": true,
        "facetable": false,
        "key": true,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "productName",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "imageUrl",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "ocrText",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    }
],
"scoringProfiles": [],
"corsOptions": {
    "allowedOrigins": [
        "*"
    ],
    "maxAgeInSeconds": null
},
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
    "@odata.type": "#Microsoft.Azure.Search.ClassicSimilarity"
}
}
  • My requirement

Create an indexer that uses the imageUrl field as its data source (the image is not stored in an Azure storage service), applies Microsoft.Skills.Vision.OcrSkill as a skill, and maps the output to the ocrText field.

  • Problem

From what I have read in the docs, the data source (in my case, the image) must be in Azure Blob Storage in order to create an indexer.

Has anyone done something similar? Or does anyone know of a direct or indirect way to achieve this?

Any leads would be appreciated; I could not find anything related to this on the Internet.

How did you populate the imageUrl data in the search index to begin with?

I'm asking because there's no way to configure an indexer to ingest data from a search index as its data source. If you are able to put those image URLs somewhere else (e.g., blob storage), you could point an indexer at that. If it's another indexer that's populating the source index to begin with, you can add a knowledge store to that primary indexer to sink the imageUrl data to blob/table storage as well as the search index. Or just process the URL in the primary indexer's skillset and don't bother with this secondary pass!

The next issue is that indexers won't crawl arbitrary URLs that you provide. An indexer only ingests data from the data source, or data returned to it by a skill. It is possible to write a custom Web API skill that takes the URL as input, downloads the image from that URL, and responds to the indexer with the binary image data. This functionality is not very well documented, but there is an example power skill that does something along those lines that you could more or less copy.
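
As a rough illustration of the custom skill described above, here is a minimal Python sketch of the request handling only. It assumes the standard custom-skill envelope (a "values" array of records, each with a "recordId" and "data") and assumes that an image is handed back to the indexer as a file reference of the form {"$type": "file", "data": "<base64>"}. The input name imageUrl, the output name image, and the helper names are hypothetical, and the web hosting (e.g., an Azure Function) is left out.

```python
import base64
import urllib.request
from typing import Callable


def download(url: str) -> bytes:
    """Fetch the raw bytes behind a URL (no retries or size limits here)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def handle_skill_request(body: dict, fetch: Callable[[str], bytes] = download) -> dict:
    """Answer one custom-skill call: one output record per input record."""
    results = []
    for record in body.get("values", []):
        out = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        url = record.get("data", {}).get("imageUrl")  # hypothetical input name
        if not url:
            out["warnings"].append({"message": "imageUrl is missing or empty"})
        else:
            try:
                encoded = base64.b64encode(fetch(url)).decode("ascii")
                # Assumed file-reference shape for returning an image to the indexer.
                out["data"]["image"] = {"$type": "file", "data": encoded}
            except Exception as exc:
                # Report the failure on this record only; other records still succeed.
                out["errors"].append({"message": f"download failed: {exc}"})
        results.append(out)
    return {"values": results}
```

Passing the fetch function as a parameter keeps the handler testable without network access; a real skill would also want limits on image size and on which hosts it is willing to contact.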

The rest of this secondary indexer's pipeline should be pretty straightforward (add an OCR skill, and output field mappings to merge the data back into the same index). The indexer won't overwrite existing values with nulls, so just make sure to map only the one new field back to the index, and the rest of the index's data will remain unchanged.
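
The pipeline described above might be sketched as the following skillset and indexer definitions, written here as Python dictionaries that serialize to the JSON bodies sent to the service. The skill URI, the skillset, indexer, and data source names, and the intermediate downloadedImage node are all hypothetical. It also assumes the data source exposes upcid and imageUrl under the same names as the index fields. Attaching an AI services resource to the skillset, and the REST calls that create these objects, are omitted.

```python
import json

# Hypothetical endpoint of the custom skill that downloads the image.
SKILL_URI = "https://example.invalid/api/download-image"

skillset = {
    "name": "image-url-ocr-skillset",  # hypothetical name
    "skills": [
        {
            # Custom Web API skill: takes the URL, returns the image as a file reference.
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "uri": SKILL_URI,
            "httpMethod": "POST",
            "context": "/document",
            "inputs": [{"name": "imageUrl", "source": "/document/imageUrl"}],
            "outputs": [{"name": "image", "targetName": "downloadedImage"}],
        },
        {
            # OCR skill: reads the downloaded image and emits the recognized text.
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document",
            "inputs": [{"name": "image", "source": "/document/downloadedImage"}],
            "outputs": [{"name": "text", "targetName": "ocrText"}],
        },
    ],
}

indexer = {
    "name": "image-url-ocr-indexer",           # hypothetical name
    "dataSourceName": "image-url-datasource",  # hypothetical; must expose upcid and imageUrl
    "targetIndexName": "product-api",
    "skillsetName": skillset["name"],
    # Map only the new enriched field back into the existing index.
    "outputFieldMappings": [
        {"sourceFieldName": "/document/ocrText", "targetFieldName": "ocrText"}
    ],
}

skillset_json = json.dumps(skillset, indent=2)
indexer_json = json.dumps(indexer, indent=2)
```

Because the indexer targets the existing product-api index and uses the same upcid key, each run merges ocrText into the matching document instead of creating a new one.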
