Create an Indexer that uses a field of the JSON documents inside an index as its data source

Question

I have an index containing documents in JSON format in Azure Search Service.

Index Schema

{
"name": "product-api",
"defaultScoringProfile": null,
"fields": [
    {
        "name": "upcid",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": true,
        "facetable": false,
        "key": true,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "productName",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "imageUrl",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "ocrText",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    }
],
"scoringProfiles": [],
"corsOptions": {
    "allowedOrigins": [
        "*"
    ],
    "maxAgeInSeconds": null
},
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
    "@odata.type": "#Microsoft.Azure.Search.ClassicSimilarity"
}
}

My requirement

Create an Indexer that uses the imageUrl field as its data source (the images are not stored in an Azure storage service), runs Microsoft.Skills.Vision.OcrSkill as a skill, and maps the output to the ocrText field.

Problem

From what I have read in the docs, the data source (in my case, the images) must be in Azure Blob Storage in order to create an Indexer.

Has anyone done something similar to my requirement? Or does anyone know of a direct or indirect way to achieve it?

Any leads would be appreciated; I could not find anything related to this on the Internet.

Answer 1

How did you populate the imageUrl data in the search index in the first place?

I'm asking because there's no way to configure an Indexer to ingest data from a search index as its data source. If you are able to put those image URLs somewhere else (e.g. blob storage), you could point an Indexer at that. If it's another Indexer that populates the source index in the first place, you can add a knowledge store to that primary Indexer to sink the imageUrl data to blob/table storage as well as to the search index. Or just process the URL in the primary Indexer's skillset and don't bother with this secondary pass!

The next issue is that Indexers won't crawl arbitrary URLs that you provide. An Indexer only ingests data that comes from the data source or that is returned to it by a skill. It is possible to write a custom Web API skill that takes the URL as input, downloads the image from that URL, and responds to the Indexer with the binary image data. This functionality is not very well documented, but there is an example powerskill that does something along those lines, which you could more or less copy.
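As a rough illustration of the custom skill described above, here is a minimal Python sketch of the request handling only. It follows the custom Web API skill contract (a "values" array of records, each with a recordId and a data object, answered with data, errors and warnings per record). Everything else is an assumption to verify against the powerskill: the input name imageUrl, the output name image, and returning the bytes as a file reference of the form {"$type": "file", "data": "<base64>"}. The helpers download and handle_skill_request are hypothetical names, and hosting (for example in an Azure Function) is left out.

```python
import base64
import urllib.request


def download(url, timeout=10):
    """Fetch the raw bytes at `url` with a plain, unauthenticated HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def handle_skill_request(body, fetch=download):
    """Build a custom-skill response body from a custom-skill request body.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    results = []
    for record in body.get("values", []):
        # Every input record must be answered with the same recordId.
        out = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        url = record.get("data", {}).get("imageUrl")  # assumed input name
        try:
            if not url:
                raise ValueError("imageUrl is missing")
            raw = fetch(url)
            # Assumed output shape: a base64-encoded file reference.
            out["data"]["image"] = {
                "$type": "file",
                "data": base64.b64encode(raw).decode("ascii"),
            }
        except Exception as exc:
            # Report the failure for this record only; other records still succeed.
            out["errors"].append({"message": f"Could not fetch image: {exc}"})
        results.append(out)
    return {"values": results}
```

Reporting failures per record rather than failing the whole request means one dead URL does not block the rest of the batch.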

The rest of this secondary Indexer's pipeline should be pretty straightforward (add an OCR skill, plus output field mappings to merge the data back into the same index). The Indexer won't overwrite existing values with nulls, so just make sure to map only the one new field back to the index, and the rest of the index's data will remain unchanged.
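The pipeline described above might look roughly like the following, sketched as Python dictionaries that serialize to the JSON bodies of a skillset and an indexer definition. The skill types #Microsoft.Skills.Custom.WebApiSkill and #Microsoft.Skills.Vision.OcrSkill and the target index product-api come from the service and the question; the names product-ocr-skillset, product-ocr-indexer, downloadedImage, the custom skill URI and the data source name are hypothetical. Whether the OCR skill accepts the custom skill's output directly at that path is an assumption to test.

```python
import json

INDEX_NAME = "product-api"               # index from the question
SKILLSET_NAME = "product-ocr-skillset"   # hypothetical name


def build_skillset(custom_skill_uri):
    """Skillset: download the image via the custom skill, then run OCR on it."""
    return {
        "name": SKILLSET_NAME,
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "context": "/document",
                "uri": custom_skill_uri,
                "httpMethod": "POST",
                "inputs": [{"name": "imageUrl", "source": "/document/imageUrl"}],
                "outputs": [{"name": "image", "targetName": "downloadedImage"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document",
                # Assumption: the image returned by the custom skill can be fed in here.
                "inputs": [{"name": "image", "source": "/document/downloadedImage"}],
                "outputs": [{"name": "text", "targetName": "ocrText"}],
            },
        ],
    }


def build_indexer(data_source_name):
    """Indexer that writes only the key and the new ocrText field to the index."""
    return {
        "name": "product-ocr-indexer",  # hypothetical name
        "dataSourceName": data_source_name,
        "targetIndexName": INDEX_NAME,
        "skillsetName": SKILLSET_NAME,
        # The key lets each enriched document merge into its existing index entry.
        "fieldMappings": [{"sourceFieldName": "upcid", "targetFieldName": "upcid"}],
        "outputFieldMappings": [
            {"sourceFieldName": "/document/ocrText", "targetFieldName": "ocrText"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_skillset("https://example.invalid/api/fetch-image"), indent=2))
    print(json.dumps(build_indexer("image-urls-datasource"), indent=2))
```

The data source here is assumed to be wherever the image URLs were placed (for example blob storage, as suggested earlier), not the search index itself.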

Answer 1, 2021-06-23 23:01:41
