How to set up cognitive search capabilities (with OCR) in Azure Search programmatically through Java?

I want to provide full-text search in my application, so I am trying to configure Azure Search with Cognitive Search capabilities so that I can index both image and non-image documents stored in Azure Blob Storage. However, when I configure Azure Search from Java code using the Azure Search REST APIs, the OCR capability does not take effect and the image documents are not indexed. I seem to be missing some configuration details in the REST-based setup.

Case 1: From the Azure portal, I am able:

  1. To configure Azure Search with cognitive capabilities (including the OCR skillset), an index, an indexer and Azure Blob Storage.
  2. To index image and non-image documents such as pdf, png, jpg, xls etc.
  3. To search the indexed documents.

Case 2: From Java code using the Azure Search REST APIs, I am able:

  1. To configure Azure Search with cognitive capabilities, an index, an indexer and Azure Blob Storage.
  2. To index non-image documents such as pdf, xls etc.
  3. To search the indexed documents.

In case 2, however, OCR is not applied and the image documents are not indexed, so some configuration detail must be missing from the REST-based setup.

I am calling the following Azure Search REST API endpoints from Java code:

  1. https://%s.search.windows.net/datasources?api-version=%s
  2. https://%s.search.windows.net/skillsets/cog-search-demo-ss?api-version=%s
  3. https://%s.search.windows.net/indexes/%s?api-version=%s
  4. https://%s.search.windows.net/indexers?api-version=%s
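For readers following along, here is a minimal sketch of how one of these URL templates might be filled in and turned into a request with the JDK 11+ `java.net.http` API. The class name `SearchEndpoints`, the helper names, the service name `my-service` and the `2019-05-06` api-version are placeholders chosen for illustration, not values from the question; the `api-key` header is how Azure Search REST calls are authenticated. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class SearchEndpoints {
    // Hypothetical helper: fills in the skillset URL template from the question.
    static String skillsetUrl(String service, String skillsetName, String apiVersion) {
        return String.format(
            "https://%s.search.windows.net/skillsets/%s?api-version=%s",
            service, skillsetName, apiVersion);
    }

    // Hypothetical helper: builds (but does not send) a PUT request
    // that carries a JSON definition such as skillset.json.
    static HttpRequest putJson(String url, String apiKey, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/json")
            .header("api-key", apiKey)
            .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder service name, api-version and key; substitute real values.
        String url = skillsetUrl("my-service", "cog-search-demo-ss", "2019-05-06");
        HttpRequest request = putJson(url, "<admin-api-key>", "{\"skills\": []}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; logging the response status and body is usually the quickest way to see why a skillset or indexer definition was rejected.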

Configuration JSONs:

  1. datasource.json

{
  "name" : "csstoragetest",
  "type" : "azureblob",
  "credentials" : { "connectionString" : "connectionString" },
  "container" : { "name" : "csblob" }
}
  2. skillset.json
{
  "description": "Extract text from images and merge with content text to produce merged_text",
  "skills":
  [
    {
      "description": "Extract text (plain and structured) from image.",
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "null",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "myText"
        },
        {
          "name": "layoutText",
          "targetName": "myLayoutText"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text", "source": "/document/content"
        },
        {
          "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText", "targetName" : "merged_text"
        }
      ]
    }
  ]
}
  3. index.json
{
  "name": "azureblob-indexing",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
  ]
}
  4. indexer.json
{
  "name" : "azureblob-indexing1",
  "dataSourceName" : "csstoragetest",
  "targetIndexName" : "azureblob-indexing",
  "schedule" : { "interval" : "PT2H" },
  "skillsetName" : "cog-search-demo-ss",
  "parameters":
  {
    "maxFailedItems":-1,
    "maxFailedItemsPerBatch":-1,
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "imageAction":"generateNormalizedImages",
      "parsingMode": "default",
      "firstLineContainsHeaders": false,
      "delimitedTextDelimiter": ","
    }
  }
}

After configuring Azure Search through Java code, the image documents should get indexed and I should be able to search them based on the text they contain.

Try setting the default language code to null, without the quotes, in skillset.json:

"defaultLanguageCode": null
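The distinction matters because `"null"` in quotes is a four-character string, while bare `null` is the JSON null literal. If the skillset JSON is assembled as a string in Java, a small guard like the sketch below could catch the quoted form before the request is sent. The class and method names are hypothetical, and the regex only targets the `defaultLanguageCode` property; a real JSON library would be a more robust way to do this.

```java
import java.util.regex.Pattern;

class NullLiteralFix {
    // Matches "defaultLanguageCode": "null" with any spacing around the colon.
    private static final Pattern QUOTED_NULL =
        Pattern.compile("(\"defaultLanguageCode\"\\s*:\\s*)\"null\"");

    // Hypothetical helper: rewrites the quoted string "null" to the JSON null literal.
    static String unquoteNull(String json) {
        return QUOTED_NULL.matcher(json).replaceAll("$1null");
    }

    public static void main(String[] args) {
        String before = "{\"defaultLanguageCode\": \"null\", \"detectOrientation\": true}";
        System.out.println(unquoteNull(before));
    }
}
```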

I have figured out the required configuration myself. It required matching all the parameters between case 1 and case 2 described in the question, and then updating the configuration JSONs accordingly.
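One way to do that comparison, sketched here under assumptions, is to read back the definitions the portal created and compare them with the hand-written JSONs. The Azure Search REST API returns a resource definition on a GET to `/{resourceType}/{name}`; the class name, helper name, service name and api-version below are placeholders, and the answer does not say which parameters actually differed (properties such as the indexer's `outputFieldMappings` and the index's field list are natural candidates to check).

```java
import java.net.URI;
import java.net.http.HttpRequest;

class DefinitionFetch {
    // Hypothetical helper: builds a GET request for a named resource.
    // resourceType is one of "datasources", "skillsets", "indexes", "indexers".
    static HttpRequest getDefinition(String service, String resourceType,
                                     String name, String apiVersion, String apiKey) {
        String url = String.format(
            "https://%s.search.windows.net/%s/%s?api-version=%s",
            service, resourceType, name, apiVersion);
        return HttpRequest.newBuilder(URI.create(url))
            .header("api-key", apiKey)
            .GET()
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values; the indexer name is the one from indexer.json.
        HttpRequest request = getDefinition(
            "my-service", "indexers", "azureblob-indexing1", "2019-05-06", "<admin-api-key>");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Fetching the portal-created skillset and indexer this way and diffing the responses against skillset.json and indexer.json should surface any parameter present in case 1 but absent in case 2.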
