
How to set up cognitive search capabilities (with OCR) in Azure Search programmatically through Java?

I want to provide full-text search in my application, so I am trying to configure Azure Search with Cognitive Search capabilities in order to index both image and non-image documents stored in Azure Blob Storage. However, when I configure Azure Search from Java code through the Azure Search REST APIs, the OCR capabilities do not take effect and the image documents are not indexed. I seem to be missing some configuration details in the REST calls.

Case 1: From the Azure Portal, I am able:

  1. To configure Azure Search with cognitive capabilities (including the OCR skillset), an index, an indexer and Azure Blob Storage.
  2. To index image and non-image documents such as pdf, png, jpg, xls etc.
  3. To search the indexed documents.

Case 2: From Java code, using the Azure Search REST APIs, I am able:

  1. To configure Azure Search with cognitive capabilities, an index, an indexer and Azure Blob Storage.
  2. To index non-image documents such as pdf, xls etc.
  3. To search the indexed documents.

However, in case 2 the OCR capabilities do not take effect and the image documents are not indexed, so some configuration detail must be missing from what I send through the REST APIs.

I am using the following Azure Search REST API endpoints from Java code:

  1. https://%s.search.windows.net/datasources?api-version=%s
  2. https://%s.search.windows.net/skillsets/cog-search-demo-ss?api-version=%s
  3. https://%s.search.windows.net/indexes/%s?api-version=%s
  4. https://%s.search.windows.net/indexers?api-version=%s
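For reference, a call against one of these endpoints could be built as in the following minimal sketch. It assumes Java 11+ (`java.net.http`); the class name `SearchRequests`, the method `putJson`, the service name, the key placeholder and the `2019-05-06` API version are illustrative assumptions, not values from the question. It only builds the request and sends nothing. Note that URLs naming a resource (such as `skillsets/cog-search-demo-ss`) take PUT, while the collection URLs (`datasources`, `indexers`) take POST.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of any Azure SDK.
final class SearchRequests {
    // Builds (but does not send) a PUT request that creates or updates one
    // named Azure Search resource, authenticated with the api-key header.
    static HttpRequest putJson(String service, String path, String apiVersion,
                               String adminKey, String jsonBody) {
        URI uri = URI.create(String.format(
                "https://%s.search.windows.net/%s?api-version=%s",
                service, path, apiVersion));
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .header("api-key", adminKey)
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // All argument values here are placeholders (assumptions).
        HttpRequest req = putJson("my-service", "skillsets/cog-search-demo-ss",
                "2019-05-06", "ADMIN_KEY_PLACEHOLDER", "{}");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())`, and the response body is worth logging, since the service reports configuration errors there.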

Configuration JSON files:

  1. datasource.json

{
    "name" : "csstoragetest",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "connectionString" },
    "container" : { "name" : "csblob" }
}

  2. skillset.json

{
  "description": "Extract text from images and merge with content text to produce merged_text",
  "skills":
  [
    {
      "description": "Extract text (plain and structured) from image.",
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "null",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "myText"
        },
        {
          "name": "layoutText",
          "targetName": "myLayoutText"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text", "source": "/document/content"
        },
        {
          "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText", "targetName" : "merged_text"
        }
      ]
    }
  ]
}

  3. index.json

{
  "name": "azureblob-indexing",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
  ]
}

  4. indexer.json

{
  "name" : "azureblob-indexing1",
  "dataSourceName" : "csstoragetest",
  "targetIndexName" : "azureblob-indexing",
  "schedule" : { "interval" : "PT2H" },
  "skillsetName" : "cog-search-demo-ss",
  "parameters":
  {
    "maxFailedItems":-1,
    "maxFailedItemsPerBatch":-1,
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "imageAction":"generateNormalizedImages",
      "parsingMode": "default",
      "firstLineContainsHeaders": false,
      "delimitedTextDelimiter": ","
    }
  }
}

After configuring Azure Search through Java code, the image documents should be indexed, and I should be able to search them based on the text they contain.

Try setting the default language code in skillset.json to null without the quotes, so that it is the JSON null literal rather than the string "null":

"defaultLanguageCode": null
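If the skillset JSON is assembled as text in Java, the quoted value can be patched before the request is sent. The following is only a sketch: the class `SkillsetJsonFix` and the method `unquoteNull` are hypothetical names, and the regex-based replacement is a shortcut for illustration, not a substitute for a real JSON library.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class SkillsetJsonFix {
    // Rewrites  "key": "null"  (a string) to  "key": null  (the JSON literal)
    // for the given key; other keys and values are left untouched.
    static String unquoteNull(String json, String key) {
        String pattern = "(\"" + Pattern.quote(key) + "\"\\s*:\\s*)\"null\"";
        return json.replaceAll(pattern, "$1null");
    }

    public static void main(String[] args) {
        String skill = "{ \"defaultLanguageCode\": \"null\", \"detectOrientation\": true }";
        System.out.println(unquoteNull(skill, "defaultLanguageCode"));
    }
}
```

Omitting the property altogether is another option when no explicit language code is wanted.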

I figured out the required configuration myself. It came down to comparing the parameters between case 1 and case 2 (as described in the question), making them match, and then updating the configuration JSON files accordingly.
