How to link scanned document with its text content to make it searchable?

I have PDF documents containing several images/pages of scanned documents. Their (OCR-produced) text content comes in separate XML files.

Is it possible to use the text content from the XML files, or link it to my PDF files somehow? (Ideally, no additional files would be left in the repository to confuse unaware users.)

I've been told there's a 65k limit on a text property, so I can't simply put the text content into a property on the document, as the PDF might easily exceed that limit.

A suggestion has been made to pass a stream with the text content to the cm:content property of my PDF file. I'm kind of lost here, as in my opinion that means I'm either providing a reference or assigning a huge string again. The former would mean the text content has to be preserved somewhere as a separate document. The latter sounds like I would hit the 65k limit again.
Also, I think setting cm:content would probably delete the PDF content itself. I need the PDF binary data to remain untouched.

This is where the suggestion is being discussed. I'm currently trying that anyway.

So, it is actually quite easy. What needs to be done is to define a property of type "d:content" on your document; I do that via an aspect:

model.xml:

<aspects>
    <aspect name="mm:my_aspect">
...
            <property name="mm:myTextContentProperty">
                <type>d:content</type>
            </property>
        </properties>
    </aspect>
</aspects>

Then, when I have both the PDF and its text representation in the repository, I link the two by adding the aspect and populating the property:

getNodeService().addAspect(pdfNodeRef, myAspect, null);
getNodeService().setProperty(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, new ContentData("store://....bin", "text/plain", size, "UTF-8"));

Now the PDF can be found via both of the following queries, even though it does not contain any text data itself:

"@\\{http\\://mymodel.ns/content/1.0\\}myTextContentProperty:\"" + string + "\""
"TEXT:\"" + string + "\""

The latter is also hinted here, and I guess that is how the regular search in Alfresco Web Client works, because the PDF is now reachable using the regular search input.
There is one issue though: the search returns both the PDF document and the document I link using the property. So now I need to hide the latter from the search results...

(When searching using the first query, only the PDF is found, as expected; but that approach is of little use to me.)
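The two query strings above are built by plain string concatenation, so a search term containing a double quote or a backslash would break the phrase. A minimal sketch of a helper that produces the same two queries with the term escaped; the class and method names are hypothetical, and only the namespace URI and property name come from the queries above:

```java
// Hypothetical helper, not part of Alfresco: it only assembles query strings.
final class LuceneQueries {

    private LuceneQueries() {
    }

    // Builds the "@\{namespace\}localName" field name, escaping the colon
    // in the namespace URI the same way the hand-written query above does.
    static String escapeQName(String namespaceUri, String localName) {
        String escapedUri = namespaceUri.replace(":", "\\:");
        return "@\\{" + escapedUri + "\\}" + localName;
    }

    // Inside a quoted phrase only the backslash and the double quote
    // need escaping; the backslash must be handled first.
    static String escapePhrase(String phrase) {
        return phrase.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Query restricted to a single property (the first query above).
    static String propertyQuery(String namespaceUri, String localName, String phrase) {
        return escapeQName(namespaceUri, localName) + ":\"" + escapePhrase(phrase) + "\"";
    }

    // Full-text query (the second query above).
    static String fullTextQuery(String phrase) {
        return "TEXT:\"" + escapePhrase(phrase) + "\"";
    }
}
```

Calling `LuceneQueries.propertyQuery("http://mymodel.ns/content/1.0", "myTextContentProperty", string)` yields the same text as the first hand-written query whenever `string` contains no quotes or backslashes.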

Hopefully this saves other Alfresco newbies some time. :)

Another way to achieve what I need is to set MY_TEXT_CONTENT_PROPERTY using contentService:

ContentWriter writer = getContentService().getWriter(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, true);
writer.setMimetype("text/plain");
writer.setEncoding("UTF-8");
writer.putContent(stringFromXmlDescription); // the source XML gets thrown away

(The important thing seems to be to put the content after the mimetype and encoding are set. Otherwise the content/property is not searchable.)

With this approach there's no need to hide the linked text documents, because there aren't any.
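The `stringFromXmlDescription` value passed to the writer has to be extracted from the OCR XML first. The XML format is not described here, so the following is only a sketch under the assumption that the recognized text lives in the text nodes of the document and that markup can simply be discarded. The class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: flattens an XML document into plain text.
final class OcrXmlText {

    private OcrXmlText() {
    }

    // Concatenates all text nodes of the document and collapses runs of
    // whitespace into single spaces. Assumption: adjacent elements are
    // separated by whitespace in the source XML; if they are not, their
    // text is joined without a space.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so external entities cannot be resolved.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            String raw = doc.getDocumentElement().getTextContent();
            return raw.trim().replaceAll("\\s+", " ");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse OCR XML", e);
        }
    }
}
```

If the OCR output keeps the text in attributes or mixes it with coordinates and confidence values, the extraction would need to target the specific elements instead of flattening the whole document.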
