
Entity extraction on large documents

I need to extract entities from Word and PDF documents. The documents are typically 10 to 20 pages long. Are there scalable libraries or APIs that we can plug into our processing pipeline? Any comparative study of the different solutions would be helpful.

Take a look at Watson Natural Language Understanding (you'll need to get an IBM ID and then log in to see this content; don't worry, the cost is $0). With Watson Natural Language Understanding, you will want to look at the API Explorer to find the correct API syntax for the results you are looking for.

I also noticed that you mention Word/PDF documents. You will need to convert those using the Watson Discovery service, and then you can pass the converted documents to Watson Natural Language Understanding, which takes JSON, text, or HTML input.
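As a rough illustration of the second step, here is a minimal Python sketch of what an entity request to Natural Language Understanding might look like once a document has been converted to text. It uses only the standard library. The function names, the `limit` of 50, the version date, and the service URL you pass in are assumptions for illustration; check the API Explorer for the exact syntax that applies to your service instance.

```python
import base64
import json
import urllib.request


def build_entities_request(text, api_key, service_url, version="2022-04-07"):
    """Build (but do not send) a POST request asking NLU for entities in `text`.

    `service_url` and `api_key` come from your own service credentials;
    the version date is an assumed default, not a recommendation.
    """
    payload = {
        "text": text,  # text already converted from the Word/PDF source
        "features": {"entities": {"limit": 50}},  # assumed cap on returned entities
    }
    # HTTP basic auth with the literal user name "apikey" and the key as password.
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/analyze?version={version}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def extract_entities(request):
    """Send the request and return the list stored under the "entities" key."""
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("entities", [])
```

In a pipeline you would call `build_entities_request` once per converted document and pass the result to `extract_entities`; keeping the two steps separate makes the request easy to inspect or log before anything is sent.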
