Best way to extract key-value pairs from an unstructured string?
Ideally while avoiding hard-coded rules for specific patterns as much as possible.
I'm currently working on a project similar to AWS Textract (link here). I've been successful at extracting data from files, but only in an unstructured form. Now I'm trying to figure out the best way to get the existing key-value pairs out of that mass of information.
For example, we have a text like this:
In this document we will find different key and values like this id : 1 and that country : France with no specific punctuation and probably talking about how good is my health...
The extraction would be something like this:
id : 1
country : France
health : good
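To make the target concrete, here is a minimal baseline sketch in Python. It assumes pairs are written as `key : value` with single-word keys and values, which is an assumption on my part and exactly the kind of hard-coded pattern I'd like to move away from. The function name `extract_pairs` is hypothetical. Note that it only finds the explicit pairs; `health : good` is implied by the sentence rather than written with a colon, so a pattern like this cannot recover it.

```python
import re

# Assumed pattern: one word, optional spaces, a colon, optional spaces, one word.
PAIR_PATTERN = re.compile(r"\b(\w+)\s*:\s*(\w+)")


def extract_pairs(text):
    """Return a dict of explicit 'key : value' pairs found in the text."""
    return {key: value for key, value in PAIR_PATTERN.findall(text)}


sample = (
    "In this document we will find different key and values like this "
    "id : 1 and that country : France with no specific punctuation and "
    "probably talking about how good is my health..."
)

# Finds the id and country pairs only; the implicit health pair is missed.
print(extract_pairs(sample))
```

So the open part of the question is really the implicit case, where the key and value are expressed in free-form language.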
What I do know is that Amazon uses a "confidence" variable when extracting information in this kind of scenario, which I guess involves some machine-learning algorithm. In my case, I don't have that big a database to learn from.
I'm pretty sure there is an easier solution that is no less flexible.