What data model can be used for the “meaning” of a page or text

I have come across this question many times around the web:

How do you extract the meaning of a page?

And I know that I am not experienced enough to even try to suggest a solution. To me this is the holy grail of web programming, or maybe even of computer technology as a whole.

But through the power of imagination, let us assume that I have written the ultimate script that does exactly that. For example, I enter this text:

Imagination has brought mankind through the dark ages to its present state of civilization. Imagination led Columbus to discover America. Imagination led Franklin to discover electricity.

and my powerful script extracts the meaning and says this:

The ability of human beings to think leads them to discover new things.

For the purpose of this example, I used a "String" to explain the meaning of the text. But if I had to store this in a database, an array, or any other sort of storage, what datatype would I use?

Note that I can have another text that uses a different analogy but still has the same meaning, worded differently, for example:

Imagination helps humankind advance.

Now I can enter a search query about the importance of imagination and these two results appear. But how will they be matched? Will it be a String comparison? Some integers or floating-point numbers? Maybe even binary?

What will the meaning be saved as? I would like to hear from you.

Update: Let me restate the question simply.

How do you represent Meaning in data?

Assuming that our brains do not have access to a metaphysical cloud server, meaning is represented as a configuration of neuronal connections, hormonal levels, and electrical activity (maybe even quantum fluctuations), and the interaction between all of these and the outer world and other brains. So this is good news: we know that there is at least one answer to your question (meaning is represented somewhere, somehow). The bad news is that most of us have no idea how this works, and those who think they do understand haven't been able to convince the others or each other. Being one of the clueless people, I can't give the answer to your question, but I can provide a list of the answers that I have come across to smaller and degenerate versions of the grand problem.

If you want to represent the meaning of lexical entities (e.g., concepts, actions) you can use distributed models such as vector space models. In these models, meaning usually has a geometric component. Each concept is represented as a vector, and you place the concepts in a space in such a way that similar concepts are closer to each other. A very common way to construct such a space is to pick a set of commonly used words (basis words) as the dimensions of the space and simply count the number of times a target concept is observed together with these basis words in speech or text. Similar concepts will be used in similar contexts; thus, their vectors will point in similar directions. On top of that you can apply a range of weighting, normalization, dimensionality reduction and recombination techniques (e.g., tf-idf, pointwise mutual information (http://en.wikipedia.org/wiki/Pointwise_mutual_information), SVD). A slightly related, but probabilistic rather than geometric, approach is latent Dirichlet allocation and other generative/Bayesian models, which are already mentioned in another answer.

The vector space model approach is good for discriminative purposes. You can decide whether two given phrases are semantically related or not (for example, matching queries to documents or finding similar search query pairs to help users expand their queries). But it is not very straightforward to incorporate syntax in these models. I can't see very clearly how you could represent the meaning of a sentence by a vector.
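To make the counting idea above concrete, here is a minimal sketch in Python. The corpus, the basis words, and the function names are all made up for illustration, and the window size of 2 is an arbitrary assumption; a real system would use a large corpus and the weighting steps mentioned above.

```python
import math
from collections import Counter

# Toy corpus and basis words, invented purely for illustration.
CORPUS = [
    "the cat drinks milk and the cat sleeps",
    "the dog drinks water and the dog sleeps",
    "the car needs fuel and the car drives",
]
BASIS = ["drinks", "sleeps", "needs", "drives"]


def cooccurrence_vector(target, corpus, basis, window=2):
    """Count how often each basis word occurs within `window` tokens of `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in basis:
                    counts[tokens[j]] += 1
    # One dimension per basis word, in a fixed order.
    return [counts[b] for b in basis]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


cat = cooccurrence_vector("cat", CORPUS, BASIS)
dog = cooccurrence_vector("dog", CORPUS, BASIS)
car = cooccurrence_vector("car", CORPUS, BASIS)
print(cat, dog, car)
print(cosine(cat, dog), cosine(cat, car))
```

In this toy data "cat" and "dog" appear next to the same basis words, so their vectors point the same way, while "car" shares no context with them. The stored "meaning" here is just a list of numbers per word.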

Grammar formalisms could help to incorporate syntax and bring structure to meaning and to the relations between concepts (e.g., head-driven phrase structure grammar). If you build two agents that share a vocabulary and grammar and make them communicate (i.e., transfer information from one to the other) via these mechanisms, you could say they represent the meaning. It is rather a philosophical question where and how the meaning is represented when a robot tells another to pick the "red circle above the black box" via a built-in or emergent grammar and vocabulary, and the other one successfully picks the intended object (see this very interesting experiment on grounding vocabulary: Talking Heads).

Another way to capture meaning is to use networks. For example, by representing each concept as a node in a graph and the relations between the concepts as edges between the nodes, one can come up with a practical representation of meaning. ConceptNet is a project that aims to represent common sense, and it is possible to view it as a semantic network of commonsense concepts. In a way, the meaning of a certain concept is represented via its location relative to other concepts in the network.
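As a rough illustration of "location relative to other concepts", the sketch below stores a handful of labeled edges and measures how many hops separate two concepts. The edges, relation names, and function names are invented for the example; they are not taken from any real dataset, and treating the graph as undirected is a simplifying assumption.

```python
from collections import deque

# Hypothetical edges: (concept, relation, concept). Invented for illustration.
EDGES = [
    ("imagination", "IsA", "mental_ability"),
    ("mental_ability", "PartOf", "human_mind"),
    ("discovery", "CausedBy", "imagination"),
    ("columbus", "CapableOf", "discovery"),
]


def build_graph(edges):
    """Build an undirected adjacency map, ignoring relation labels."""
    graph = {}
    for source, _relation, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph


def distance(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


graph = build_graph(EDGES)
print(distance(graph, "columbus", "imagination"))
print(distance(graph, "columbus", "human_mind"))
```

Here the stored datatype is a set of (node, relation, node) records, and "how related are these two concepts" becomes a path query rather than a string comparison.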

Speaking of common sense, Cyc is another ambitious example of a project that tries to capture commonsense knowledge, but it does so in a very different way from ConceptNet. Cyc uses a well-defined symbolic language to represent the attributes of objects and the relations between objects unambiguously. By employing a very large set of rules and concepts, together with an inference engine, one can make deductions about the world and handle questions and requests such as "Can horses be sick?" or "Bring me a picture of a sad person."
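The following is only a toy sketch of the rules-plus-inference idea, not Cyc's actual language or engine. The facts, the relation names `isa` and `can_be`, and the single inheritance rule are all assumptions made up for the example.

```python
# Hypothetical facts as (subject, relation, object) tuples.
FACTS = {
    ("horse", "isa", "mammal"),
    ("mammal", "isa", "animal"),
    ("animal", "can_be", "sick"),
}


def infer(facts):
    """Forward-chain one rule until nothing new appears.

    Rule: if A isa B, and B <rel> C, then A <rel> C.
    This covers both transitivity of `isa` and inheritance of properties.
    """
    known = set(facts)
    while True:
        derived = {
            (a, rel, c)
            for (a, r1, b) in known
            for (b2, rel, c) in known
            if r1 == "isa" and b == b2
        }
        if derived <= known:
            return known
        known |= derived


def can_be(facts, subject, prop):
    """Answer a yes/no question by checking the inferred fact set."""
    return (subject, "can_be", prop) in infer(facts)


print(can_be(FACTS, "horse", "sick"))
```

The answer to "Can horses be sick?" is never stored directly; it is derived from three smaller facts. The stored representation is symbolic (tuples of identifiers) rather than numeric.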

I worked on a system that attempted to do this at a previous company. We were more focused on "what unstructured documents are most similar to this unstructured document", but the relevant part was how we determined the "meaning" of the document.

We used two different algorithms, PLSA (Probabilistic Latent Semantic Analysis) and PSVM (Probabilistic Support Vector Machines). Both extract topics that are significantly more prevalent in the document being analyzed than in other documents in the collection.

The topics themselves have numerical IDs, and there was an xref (cross-reference) table from document to topic. To determine how close two documents were, we would look at the percentage of topics the documents had in common.

Presuming your super script could produce topics from the query entered, you could use a similar structure. It has the added advantage that the xref table contains only integers, so you are comparing integers rather than performing string operations.
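A minimal sketch of that structure might look like the following. The IDs are invented, and "percentage of topics in common" is interpreted here as shared topics divided by all topics of either document (Jaccard similarity), which is an assumption; the original system may have computed it differently.

```python
# Hypothetical xref rows: (document_id, topic_id). Integers only.
XREF = [
    (1, 10), (1, 11), (1, 12),
    (2, 11), (2, 12), (2, 13),
    (3, 20),
]


def topics_by_document(xref):
    """Group topic IDs by document ID."""
    topics = {}
    for doc_id, topic_id in xref:
        topics.setdefault(doc_id, set()).add(topic_id)
    return topics


def shared_topic_ratio(topics, doc_a, doc_b):
    """Fraction of the combined topic set that both documents share."""
    a = topics.get(doc_a, set())
    b = topics.get(doc_b, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


topics = topics_by_document(XREF)
print(shared_topic_ratio(topics, 1, 2))
print(shared_topic_ratio(topics, 1, 3))
```

A search query would be run through the same topic extraction, and its topic IDs compared against each document's set in the same way.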

Semantics is a wide and deep field, and there are many models, all of them with advantages and problems from an AI implementation point of view. With so little background, one can hardly make a recommendation beyond "study the literature, and pick a theory which resonates with your intuition (and if you are at all successful in this, replace it with a better theory of your own, and score academic points)". Having said that, the freshman course material I can vaguely recollect used to have nice things to say about a recursive structure called a "frame", but this must have been 15 years ago.

A meaning is in general an abstract concept, held in an internal black-box data structure that depends on the chosen algorithm. But this is not the interesting part. If you do semantic analysis, the usual question concerns differences in meaning, e.g., whether two documents talk about the same topic, how different some documents are, or how to group documents with similar meanings.

If you use a vector space model, the meaning/semantics can be represented by a collection of vectors that represent specific topics. One way to extract such patterns is latent semantic analysis (http://en.wikipedia.org/wiki/Latent_semantic_analysis) or non-negative matrix factorization (http://en.wikipedia.org/wiki/Nonnegative_matrix_factorization). But there are more elaborate statistical models that represent semantics by the parameters of certain probability distributions. A more recent method is latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

I will talk about the Semantic Web because I think it offers the most advanced studies and language implementations on the subject.

Resource Description Framework (RDF) is one of the many Semantic Web data models available for describing information.

RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format

and

However, in practice, RDF data is often persisted in relational database or native representations also called Triplestores, or Quad stores if context (i.e. the named graph) is also persisted for each RDF triple.

RDF content can be retrieved using RDF queries.


Topic Maps is another model for storing and representing knowledge.

Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.

and

In the year 2000 Topic Maps was defined in an XML syntax XTM. This is now commonly known as "XTM 1.0" and is still in fairly common use.

From the official Topic Maps Data Model:

The only atomic fundamental types defined in this part of ISO/IEC13250 (in 4.3) are strings and null. Through the concept of datatypes, data of any type can be represented in this model. All datatypes used shall have a string representation of their value space and this string representation is what is stored in the topic map. The information about which datatype the value belongs to is stored separately, in the form of a locator identifying the datatype.
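The quoted rule (store a string, plus a separate locator naming the datatype) can be sketched in a few lines. This is an illustration of the idea only, not an implementation of the Topic Maps Data Model; the class and function names are hypothetical, and the use of XML Schema datatype IRIs as locators is an assumption for the example.

```python
from dataclasses import dataclass
from decimal import Decimal

# XML Schema namespace, used here as the source of datatype locators.
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class TypedValue:
    lexical: str   # the string representation that actually gets stored
    datatype: str  # locator (IRI) identifying the datatype


# Hypothetical registry mapping locators to Python types.
DECODERS = {
    XSD + "string": str,
    XSD + "integer": int,
    XSD + "decimal": Decimal,
}


def encode(value):
    """Turn a Python value into its string form plus a datatype locator."""
    for locator, py_type in DECODERS.items():
        if type(value) is py_type:
            return TypedValue(str(value), locator)
    raise TypeError(f"no datatype registered for {type(value).__name__}")


def decode(typed_value):
    """Rebuild the Python value from the stored string and its locator."""
    return DECODERS[typed_value.datatype](typed_value.lexical)


stored = encode(42)
print(stored)
print(decode(stored))
```

Everything that reaches storage is a pair of strings, which matches the observation that these formats end up as text in the database.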

Many other formats have been proposed; you can take a look at this article for more information.

I also want to link you to a recent answer I wrote about a similar topic, with a lot of useful links.


After reading various articles, I think the common direction every method takes is storing the data in a text format. The related information can be stored in a database directly as text.

Having the data in an understandable text format has several benefits, which perhaps outweigh the disadvantages.

Other Semantic Web notations such as Notation 3 (N3) or Turtle use slightly different formats, but are still plain text.

An N3 example:

@prefix dc: <http://purl.org/dc/elements/1.1/>.

<http://en.wikipedia.org/wiki/Tony_Benn>
  dc:title "Tony Benn";
  dc:publisher "Wikipedia".
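The N3 snippet above encodes two subject-predicate-object triples. As a hedged sketch of how such triples could be held and queried in memory, the code below expands the `dc:` prefix by hand and matches patterns where `None` acts as a wildcard. The `match` function is a made-up helper, not a real RDF library API or a SPARQL implementation.

```python
# Namespace and subject taken from the N3 example; prefix expanded by hand.
DC = "http://purl.org/dc/elements/1.1/"
TONY = "http://en.wikipedia.org/wiki/Tony_Benn"

# Each triple is (subject, predicate, object), all stored as plain strings.
TRIPLES = [
    (TONY, DC + "title", "Tony Benn"),
    (TONY, DC + "publisher", "Wikipedia"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(p is None or p == v for p, v in zip(pattern, triple))
    ]


print(match(TRIPLES, predicate=DC + "title"))
print(match(TRIPLES, subject=TONY))
```

A relational table with three text columns would hold the same data, which is roughly what the quoted passage on triplestores describes.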

Finally, I would like to link you to a useful article you should read: Standardization of Unstructured Textual Data into Semantic Web Format.

Let's assume that you have found the ultimate algorithm that can provide the meaning of a text. In particular, you selected a string representation, but if your algorithm found the meaning correctly, then that meaning can be uniquely identified by the algorithm. Right?

So, for simplicity, let's assume there is only one meaning for that particular text. In this case it is uniquely identified before the algorithm outputs a phrase describing it.

So, basically, in order to store a meaning we first need a unique identifier.

The meaning can only exist in relation to a subject. It is the meaning of a subject. In order for that subject to have a meaning we must know something about it. In order for a subject to have a unique meaning it must be represented unambiguously to the observer (that is, the algorithm). For example, the statement "2 = 3" will have the meaning of false because of the standardization of mathematical symbols. But a text written in a foreign language will have no meaning for us. Nor will anything that we can't understand, for example "what is the meaning of life?"

In conclusion, in order to build an algorithm that can extract the absolute meaning from any random text, we, as humans, must first know the absolute meaning of anything. :)

In practice, you can only extract the meaning of a known text, written in a known language, in a known format. And for this, there are tools and research in the fields of neural networks, natural language processing and so on...

Try making it a char* (a C-style string). It is easily stored in databases and easy to use. Make it of length 50 (10 words) or 75 (15 words).

EDIT: put both under the same word (imagination), then check for similar indexes and assign them to the same word.

Use:

-- INDEX is a reserved word, so the column name is quoted; string literals take single quotes.
SELECT * FROM Dictionary WHERE "Index" = 'Imagination';

Sorry, I'm not too experienced with SQL.
