
Is it possible to integrate child documents from Solr into Carrot2 Workbench?

Asked 2020-12-21 10:37:05. Tags: search, solr, cluster-analysis, parent-child, carrot2

In my Solr database I have a structure that looks like this: each parent document represents a person's name (a dictionary). These parent documents also contain nested child documents, which are the documents in which that person's name appears (a nested list of dictionaries).

When I try to cluster the information in a way that makes sense, I am only able to cluster the child documents directly, which results in a bunch of clustered keywords that belong to those texts.

Ideally, I would like to cluster people (parent documents) by the similarity of their nested child documents. So rather than having keywords from texts clustered together, I would like to cluster the names of people whose child documents have similar content.

E.g., if the profiles of Bob, John and Lewis all have child documents that contain the text "We are highly skilled in Python", and the profiles of Dan, Maria and Chris have child documents that contain the text "We are highly skilled in Java", I would like one cluster of (Bob, John, Lewis) and one cluster of (Dan, Maria, Chris). When we click on the first cluster, we get the result "We are highly skilled in Python", and for the second cluster, we get the result "We are highly skilled in Java".

Is there a way of reproducing such a structure in Carrot2 Workbench?

1 answer

Unfortunately not. This is a pretty specific scenario and we aim to keep Workbench a generic tool, with Solr being one of many document sources.

For this kind of parent-child clustering, you'd need to use the Carrot2 Java or REST API directly:

Fetch child documents from Solr and cluster them in Carrot2.
For each cluster C:
- create a new cluster CC with the same label as cluster C,
- for each child document D in cluster C, take the child's parent document P and put the parent in cluster CC,
- put cluster CC in the set of parent clusters.

As a result of the above procedure, you'll have a set of clusters containing parent documents, clustered by the textual content of their child documents.
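
The regrouping step of this procedure can be sketched in a few lines. This is only an illustration under assumptions, not Carrot2 or Solr code: it presumes the clustering output has already been converted to plain (label, child ids) pairs and that a child-to-parent lookup is available. The names `regroup_by_parent`, `child_clusters` and `parent_of` are hypothetical, and skipping a parent that is already in the cluster is an added assumption that the answer does not spell out.

```python
def regroup_by_parent(child_clusters, parent_of):
    """Turn clusters of child document ids into clusters of parent ids.

    child_clusters: list of (label, [child_id, ...]) pairs, assumed to be
                    converted from the clustering result for child documents.
    parent_of:      dict mapping child_id -> parent_id (hypothetical lookup,
                    e.g. built from a parent reference stored on each child).
    """
    parent_clusters = []
    for label, child_ids in child_clusters:
        parents = []   # cluster CC: same label as C, holds parent documents
        seen = set()
        for child_id in child_ids:
            parent = parent_of[child_id]
            # Assumption: list a parent once even if several of its
            # children fell into the same cluster C.
            if parent not in seen:
                seen.add(parent)
                parents.append(parent)
        parent_clusters.append((label, parents))
    return parent_clusters


# Toy data mirroring the example from the question.
child_clusters = [
    ("Python", ["c1", "c2", "c3"]),
    ("Java", ["c4", "c5", "c6"]),
]
parent_of = {
    "c1": "Bob", "c2": "John", "c3": "Lewis",
    "c4": "Dan", "c5": "Maria", "c6": "Chris",
}

parent_clusters = regroup_by_parent(child_clusters, parent_of)
for label, parents in parent_clusters:
    print(label, parents)
```

Note that a parent whose children land in different clusters will appear in each of the corresponding parent clusters, so the resulting clusters may overlap.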