
Word and phrase counting with XSLT

We would like to build a dictionary from the documentation of the products our company makes, in order to establish a fixed terminology. To do that, we want to count the frequency of specific words and phrases.

This could be solved in a couple of different ways, but what we would really like is an XSLT algorithm that can recognize phrases as specific words that often occur together (so we don't have to specify in advance all the phrases and all their variants with different conjugations, affixes, etc.).

What do you think: could this task be done with XSLT, or should we look for other solutions?

If anyone has any useful advice on how we should start, I would be more than happy to hear your ideas and have a conversation about this!

You're looking for collocations, which in algorithmic terms is closely tied to pointwise mutual information (PMI), a measure of how much more often two words occur together than chance would predict.
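For reference, the usual definition of pointwise mutual information for a word pair is given here. The count-based estimate on the right is one common choice rather than the only one; the symbols c (occurrence count) and N (total number of tokens) are introduced here for illustration.

$$
\operatorname{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}, \qquad P(x, y) \approx \frac{c(x, y)}{N}, \quad P(x) \approx \frac{c(x)}{N}
$$

A pair that occurs together far more often than its individual word frequencies would suggest gets a high score. Very rare pairs also score high, so a minimum-count cutoff is normally applied before ranking.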

In XSLT, there is no framework for natural language processing (NLP), so you would have to invent one. However, there are NLP frameworks for other programming languages, such as Python's NLTK. Check out this example for finding collocations using Python.
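To give a feel for what such a tool does under the hood, here is a minimal sketch that ranks adjacent word pairs by pointwise mutual information using only the Python standard library. It is not NLTK's implementation: the function names `tokenize` and `bigram_pmi`, the naive regex tokenizer, the `min_count` cutoff, and the sample text are all assumptions made for illustration. It also does no stemming, so different conjugations are counted as different words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def bigram_pmi(tokens, min_count=2):
    """Return (bigram, pmi, count) tuples, highest PMI first.

    Bigrams seen fewer than min_count times are skipped, because PMI
    overrates pairs that occur only once.
    """
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = max(n_words - 1, 1)

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = word_counts[w1] / n_words
        p_y = word_counts[w2] / n_words
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y)), count))
    # Sort by descending PMI; break ties alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical sample text standing in for real product documentation.
sample = (
    "Open the control panel and check the power supply. "
    "The control panel shows the power supply status. "
    "Replace the power supply if the control panel reports a fault."
)
tokens = tokenize(sample)
for bigram, pmi, count in bigram_pmi(tokens):
    print(" ".join(bigram), count, round(pmi, 2))
```

On real documentation you would probably also want to stop bigrams from spanning sentence boundaries and to normalize word forms first, which is where a library like NLTK saves a lot of work.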

It might be easiest to use an external app written in a popular data mining language like Python or R. (You could even plug it into your DITA OT processing.) You might also look at vendors with existing solutions. I haven't done any in-depth search for that, but I've seen systems like Watson, Semaphore, or even XDocs return results from language analysis.
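If you go the external-app route, the first step is getting plain text out of your XML topics. The sketch below assumes a simplified DITA-like topic held in a string and uses only the Python standard library. The helper names `topic_text` and `word_frequencies` and the sample topic are hypothetical, and real DITA files with DOCTYPE declarations or custom entities may need a more capable parser.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def topic_text(xml_source):
    """Concatenate all text nodes of an XML document, ignoring the markup."""
    root = ET.fromstring(xml_source)
    return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())


def word_frequencies(text):
    """Count lowercase word tokens in the given text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical, simplified topic; element names follow DITA conventions.
sample_topic = """<topic id="psu">
  <title>Power supply</title>
  <body><p>Check the <uicontrol>power supply</uicontrol> status.</p></body>
</topic>"""

frequencies = word_frequencies(topic_text(sample_topic))
for word, count in frequencies.most_common(3):
    print(word, count)
```

The same extracted text can then be handed to whatever collocation or terminology analysis you choose.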
