
How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look only at the content. Based on that, the program has to decide which of the two languages the document is written in.

Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is

  1. Java
  2. Not free for "semi-commercial" usage

This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it, only one guess was correct. The other guesses were Swedish, English, Danish and French...

A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language, the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
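
A minimal sketch of that counting idea (the stop-word sets below are tiny illustrative samples, not the full Lucene.Net lists):

import re

# Illustrative stop-word samples only; real lists would be much longer.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "von"},
}

def guess_language(text):
    words = re.findall(r"\w+", text.lower())
    counts = {lang: sum(1 for w in words if w in stops)
              for lang, stops in STOPWORDS.items()}
    # The language whose stop words occur most often wins (ties are arbitrary).
    return max(counts, key=counts.get)

print(guess_language("Das ist ein Test und nicht mehr."))  # -> de
print(guess_language("This is a test of the approach."))   # -> en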

Try measuring the occurrences of each letter in the text. For English and German texts, calculate the letter frequencies and, perhaps, their distributions. Having obtained these data, you can reason about which language's frequency distribution your text is closest to.

You could use Bayesian inference to determine the closest language (with a certain error probability), or perhaps there are other statistical methods for such tasks.
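
As a rough illustration of the letter-frequency idea (a simple distance comparison rather than full Bayesian inference; the reference texts are placeholders you would fill with real English and German samples):

from collections import Counter

def letter_distribution(text):
    # Normalized frequency of each alphabetic character.
    letters = [c for c in text.lower() if c.isalpha()]
    total = float(len(letters)) or 1.0
    return {c: n / total for c, n in Counter(letters).items()}

def distance(dist_a, dist_b):
    # Sum of absolute frequency differences over all letters seen.
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

def guess(text, reference_texts):
    # reference_texts: {"en": <large English sample>, "de": <large German sample>}
    doc = letter_distribution(text)
    refs = {lang: letter_distribution(sample) for lang, sample in reference_texts.items()}
    return min(refs, key=lambda lang: distance(doc, refs[lang]))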

The problem with using a list of stop words is one of robustness. Stop word lists are basically a set of rules, one rule per word. Rule-based methods tend to be less robust to unseen data than statistical methods. Some problems you will encounter are documents that contain equal counts of stop words from each language, documents that have no stop words, documents that have stop words from the wrong language, etc. Rule-based methods can't do anything their rules don't specify.

One approach that doesn't require you to implement Naive Bayes or any other complicated math or machine learning algorithm yourself is to count character bigrams and trigrams (depending on whether you have a lot or a little data to start with -- bigrams will work with less training data). Run the counts on a handful of documents (the more the better) of known source language and then construct an ordered list for each language by the number of counts. For example, English would have "th" as the most common bigram. With your ordered lists in hand, count the bigrams in a document you wish to classify and put them in order. Then go through each one and compare its location in the sorted unknown-document list to its rank in each of the training lists. Give each bigram a score for each language as

1 / ABS(RankInUnknown - RankInLanguage + 1)

Whichever language ends up with the highest score is the winner. It's simple, doesn't require a lot of coding, and doesn't require a lot of training data. Even better, you can keep adding data to it as you go on and it will improve. Plus, you don't have to hand-create a list of stop words and it won't fail just because there are no stop words in a document.
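
A rough sketch of this rank-based scoring, assuming you already have per-language bigram counts from training text. To keep the denominator from ever being zero, the formula is interpreted here as 1 / (|rank difference| + 1), which is an assumption about its intent:

from collections import Counter

def bigrams(text):
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def ranked(counts):
    # Map each bigram to its rank (0 = most frequent).
    return {bg: rank for rank, (bg, _) in enumerate(counts.most_common())}

def classify(document, language_ranks):
    doc_ranks = ranked(Counter(bigrams(document)))
    scores = {}
    for lang, ranks in language_ranks.items():
        scores[lang] = sum(1.0 / (abs(doc_rank - ranks[bg]) + 1)
                           for bg, doc_rank in doc_ranks.items() if bg in ranks)
    return max(scores, key=scores.get)

# language_ranks would be built once from training text, e.g.:
# language_ranks = {"en": ranked(Counter(bigrams(english_corpus))),
#                   "de": ranked(Counter(bigrams(german_corpus)))}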

It will still be confused by documents that contain equal symmetrical bigram counts. If you can get enough training data, using trigrams will make this less likely. But using trigrams means you also need the unknown document to be longer. Really short documents may require you to drop down to single character (unigram) counts.

All this said, you're going to have errors. There's no silver bullet. Combining methods and choosing the language that maximizes your confidence in each method may be the smartest thing to do.

English and German use the same set of letters except for ä, ö, ü and ß (eszett). You can look for those letters to determine the language.
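
A quick, hedged check along those lines (German written without umlauts, or an English text quoting a German name, will defeat it):

GERMAN_ONLY = set(u"äöüßÄÖÜ")

def looks_german(text):
    # Any German-specific character counts as evidence for German.
    return any(c in GERMAN_ONLY for c in text)

print(looks_german(u"Die Straße ist nass."))  # True
print(looks_german(u"The street is wet."))    # False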

You can also look at this text (Comparing two language identification schemes) from Grefenstette. It looks at letter trigrams and short words. Common trigrams for German: en_, er_, _de. Common trigrams for English: the_, he_, the...

There's also Bob Carpenter's How does LingPipe Perform Language ID?

I believe the standard procedure is to measure the quality of a proposed algorithm with test data (i.e. with a corpus). Define the percentage of correct analyses that you would like the algorithm to achieve, and then run it over a number of documents which you have manually classified.

As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian filter, e.g. SpamBayes. Rather than training it on ham and spam, train it on English and German. Use a portion of your corpus, run that through SpamBayes, and then test it on the complete data.

Language detection is not very difficult conceptually. Please look at my reply to a related question and other replies to the same question.

In case you want to take a shot at writing it yourself, you should be able to write a naive detector in half a day. We use something similar to the following algorithm at work and it works surprisingly well. Also read the Python implementation tutorial in the post I linked.

Steps:

  1. Take two corpora for the two languages and extract character-level bigrams, trigrams and whitespace-delimited tokens (words). Keep track of their frequencies. This step builds your "Language Model" for both languages.

  2. Given a piece of text, identify the character bigrams, trigrams and whitespace-delimited tokens and their corresponding "relative frequencies" for each corpus. If a particular "feature" (character bigram/trigram or token) is missing from your model, treat its "raw count" as 1 and use it to calculate its "relative frequency".

  3. The product of the relative frequencies for a particular language gives the "score" for the language. This is a very naive approximation of the probability that the sentence belongs to that language.

  4. The higher-scoring language wins (a sketch follows the notes below).

Note 1: We treat the "raw count" as 1 for features that do not occur in our language model. This is because, in reality, that feature would have a very small value but since we have a finite corpus, we may not have encountered it yet. If you take its count to be zero, then your entire product would also be zero. To avoid this, we assume that its occurrence count is 1 in our corpus. This is called add-one smoothing. There are other, more advanced smoothing techniques.

Note 2: Since you will be multiplying a large number of fractions, you can easily underflow to zero. To avoid this, you can work in logarithmic space and use this equation to calculate your score:

                a X b =  exp(log(a)+log(b))

Note 3: The algorithm I described is a "very-naive" version of the "Naive Bayes Algorithm".
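
A compact sketch of steps 1-4, using character trigrams only for brevity, with the raw-count-of-1 treatment for unseen features and log-space scoring (the corpus variables are placeholders you would fill from real training text):

import math
from collections import Counter

def trigrams(text):
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def build_model(corpus_text):
    # Step 1: trigram frequency counts for one language's corpus.
    counts = Counter(trigrams(corpus_text))
    return counts, sum(counts.values())

def log_score(text, model):
    # Steps 2-3: sum of log relative frequencies (the product, in log space),
    # treating the raw count of an unseen trigram as 1 (Note 1).
    counts, total = model
    return sum(math.log(counts.get(tg, 1) / float(total)) for tg in trigrams(text))

def classify(text, models):
    # Step 4: the higher-scoring language wins.
    return max(models, key=lambda lang: log_score(text, models[lang]))

# models = {"en": build_model(english_corpus), "de": build_model(german_corpus)}
# print(classify("Das ist ein kurzer deutscher Satz.", models))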

The stop-words approach for the two languages is quick and would be made quicker by heavily weighting ones that don't occur in the other language: "das" in German and "the" in English, for example. The use of such "exclusive words" would help extend this approach robustly over a larger group of languages as well.

If you're looking to flex your programming muscles trying to solve the problem yourself, I encourage you to do so; however, the wheel exists if you would like to use it.

Windows 7 ships with this functionality built in. A component called "Extended Linguistic Services" (ELS) has the ability to detect scripts and natural languages, and it's in the box, on any Windows 7 or Windows Server 2008 machine. Depending on whether you have any such machines available and what you mean when you say "free," that will do it for you. In any case, this is an alternative to Google or the other vendors mentioned here.

http://msdn.microsoft.com/en-us/library/dd317700(v=VS.85).aspx

And if you want to access this from .NET, there's some information on that here:

http://windowsteamblog.com/blogs/developers/archive/2009/05/18/windows-7-managed-code-apis.aspx

Hope that helps.

Isn't the problem several orders of magnitude easier if you've only got two languages (English and German) to choose from? In this case your approach of a list of stop words might be good enough.

Obviously you'd need to consider a rewrite if you added more languages to your list.

First things first, you should set up a test of your current solution and see if it reaches your desired level of accuracy. Success in your specific domain matters more than following a standard procedure.

If your method needs improving, try weighting your stop words by their rarity in a large corpus of English and German. Or you could use a more complicated technique like training a Markov model or Bayesian classifier. You could expand any of the algorithms to look at higher-order n-grams (for example, two- or three-word sequences) or other features in the text.
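
One way to read "weighting by rarity" is to give each stop word a vote proportional to how lopsided its corpus frequencies are between the two languages. A hedged sketch with made-up per-million frequencies (real numbers would come from your own reference corpora):

import math

# Hypothetical per-million-word frequencies; replace with measured values.
CORPUS_FREQ = {
    "the": {"en": 60000, "de": 30},
    "and": {"en": 28000, "de": 40},
    "der": {"en": 60, "de": 25000},
    "und": {"en": 30, "de": 28000},
}

def weight(word, lang, other):
    f = CORPUS_FREQ[word]
    # Large when the word is common in `lang` but rare in `other`.
    return math.log((f[lang] + 1.0) / (f[other] + 1.0))

def weighted_score(words, lang, other):
    return sum(weight(w, lang, other) for w in words if w in CORPUS_FREQ)

words = "der hund und die katze".split()
print("de" if weighted_score(words, "de", "en") > weighted_score(words, "en", "de") else "en")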

You can use the Google Language Detection API.

Here is a little program that uses it:

import json
import sys
import urllib

baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"

def detect(text):
    """Return the W3C language code of a natural-language text."""
    # The API only needs a sample; send at most the first 3000 characters.
    params = urllib.urlencode({'v': '1.0', 'q': text[0:3000]})
    resp = json.load(urllib.urlopen(baseUrl + "?" + params))
    return resp['responseData']['language']


def test():
    print "Type some text to detect its language:"
    while True:
        text = raw_input('#>  ')
        print detect(text)


if __name__ == '__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)

Other useful references:

Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html

Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/

Another Python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/

RFC 1766 defines W3C languages

Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry

Have you tried Apache Tika? It can determine the language of a given text:

http://www.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm

I have no experience with .NET but that link might help. If you can execute a jar in your environment, try this:

 java -jar tika-app-1.0.jar -l http://www.admin.ch/

Output:

de

Hope that helps.
