
Data structure for soundex algorithm?

Can anyone suggest what data structure to use for a soundex algorithm program? The language to be used is Java, so I would especially like to hear from anybody who has worked on this before in Java. The program should be able to read about 50,000 words, and it should be able to read a word and return the related words having the same soundex.

I don't want the program implementation, just some advice on what data structure to use.

TIP: If you use SQL as a data backend, you can let SQL handle it with the two SQL functions SOUNDEX and DIFFERENCE.

Maybe not what you wanted, but many people do not know that MS SQL has those two functions.

Well, soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
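
To illustrate the single pass, here is a minimal Java sketch of the common American Soundex rules. The class name `SoundexSketch` and the lookup-table layout are my own choices for illustration, and the handling of H and W (skipped without separating equal codes) is an assumption, since implementations differ on that rule.

```java
final class SoundexSketch {
    // Code for each letter A..Z: '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns the 4-character code, or "" if the word contains no ASCII letters. */
    static String encode(String word) {
        StringBuilder out = new StringBuilder(4);
        char last = '0'; // code of the previous significant letter
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (code != '-') { // H and W are skipped entirely
                if (code != '0' && code != last) {
                    out.append(code); // new consonant group
                }
                last = code; // a vowel resets the "same code" check
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Robert", "Rupert", "Ashcraft", "Tymczak"}) {
            System.out.println(w + " -> " + encode(w));
        }
    }
}
```

With these rules, "Robert" and "Rupert" both map to R163, which is what puts them in the same group.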

After that, the 4-character code can be treated as an integer key.
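
One possible way to do that in Java, assuming the code is four ASCII characters such as "R163", is to pack one character per byte into an `int`. The names `SoundexKey`, `pack`, and `unpack` are hypothetical, and this is only a sketch of one encoding among several that would work.

```java
final class SoundexKey {
    /** Packs a 4-character ASCII code such as "R163" into one int, one byte per character. */
    static int pack(String code) {
        if (code.length() != 4) {
            throw new IllegalArgumentException("expected 4 characters: " + code);
        }
        int key = 0;
        for (int i = 0; i < 4; i++) {
            char c = code.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII character in code: " + code);
            }
            key = (key << 8) | c; // shift earlier characters up, append this one
        }
        return key;
    }

    /** Reverses pack(): recovers the 4-character code from the int key. */
    static String unpack(int key) {
        char[] chars = new char[4];
        for (int i = 3; i >= 0; i--) {
            chars[i] = (char) (key & 0xFF); // lowest byte is the last character
            key >>>= 8;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        int key = pack("R163");
        System.out.println(Integer.toHexString(key) + " <-> " + unpack(key));
    }
}
```

In Java a `String` key works just as well in a `HashMap`, so the integer form is mainly a compactness choice.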

Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory, so nothing fancy is required.

Then walk the dictionary; each bucket is a group of similar-sounding words.

Actually, here is the whole program in Perl:

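#!/usr/bin/perl
use strict;
use warnings;
use Text::Soundex;
use Data::Dumper;

# Build a map from soundex code to the list of dictionary words with that code.
open(my $dict, '<', '/usr/share/dict/linux.words')
        or die "Cannot open dictionary: $!";
my %dictionary;
while (my $word = <$dict>) {
        chomp $word;
        my $code = soundex($word);
        next unless defined $code;    # soundex() returns undef for input without letters
        push @{ $dictionary{$code} }, $word;
}
close($dict);

# For each word read from input, print the dictionary words sharing its code.
while (my $line = <>) {
        chomp $line;
        foreach my $word (split ' ', $line) {
            my $code = soundex($word);
            print Dumper($dictionary{$code}) if defined $code;
        }
}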
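
Since the question asks about Java, the same dictionary-of-buckets idea might look roughly like the sketch below. The class `SoundexIndex` and its method names are hypothetical. A compact soundex pass is inlined only so the block runs on its own, and it packs the four code characters straight into an `int` key. The H/W handling is an assumption, as implementations vary.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SoundexIndex {
    // Code for each letter A..Z: '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    // Integer soundex key -> sorted set of words that share it.
    private final Map<Integer, Set<String>> buckets = new HashMap<>();

    /** Soundex code packed one ASCII character per byte; 0 if the word has no letters. */
    static int key(String word) {
        int key = 0;
        int count = 0;
        char last = '0';
        for (int i = 0; i < word.length() && count < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue;
            }
            char code = CODES.charAt(c - 'A');
            if (count == 0) {
                key = c; // first letter is kept as-is
                count = 1;
                last = code;
            } else if (code != '-') { // H and W are skipped
                if (code != '0' && code != last) {
                    key = (key << 8) | code;
                    count++;
                }
                last = code;
            }
        }
        if (count == 0) {
            return 0;
        }
        for (; count < 4; count++) {
            key = (key << 8) | '0'; // pad with zero digits
        }
        return key;
    }

    /** Adds a word to the bucket for its key; words without letters are ignored. */
    void add(String word) {
        int k = key(word);
        if (k != 0) {
            buckets.computeIfAbsent(k, unused -> new TreeSet<>()).add(word);
        }
    }

    /** Returns every stored word with the same key as the given word. */
    Set<String> similar(String word) {
        return Collections.unmodifiableSet(
                buckets.getOrDefault(key(word), Collections.emptySet()));
    }

    /** Read-only view of all buckets, for walking the groups. */
    Map<Integer, Set<String>> groups() {
        return Collections.unmodifiableMap(buckets);
    }

    public static void main(String[] args) {
        SoundexIndex index = new SoundexIndex();
        for (String w : new String[] {"Robert", "Rupert", "Rubin", "Smith", "Smyth"}) {
            index.add(w);
        }
        for (Set<String> group : index.groups().values()) {
            System.out.println(group); // each bucket is one group of similar-sounding words
        }
        System.out.println(index.similar("Smythe"));
    }
}
```

A `HashMap` gives constant-time lookups on average; for roughly 50,000 words the whole index should fit comfortably in memory, as noted above.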

I believe you just need to convert the original strings into soundex keys and store them in a hashtable; the value for each entry in the table would be a collection of the original strings mapping to that soundex.

The Multimap collection interface (and its implementations) in Google Collections (now part of Guava) would be useful to you.

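import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SpellChecker
{

  /** Strategy that maps a word to its phonetic key (for example, a soundex code). */
  interface Hash {
    String hash(String word);
  }

  private final Hash hash;

  /** Maps each phonetic key to the set of words that produce it. */
  private final Map<String, Set<String>> collisions;

  SpellChecker(Hash hash) {
    this.hash = hash;
    collisions = new TreeMap<String, Set<String>>();
  }

  /** Adds a word under its key; returns false if the word was already present. */
  boolean addWord(String word) {
    String key = hash.hash(word);
    Set<String> similar = collisions.get(key);
    if (similar == null)
      collisions.put(key, similar = new TreeSet<String>());
    return similar.add(word);
  }

  /** Returns a read-only view of the stored words that share the given word's key. */
  Set<String> similar(String word) {
    Set<String> similar = collisions.get(hash.hash(word));
    if (similar == null)
      return Collections.emptySet();
    else
      return Collections.unmodifiableSet(similar);
  }

}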

The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters they output, etc.).

Since soundex is a hash, I would use a hash table keyed by the soundex code.

You want a 4-byte integer.

The soundex algorithm always returns a 4-character code; if you use ANSI inputs, you'll get 4 bytes back (represented as 4 characters).

So store the codes returned in a hashtable, convert your word to the code, and look it up in the hashtable. It's really that easy.

