
Deducing string transformation rules

Original post: 2011-09-28 03:58:20 3 2 algorithm / machine-learning

I have a set of pairs of character strings, e.g.:

abba - aba, haha - aha, baa - ba, exb - esp, xa - za

The second (right) string in each pair is somewhat similar to the first (left) string.

That is, a character from the first string can be represented by nothing, by itself, or by a character from a small set of characters.

There's no simple rule for this character-to-character mapping, although there are some patterns.

Given several thousand such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?

The solution can be approximate, working correctly for, say, 80-95% of the strings.

Would you recommend using some kind of genetic algorithm? If so, how?

2 answers

If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. Conversely, if you had such tables, you could align the characters using dynamic time warping (http://en.wikipedia.org/wiki/Dynamic_time_warping). One approach is therefore to guess an alignment (e.g. one-for-one as a starting point, or just aligning the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that each pass increases some measure of optimality or probability, climbing to a local maximum.
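The align-then-recount loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. It swaps DTW for a simpler edit-distance-style dynamic program in which each left character emits either nothing or exactly one right character, which matches the question's description but ignores groups of characters. The function names (`initial_cost`, `align`, `make_cost`, `learn_table`), the smoothing constant and the number of rounds are all made up for the example.

```python
import math
from collections import Counter, defaultdict

def initial_cost(a, b):
    # Starting guess (an assumption): a character most likely maps to itself.
    return 0.0 if a == b else 1.0

def align(left, right, cost):
    """Cheapest alignment in which every left character emits '' or one
    right character. Returns [(left_char, emitted), ...] or None."""
    n, m = len(left), len(right)
    inf = float("inf")
    # best[i][j] = minimal cost of turning left[:i] into right[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(0, min(i, m) + 1):
            # Option 1: left[i-1] emits nothing.
            c = best[i - 1][j] + cost(left[i - 1], "")
            if c < best[i][j]:
                best[i][j], back[i][j] = c, ""
            # Option 2: left[i-1] emits right[j-1].
            if j > 0:
                c = best[i - 1][j - 1] + cost(left[i - 1], right[j - 1])
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, right[j - 1]
    if best[n][m] == inf:
        return None
    # Walk the back-pointers from the end to recover the alignment.
    steps, j = [], m
    for i in range(n, 0, -1):
        out = back[i][j]
        steps.append((left[i - 1], out))
        if out:
            j -= 1
    return steps[::-1]

def make_cost(counts, smoothing=0.1):
    """Turn mapping counts into costs: -log of a smoothed relative frequency."""
    outputs = {b for seen in counts.values() for b in seen}
    k = len(outputs) + 1
    def cost(a, b):
        seen = counts.get(a, Counter())
        total = sum(seen.values())
        return -math.log((seen[b] + smoothing) / (total + smoothing * k))
    return cost

def learn_table(pairs, rounds=5):
    """Alternate between aligning all pairs and recounting the mappings."""
    cost = initial_cost
    counts = defaultdict(Counter)
    for _ in range(rounds):
        counts = defaultdict(Counter)
        for left, right in pairs:
            alignment = align(left, right, cost)
            if alignment is None:  # right string too long for this model
                continue
            for a, b in alignment:
                counts[a][b] += 1
        cost = make_cost(counts)
    return counts

pairs = [("abba", "aba"), ("haha", "aha"), ("baa", "ba"),
         ("exb", "esp"), ("xa", "za")]
table = learn_table(pairs)
for ch in sorted(table):
    print(ch, dict(table[ch]))
```

The resulting table only says how often each left character maps to each output. Turning it into rules that reach the 80-95% target would most likely need context as well (for example the neighbouring characters), which this sketch does not attempt.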

There is probably some way of doing this by building a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not choose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.

You can use text-to-speech to create sound waves, then compare the sound waves with each other and score the matches as percentages.

This is my theory of how Google has such an advanced spell checker.