Soundex algorithm in Python (homework help request)

Question

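The US Census Bureau uses a special encoding called "soundex" to locate information about a person. The soundex is a coding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code for a surname. The user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.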

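Soundex Coding Guide

Soundex assigns a number to various consonants. Consonants that sound alike are assigned the same number:

Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.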
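
The coding guide can be written down as a lookup table. This is only a sketch: the names SOUNDEX_GROUPS, SOUNDEX_CODES and code_of are made up for illustration and are not part of the assignment.

```python
# Consonant groups from the coding guide, keyed by their digit.
SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the groups into a letter -> digit lookup.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}


def code_of(letter):
    """Return the Soundex digit for a letter, or "" if it is disregarded."""
    return SOUNDEX_CODES.get(letter.upper(), "")


print(code_of("s"), code_of("K"), repr(code_of("a")))
```

Returning an empty string for disregarded letters is a design choice of this sketch; any marker that cannot be confused with a digit would work as well.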

There are 3 additional Soundex coding rules to follow. A good program design would implement each of these as one or more separate functions.

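Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names With Letters Side-by-Side That Have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P236 (P, F ignored since it is considered the same as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T522 (T, 5 for the M, 2 for the C, Z ignored (see the "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separates two consonants that have the same soundex code, the consonant to the right of the H or W is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored since it is the same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.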
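
One possible way to combine the three rules is a small pipeline: drop H and W so the consonants around them become neighbours, keep vowels as separators, collapse runs of equal codes, then discard the separators. The sketch below makes two assumptions the assignment does not spell out: Y is dropped like H and W, and the input is non-empty and starts with a letter. The names CODES, VOWELS and soundex are illustrative only.

```python
import itertools

VOWELS = "AEIOU"
CODES = {
    letter: str(digit)
    for digit, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
    for letter in letters
}


def soundex(surname):
    """Encode a surname; assumes it is non-empty and starts with a letter."""
    name = surname.upper()
    # Keep the first letter, then only coded consonants and vowels.
    # Dropping H and W makes the consonants around them neighbours,
    # which is what rule 3.b asks for. Y is dropped too (an assumption).
    kept = name[0] + "".join(
        ch for ch in name[1:] if ch in CODES or ch in VOWELS)
    # Vowels (and an uncoded first letter) become ".", so they still
    # keep equal digits apart (rule 3.a).
    symbols = [CODES.get(ch, ".") for ch in kept]
    # Rules 1 and 2: a run of equal symbols counts once.
    collapsed = [symbol for symbol, _ in itertools.groupby(symbols)]
    # The first run belongs to the first letter, which stays a letter.
    digits = "".join(s for s in collapsed[1:] if s != ".")
    # Pad with zeroes and cut to four characters.
    return (name[0] + digits + "000")[:4]


for example in ["Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(example, soundex(example))
```

Rule 1 needs no code of its own here, because two identical letters always share a code and so fall under the same collapsing step as rule 2.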

Here is my code so far:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
        nextletter = surname[i]
        if nextletter in ['B','F','P','V']:
            outstring = outstring + '1'

        elif nextletter in ['C','G','J','K','Q','S','X','Z']:
            outstring = outstring + '2'

        elif nextletter in ['D','T']:
            outstring = outstring + '3'

        elif nextletter in ['L']:
            outstring = outstring + '4'

        elif nextletter in ['M','N']:
            outstring = outstring + '5'

        elif nextletter in ['R']:
            outstring = outstring + '6'

print outstring

It does enough to do what is asked; I am just not sure how to code the three rules. That is where I need help, so any help is appreciated.

Answer 1

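I would suggest you try the following.

Store CurrentCoded and LastCoded variables to work with before appending to your output
Break the system down into useful functions, such as
1. Boolean IsVowel(Char)
2. Int Coded(Char)
3. Boolean IsRule1(Char, Char)

Once you break it down nicely, it should become easier to manage.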
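
A rough Python sketch of that decomposition might look like the following. The helper names mirror the suggestion (is_vowel, coded, is_rule1), but the bodies are my own guess at what was intended. The loop remembers the last significant character instead of its code so that is_rule1 can take two characters, and it assumes a non-empty surname that starts with a letter. Treating Y like H and W is also an assumption.

```python
CODES = {
    "B": 1, "F": 1, "P": 1, "V": 1,
    "C": 2, "G": 2, "J": 2, "K": 2, "Q": 2, "S": 2, "X": 2, "Z": 2,
    "D": 3, "T": 3,
    "L": 4,
    "M": 5, "N": 5,
    "R": 6,
}


def is_vowel(char):
    """True for the letters that separate equal codes (rule 3.a)."""
    return char.upper() in "AEIOU"


def coded(char):
    """Soundex digit for a consonant, or 0 for a disregarded character."""
    return CODES.get(char.upper(), 0)


def is_rule1(char, other):
    """True when two neighbouring letters count as one (rules 1 and 2)."""
    return coded(char) != 0 and coded(char) == coded(other)


def soundex(surname):
    """Assumes a non-empty surname that starts with a letter."""
    output = surname[0].upper()
    last_char = surname[0]  # last character that was not skipped
    for char in surname[1:]:
        current_coded = coded(char)
        if is_vowel(char):
            last_char = char  # rule 3.a: a vowel separates equal codes
        elif current_coded != 0:
            if not is_rule1(last_char, char):
                output += str(current_coded)
            last_char = char
        # H, W, Y and anything else is skipped without touching
        # last_char, so it never separates two consonants (rule 3.b).
    return (output + "000")[:4]


print(soundex("Ashcraft"), soundex("Tymczak"))
```

Because each helper is a plain function, each rule can be checked on its own, for example is_rule1("C", "K") for the side-by-side rule.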

Answer 2

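This is hardly perfect (for instance, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently testable functions, so it isn't really going to serve as an answer to the homework question. But this is how I'd implement it: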

import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes   = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    # Start from the first letter's code, so that a following letter
    # with the same code is skipped (e.g. the F in Pfister).
    prev_code = d.get(result, "")
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
            if len(result) >= 4: break
        prev_code = code
    return (result + "0000")[:4]

Answer 3

import itertools #importing itertools module to access the groupby function
import re #importing re module to access the sub function

surname = input("Enter surname of the author: ") #asks user to input the author's surname

while surname != "": #loops as long as the input is not an empty line

    str_ini = surname[0].upper() #the initial letter of the surname, in upper case
    mod_str1 = surname[1:] #the surname without its first letter

    mod_str2 = str_ini + re.sub(r'[yhwYHW]', '', mod_str1)
                #dropping Y, H and W after the initial, so they never separate two consonants
                #the vowels stay in for now, so they keep equal numbers apart

    mod_str21 = re.sub(r'[bfpvBFPV]', '1', mod_str2)
    mod_str22 = re.sub(r'[cgjkqsxzCGJKQSXZ]', '2', mod_str21)
    mod_str23 = re.sub(r'[dtDT]', '3', mod_str22)
    mod_str24 = re.sub(r'[lL]', '4', mod_str23)
    mod_str25 = re.sub(r'[mnMN]', '5', mod_str24)
    mod_str26 = re.sub(r'[rR]', '6', mod_str25)
                #substituting given letters with specific numbers as required by the soundex algorithm
                #the initial is substituted too, so a following letter with the same number merges into it

    mod_str3 = ''.join(char for char, rep in itertools.groupby(mod_str26))
                #replacing each run of identical characters with a single character

    mod_str4 = str_ini + re.sub(r'[aeiouAEIOU]', '', mod_str3[1:])
                #putting the initial back in place of its run and eliminating the vowels

    mod_str5 = mod_str4[:4].ljust(4, '0') #cutting to four characters and padding with trailing zeros

    print (mod_str5 + "\n")

    print ("Press enter to exit") #tells the user that an empty line (just pressing enter) ends the loop
    surname = input("Enter surname of the author: ") #asking for the next input if the user wants to carry on

exit(0) #exiting the program once the while loop ends

Answer 1
1 2009-10-26 17:58:15

Answer 2
0 2009-10-26 20:16:47

Answer 3
0 2018-11-15 22:09:01