
Soundex algorithm in Python (homework help request)

The US Census Bureau uses a special encoding called "soundex" to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code when given a surname as input. A user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.
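The fixed four-character shape described above can be sketched in Python as follows. `finish_code` is a hypothetical helper name, and the sketch assumes the digits for the remaining letters have already been worked out elsewhere.

```python
def finish_code(first_letter, digits):
    """Build a four-character code from the first letter and a digit string.

    Hypothetical helper: assumes `digits` already holds the Soundex digits
    for the rest of the surname, in order.
    """
    code = first_letter.upper() + digits
    # Pad with zeros, then cut to exactly four characters.
    return (code + "000")[:4]


print(finish_code("j", "25"))    # J250
print(finish_code("g", "3626"))  # G362
```

Padding first and slicing afterwards covers both the "too short" and the "too long" case in one expression.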

Soundex Coding Guide

Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:

Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.
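One possible way to hold the coding guide in Python is a lookup table. The names `SOUNDEX_GROUPS`, `SOUNDEX_DIGIT` and `digit_for` are illustrative, and returning an empty string for disregarded letters is an assumption of this sketch rather than part of the assignment.

```python
# Soundex coding guide as a lookup table (names here are illustrative).
SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the groups into a letter -> digit dictionary.
SOUNDEX_DIGIT = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}


def digit_for(letter):
    """Return the Soundex digit for a letter, or "" for disregarded letters."""
    return SOUNDEX_DIGIT.get(letter.upper(), "")


print(digit_for("k"))        # 2
print(repr(digit_for("E")))  # ''
```

A dictionary lookup replaces the chain of `if`/`elif` membership tests and keeps the guide in one place.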

There are 3 additional Soundex Coding Rules that are followed. A good program design would implement each of these as one or more separate functions.

Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).
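Rules 1 and 2 both come down to collapsing neighbouring letters that share a code, so one sketch can cover them. The function name `encode_adjacent` and the use of "0" as a placeholder for disregarded letters are assumptions made for illustration. The sketch expects a non-empty surname and deliberately gives H and W no special treatment.

```python
from itertools import groupby

# Letter -> digit table from the coding guide.
DIGITS = {}
for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                       ("4", "L"), ("5", "MN"), ("6", "R")]:
    for letter in letters:
        DIGITS[letter] = digit


def encode_adjacent(surname):
    """Apply Rules 1 and 2 only; H and W get no special treatment here."""
    name = [c for c in surname.upper() if c.isalpha()]
    # Disregarded letters get the placeholder "0" so they break up runs.
    codes = [DIGITS.get(c, "0") for c in name]
    # Collapse each run of equal codes (doubled or same-numbered letters).
    collapsed = [code for code, _run in groupby(codes)]
    # The first run belongs to the first letter, which is kept as a letter.
    digits = "".join(code for code in collapsed[1:] if code != "0")
    return (name[0] + digits + "000")[:4]


print(encode_adjacent("Gutierrez"))  # G362
print(encode_adjacent("Pfister"))    # P236
print(encode_adjacent("Jackson"))    # J250
```

Encoding the first letter along with the rest is what lets the F in Pfister merge into the P.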

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.
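The difference between 3.a and 3.b can be sketched by dropping H and W before collapsing runs while leaving vowels in place as separators. This is one possible approach, not a reference implementation: the name `soundex` is illustrative, a non-empty surname is assumed, and treating Y like a vowel is an assumption because the rules above do not say which group it belongs to.

```python
from itertools import groupby

DIGITS = {letter: digit
          for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                                 ("4", "L"), ("5", "MN"), ("6", "R")]
          for letter in letters}


def soundex(surname):
    """Sketch covering Rules 1, 2, 3.a and 3.b; assumes a non-empty name."""
    name = [c for c in surname.upper() if c.isalpha()]
    first, rest = name[0], name[1:]
    # Rule 3.b: H and W do not separate consonants, so drop them first.
    rest = [c for c in rest if c not in "HW"]
    # Vowels (and Y, by assumption) become "0" so they still act as separators.
    codes = [DIGITS.get(c, "0") for c in [first] + rest]
    # Rules 1 and 2: collapse each run of equal codes into one.
    collapsed = [code for code, _run in groupby(codes)]
    # Skip the first letter's own code, then drop the "0" separators (Rule 3.a).
    digits = "".join(code for code in collapsed[1:] if code != "0")
    return (first + digits + "000")[:4]


print(soundex("Tymczak"))   # T522
print(soundex("Ashcraft"))  # A261
```

With this ordering, SMITH and SMYTH from the introduction both come out as S530.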

So far this is my code:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
        nextletter = surname[i]
        if nextletter in ['B','F','P','V']:
            outstring = outstring + '1'

        elif nextletter in ['C','G','J','K','Q','S','X','Z']:
            outstring = outstring + '2'

        elif nextletter in ['D','T']:
            outstring = outstring + '3'

        elif nextletter in ['L']:
            outstring = outstring + '4'

        elif nextletter in ['M','N']:
            outstring = outstring + '5'

        elif nextletter in ['R']:
            outstring = outstring + '6'

print outstring

It does what is asked so far, but I am not sure how to code the three rules. That is where I need help, so any help is appreciated.

I would suggest you try the following.

  • Keep a CurrentCoded and a LastCoded variable to compare before appending to your output
  • Break the system down into useful functions, such as
    1. Boolean IsVowel(Char)
    2. Int Coded(Char)
    3. Boolean IsRule1(Char, Char)

Once you break it down nicely, it should become easier to manage.
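A minimal Python sketch of that decomposition might look like the following. The function bodies are one interpretation of the suggested signatures, not the answerer's own code. The sketch assumes a non-empty surname made of letters and treats Y like a vowel.

```python
CODES = {"BFPV": 1, "CGJKQSXZ": 2, "DT": 3, "L": 4, "MN": 5, "R": 6}


def is_vowel(ch):
    """True for A, E, I, O, U (either case)."""
    return ch.upper() in set("AEIOU")


def coded(ch):
    """Return the Soundex number for a single letter, or 0 if it has none."""
    for letters, number in CODES.items():
        if ch.upper() in letters:
            return number
    return 0


def is_rule1(previous, current):
    """True when two neighbouring letters are the same letter."""
    return previous.upper() == current.upper()


def soundex(surname):
    """Sketch only; assumes a non-empty surname made of letters."""
    output = surname[0].upper()
    previous = surname[0]
    last_coded = coded(previous)
    for current in surname[1:]:
        if current.upper() in ("H", "W"):
            continue  # Rule 3.b: H and W leave last_coded untouched
        current_coded = coded(current)
        if is_vowel(current) or current_coded == 0:
            last_coded = 0  # Rule 3.a: a vowel separates equal codes
        elif not is_rule1(previous, current) and current_coded != last_coded:
            output += str(current_coded)
            last_coded = current_coded
        previous = current
    return (output + "000")[:4]


print(soundex("Gutierrez"))  # G362
print(soundex("Ashcraft"))   # A261
```

Because each helper takes plain characters, the rules can be tested one at a time before the loop is written.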

This is hardly perfect (for instance, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently testable functions, so it's not really going to serve as an answer to the homework question. But this is how I'd implement it:

import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes   = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    # Start from the first letter's own code so that a following letter
    # with the same code is skipped (Pfister -> P236, not P123).
    prev_code = d.get(result, "")
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
            if len(result) >= 4:
                break
        prev_code = code
    return (result + "0000")[:4]

import itertools  # groupby collapses runs of identical characters
import re  # re.sub strips and substitutes letters

surname = input("Enter surname of the author: ")  # ask the user for the author's surname

while surname != "":  # loop until the user enters an empty line

    str_ini = surname[0]  # initial letter of the surname
    mod_str1 = surname[1:]  # surname without its first letter

    # remove the letters that Soundex disregards
    mod_str2 = re.sub(r'[aeiouyhwAEIOUYHW]', '', mod_str1)

    # substitute the remaining letters with their Soundex digits
    mod_str21 = re.sub(r'[bfpvBFPV]', '1', mod_str2)
    mod_str22 = re.sub(r'[cgjkqsxzCGJKQSXZ]', '2', mod_str21)
    mod_str23 = re.sub(r'[dtDT]', '3', mod_str22)
    mod_str24 = re.sub(r'[lL]', '4', mod_str23)
    mod_str25 = re.sub(r'[mnMN]', '5', mod_str24)
    mod_str26 = re.sub(r'[rR]', '6', mod_str25)

    mod_str3 = str_ini.upper() + mod_str26  # prepend the initial to the encoded remainder

    # collapse each run of identical characters into a single character
    mod_str4 = ''.join(char for char, rep in itertools.groupby(mod_str3))

    mod_str5 = mod_str4[:4]  # keep at most four characters

    # pad with trailing zeros up to four characters
    print(mod_str5.ljust(4, "0") + "\n")

    print("Press enter to exit")  # an empty line ends the while loop
    surname = input("Enter surname of the author: ")  # ask for the next surname
