
Python: stemming words for local languages

I have a problem stemming words in my local language using a rule-based algorithm. Can anybody who knows Python help me?

In my language, some words are pluralized by repeating the first two or three characters (sounds).

For example:

Diimaa (root word) ==> Diddiimaa (plural word)
Adii (root word) ==> Adadii (plural word)

I want my program to strip "Did" from the first example and "Ad" from the second.

The following is my code; it does not return any result:

`def compput(mm):   
    vv=1
    for i in mm:
        if seevowel(i)==1:
            inxt=mm.index(i)+1
            if inxt<len(mm)-1 and seevowel(mm[inxt])==0: 
                vv=vv+1            
    return vv
def stemm_maker(tkn):
    for i in range(len(tkn)):
        if (i[0] == i[2] and i[1] == i[3]):
            stem = i[2:]
            if compput(stem) > 0:
                return stem
        elif ((i[0] == i[2] or i[0]== i[3]) and i[1] == i[4]):
            stem = i[3:]
            if compput(self) > 0:
                return stem
       else:
           return tkn
    print(stem)`

One way to attack this problem is with regular expressions.

Looking at these pairs (found here):

adadii       adii
babaxxee     baxxee
babbareedaa  bareedaa
diddiimaa    diimaa
gaggaarii    gaarii
guguddaa     guddaa
hahhamaa     hamaa
hahapphii    happhii

the rule appears to be

if the word starts with XY...
then the reduplicated word is either XYXY... or XYXXY...
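
To make the rule concrete, here is a small sketch that builds both candidate reduplicated forms from a root. The function name `reduplicate_candidates` is made up for illustration, and the rule itself is only inferred from the pairs listed above, so treat this as a reading of the data rather than a description of the language's grammar.

```python
def reduplicate_candidates(root):
    """Return the two plural shapes the inferred rule allows for a root XY...

    Assumption: the root has at least two characters.
    """
    x, y = root[0], root[1]
    return [
        x + y + root,      # XYXY...  shape, e.g. adii   -> adadii
        x + y + x + root,  # XYXXY... shape, e.g. diimaa -> diddiimaa
    ]


for root in ('adii', 'diimaa', 'hamaa'):
    print(root, '->', reduplicate_candidates(root))
```

Each plural in the list above is one of the two candidates for its root; which of the two a given word takes is not something this sketch tries to predict.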

In the regex language this can be expressed as

^(.)(.)\1?(?=\1\2)

which means:

 char 1
 char 2
 maybe char 1
 followed by
    char 1
    char 2
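
As a quick check of that reading, the sketch below prints the part of each word that the pattern consumes. The helper name `reduplicated_prefix` is hypothetical. Because the final `\1\2` sits inside a lookahead, it is tested but not consumed, so the match covers only the reduplicated prefix.

```python
import re

PATTERN = re.compile(r'^(.)(.)\1?(?=\1\2)')


def reduplicated_prefix(word):
    """Return the prefix matched by the pattern, or '' if there is no match."""
    m = PATTERN.match(word)
    return m.group(0) if m else ''


for word in ('adadii', 'diddiimaa', 'diimaa'):
    print(word, '->', repr(reduplicated_prefix(word)))
```

For 'adadii' the optional `\1?` first tries to take the second 'a', the lookahead then fails, and the regex engine backtracks to the shorter prefix 'ad'.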

Complete example:

test = {
    'adadii': 'adii',
    'babaxxee': 'baxxee',
    'babbareedaa': 'bareedaa',
    'diddiimaa': 'diimaa',
    'gaggaarii': 'gaarii',
    'guguddaa': 'guddaa',
    'hahhamaa': 'hamaa',
    'hahapphii': 'happhii',
}

import re

def singularize(word):
    m = re.match(r'^(.)(.)\1?(?=\1\2)', word)
    if m:
        return word[len(m.group(0)):]
    return word

for p, s in test.items():
    assert singularize(p) == s
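
The words in the question are capitalised ("Diddiimaa", "Adadii"), and the pattern above compares characters exactly, so it would not match them. One possible variant, assuming a lower-cased stem is acceptable, passes `re.IGNORECASE`. The name `singularize_ci` is hypothetical and not part of the code above.

```python
import re


def singularize_ci(word):
    # With re.IGNORECASE the backreferences \1 and \2 also match
    # case-insensitively, so a captured 'D' can match a later 'd'.
    m = re.match(r'^(.)(.)\1?(?=\1\2)', word, re.IGNORECASE)
    if m:
        # Assumption: the stem is returned in lower case.
        return word[len(m.group(0)):].lower()
    # No reduplication detected: return the word unchanged.
    return word


print(singularize_ci('Diddiimaa'))
print(singularize_ci('Adadii'))
```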

This is the answer to my own question above. I tried the following rule-based code and it works correctly. I checked it with the words assigned to jechoota:

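jechoota = "diddiimaa adadii babaxxee babbareedaa gaggaarii guguddaa hahhamaa hahapphii"

tokens = jechoota.split()

def stem(word):
    # XYXY... pattern: drop the repeated first two characters.
    if len(word) >= 4 and word[0] == word[2] and word[1] == word[3]:
        return word[2:]
    # XYXXY... pattern: drop the first three characters.
    if len(word) >= 5 and word[0] == word[2] == word[3] and word[1] == word[4]:
        return word[3:]
    # No reduplication detected: return the word unchanged.
    return word

for token in tokens:
    print(stem(token))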
