Grammar NLTK for numbers

I'm writing code to parse a list (or a dictionary/tuple) whose elements are strings or numbers. But I'm having an issue: I can parse single-digit numbers (0 to 9) but not larger ones. Here is my code:

import nltk

grammaire = nltk.CFG.fromstring("""
    L -> OPEN CONTENT CLOSE
    OPEN -> "["
    CLOSE -> "]"
    CONTENT -> Element Seq |   
    Seq -> | S Element Seq
    S -> ","
    Element -> Word | nombre | T | L | D
    T -> "(" BeginTuple ")"
    BeginTuple -> ElementTuple S ElementTuple EndTuple
    EndTuple -> S ElementTuple |  
    ElementTuple -> nombre | T
    D -> "{" BeginDic "}"
    BeginDic -> ElementDic EndDic
    EndDic -> S ElementDic EndDic |
    ElementDic -> Key ":" Value
    Key -> Word
    Value -> nombre | T | L
    Word -> "Bonjour" | "Aurevoir" | "Bye" | "Cya" | "Coucou" | " " | "Hello" | "Hi" 
    nombre -> chiffre | chiffre nombre
    chiffre ->  '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    """)

sent,res,elmt = "[{Bonjour:1,Hello:(1,2)}]",[],''
c = '()[]{}:,'
for x in sent:
    if x in c:
        if len(elmt) == 0:
            res += [x,]
        else:
            #try: res += [int(elmt),] # if it's a number, convert it to int
            #except: res += [elmt,]
            res += [elmt,]
            elmt = ""
            res += [x,]
    else:
        elmt += x
print(res)

The important lines are at the beginning, in the "chiffre" and "nombre" rules. What am I doing wrong? I also need to do the same with strings (so chiffre would be ' "a" | "b" | "c"... ' and nombre would stay the same).

I tried putting the numbers into my list as int rather than str, but that doesn't work either (see the commented-out try/except). Of course, I then draw the resulting parse tree.

The narrow answer to your question is that your tokenizer groups multi-digit numbers as single tokens. If you tokenize each digit separately, it will work. More generally, you should tackle the task of tokenization more thoroughly; your code is too brittle to support things like treating quote-delimited strings as single tokens, for example.
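As a rough illustration of tokenizing each digit separately, here is a minimal sketch. The name `tokenize` and the regular expression are assumptions made for this example, not part of the original code; it treats each bracket, colon, and comma as its own token, emits every digit individually, and keeps any other run of characters (such as Bonjour) together as one word.

```python
import re

# Hypothetical tokenizer for illustration only.
# Alternatives, tried in order:
#   1. one structural character: ( ) [ ] { } : ,
#   2. exactly one digit, so "12" yields "1" and "2"
#   3. a run of any other non-space characters (a word such as Bonjour)
TOKEN_RE = re.compile(r"[()\[\]{}:,]|\d|[^\d()\[\]{}:,\s]+")


def tokenize(text):
    """Return the list of tokens, with every digit as a separate token."""
    return TOKEN_RE.findall(text)


tokens = tokenize("[{Bonjour:12,Hello:(1,20)}]")
print(tokens)
```

With digits split this way, a rule such as nombre -> chiffre | chiffre nombre can match multi-digit numbers, and the token list can be handed to a parser built from the grammar, for example nltk.ChartParser(grammaire).parse(tokens).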

However: why are you trying to parse the string representation of an arbitrary Python list? Don't do it. If you're reading data you wrote yourself, write it out in a simpler form so that you can read it back easily. For example, does each record consist of a label and a list of numbers? Write each record as one space-delimited row. That's trivial to read in and parse.
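The space-delimited layout could look something like the sketch below. The helper names `format_record` and `parse_record` are made up for this example, and it assumes labels contain no whitespace and all numbers are integers.

```python
def format_record(label, numbers):
    """Render one record as a single space-delimited row: label first, then numbers."""
    return " ".join([label] + [str(n) for n in numbers])


def parse_record(line):
    """Split a row back into (label, list of ints)."""
    label, *fields = line.split()
    return label, [int(f) for f in fields]


row = format_record("Hello", [1, 20, 300])
print(row)
print(parse_record(row))
```

Each row would then be written to the file on its own line, and reading the file back is just a loop calling parse_record on every non-empty line.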

For data with more complicated structure, use json to write out your file and read it back in. It handles all the parsing for you.
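A minimal sketch of the json round trip, using the standard library's json.dumps and json.loads on data shaped like the example in the question. The variable names are illustrative. Note that JSON has no tuple type, so a Python tuple is written as an array and comes back as a list.

```python
import json

# Data similar to the question's example; the tuple (1, 2) is stored as a list.
data = [{"Bonjour": 1, "Hello": [1, 2]}]

text = json.dumps(data)      # serialize to a JSON string
restored = json.loads(text)  # parse it back into Python objects

print(text)
print(restored == data)
```

To work with files directly, json.dump(data, f) and json.load(f) do the same thing on an open file object.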
