
Lists in Python (Using NLTK)

I'm trying to build a list of lists of the form [[(the, cat), (cat, with), (with, fur)], [(the, dog), (dog, with), (with, ball)], ...] from a text file that has one sentence per line, like:

the cat with fur \n the dog with ball \n

The problem is that while I read the file line by line, build the tuples word by word (variable label), and append them to the final list (variable connection), there are instances where connection ends up empty. Well, not actually empty, but the list shows up like [[], [], []].

This is the code for that part of the program:

with open('corpus.txt', 'r') as f:
    for line in f:
        cnt = 0
        sa = nltk.word_tokenize(line)
        label[:] = []

        for i in sa:
            words.append(i)
            if cnt>0:
                try: label +=[(prev , i)]
                except: NameError
            prev = i 
            cnt = cnt + 1

        if label != []:
            connection += [label]
            print connection

I hope somebody understands my problem, because it's driving me crazy and I'm running out of time. I just want to know what I'm doing wrong here, so that I can update my connection list on each loop without losing what I've saved before.

Thanks for your help

You can use nltk.bigrams to get your tuples without worrying about getting the boundary conditions just right. If words is a list of the words in a sentence, you get all the bigrams with

bigrams = list(nltk.bigrams(words))  # list() because nltk.bigrams returns a generator in NLTK 3
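Here is a minimal sketch of how that could fit the loop in the question, producing one list of bigrams per line. To keep it runnable without NLTK, `str.split` stands in for `nltk.word_tokenize` and `zip(words, words[1:])` stands in for `nltk.bigrams`; both substitutions are assumptions for illustration, as is the helper name `bigrams_of`. The sample lines are the ones from the question.

```python
# Sample lines from the question; in practice they would be read from corpus.txt.
lines = ["the cat with fur\n", "the dog with ball\n", "\n"]

def bigrams_of(words):
    # Pair each word with its successor (hypothetical helper, mirrors nltk.bigrams).
    return list(zip(words, words[1:]))

connection = []
for line in lines:
    words = line.split()       # stand-in for nltk.word_tokenize(line)
    pairs = bigrams_of(words)  # a fresh list for every line
    if pairs:                  # skip blank or one-word lines
        connection.append(pairs)

print(connection)
```

Because `pairs` is a new list on each iteration, entries already stored in `connection` are never touched by later lines.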

I don't have NLTK installed, but see if this works for you:

import nltk

answer = []
with open('corpus.txt', 'r') as f:
    for line in f:
        sa = nltk.word_tokenize(line)
        # Pair each token with the one that follows it; a new list is built per line.
        answer.append([(word, sa[i+1]) for i, word in enumerate(sa[:-1])])
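As for why the original loop likely ends up with [[], [], []]: `label[:] = []` empties the existing list in place rather than creating a new one, and `connection += [label]` stores a reference to that same list every time. Every entry in `connection` is therefore the same object, and clearing it on a later line (for example a blank line at the end of the file) empties all of them at once. Below is a small sketch of the effect. The function `collect` and its `fresh_list` flag are hypothetical names for illustration, and `str.split` is used in place of `nltk.word_tokenize` so it runs without NLTK.

```python
lines = ["the cat with fur\n", "the dog with ball\n", "\n"]

def collect(lines, fresh_list):
    label, connection = [], []
    for line in lines:
        words = line.split()  # stand-in for nltk.word_tokenize(line)
        if fresh_list:
            label = []        # rebind the name to a brand-new list
        else:
            label[:] = []     # clear the same list in place, as in the question
        label += list(zip(words, words[1:]))
        if label:
            connection.append(label)
    return connection

aliased = collect(lines, fresh_list=False)  # every entry is the same list object
fixed = collect(lines, fresh_list=True)     # every entry is its own list
print(aliased)
print(fixed)
```

Replacing `label[:] = []` with `label = []` in the question's code should be enough to keep the earlier entries intact.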
