
Lists in Python (Using NLTK)

I'm trying to make a list of lists in the form of [[(the, cat), (cat, with), (with, fur)] [(the, dog), (dog, with), (with, ball).......etc] from a text file with the sentences in lines like: 我正在尝试以[[(the,cat),(cat,with),(with,fur)] [(the,dog),(dog,with),(with, ball)....... etc]来自一个文本文件,句子如下:

the cat with fur \n the dog with ball \n

The problem I've been having is that somehow, while I'm reading the lines in the file word by word, making the tuples (variable label) and creating the final list (variable connection), there are empty instances where connection seems to lose everything. Well, not actually 0, but the list shows up like [[], [], []].

This is the code for that part of the program:

with open('corpus.txt', 'r') as f:
    for line in f:
        cnt = 0
        sa = nltk.word_tokenize(line)
        label[:] = []

        for i in sa:
            words.append(i)
            if cnt>0:
                try: label +=[(prev , i)]
                except: NameError
            prev = i 
            cnt = cnt + 1

        if label != []:
            connection += [label]
            print connection

I hope somebody understands my problem because it's driving me crazy and I'm running out of time. I just want to know what I'm doing wrong here so I can update my connection list in each loop without losing what I've saved before.

Thanks for your help

You can use nltk.bigrams to get your tuples without worrying about getting the boundary conditions just right. If words is a list of the words in a sentence, you get all the bigrams with

bigrams = nltk.bigrams(words)
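Note that in NLTK 3 nltk.bigrams returns a generator rather than a list, so wrap it in list() if you want to print or index the result. A minimal sketch, assuming NLTK and its punkt tokenizer data are installed:

import nltk

words = nltk.word_tokenize("the cat with fur")  # ['the', 'cat', 'with', 'fur']
bigrams = list(nltk.bigrams(words))             # materialize the generator
print(bigrams)                                  # [('the', 'cat'), ('cat', 'with'), ('with', 'fur')]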

I don't have NLTK installed, but see if this works for you:

import nltk

with open('corpus.txt', 'r') as f:
    answer = []
    for line in f:
        sa = nltk.word_tokenize(line)
        # pair each word with the one that follows it:
        # (sa[0], sa[1]), (sa[1], sa[2]), ..., (sa[-2], sa[-1])
        answer.append([(word, sa[i + 1]) for i, word in enumerate(sa[:-1])])
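Putting the two ideas together, here's a sketch of the whole loop using nltk.bigrams on each line (assuming corpus.txt has one sentence per line; the if words check skips blank lines so no empty sub-lists show up). Because a fresh list is built for every line, nothing already appended to connection gets cleared later, which is what label[:] = [] was doing to the previously saved entries in the original code:

import nltk

connection = []
with open('corpus.txt', 'r') as f:
    for line in f:
        words = nltk.word_tokenize(line)
        if words:  # skip blank lines so connection never collects empty sub-lists
            connection.append(list(nltk.bigrams(words)))

print(connection)
# For the sample corpus this should print:
# [[('the', 'cat'), ('cat', 'with'), ('with', 'fur')],
#  [('the', 'dog'), ('dog', 'with'), ('with', 'ball')]]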
