
IndexError: string index out of range with Python when reading big .txt file

I'm trying to write a beginner-level program in Python, but I get the following error when reading a big .txt file:

Traceback (most recent call last):
  File "P4.py", line 58, in <module>
    maximo = diccionario.get(keyword[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
IndexError: string index out of range

With small documents the program works fine, but with the document provided in class (> 200,000 lines, ~2/3 MB) I get the error.

Here is the code I wrote:

file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
diccionario = {}

"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): #Here we check it the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
            aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
        else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])

For retrieve the information from diccionario, we have to keep in mind:

In case we have more than 1 Tag associated to a word (keyword), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:

diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
for keyword in diccionario:
    tagSugerido = diccionario.get(keyword[0]) #tagSugerido is the tag with more ocurrences for a concrete keyword
    maximo = diccionario.get(keyword[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
    if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
        suma = float(diccionario.get(keyword)[1])
        for i in range (2, len(diccionario.get(keyword))):
            suma += float(diccionario.get(keyword)[i][1])
            if (diccionario.get(keyword)[i][1] > maximo):
                tagSugerido = diccionario.get(keyword)[i][0]
                maximo = float(diccionario.get(keyword)[i][1])
        probabilidad = float(maximo/suma);
        diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})

    else:
        diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})

Finally, here is a sample of the input (imagine 200,000 lines like these):

Palabra Tag Apariciones
Jordi_Savall    NP  5
LIma    NP  3
LIma    NC  8
LIma    V   65
Participaron    V   1
Tejkowski   NP  1
Tejkowski   NC  400
Tejkowski   V   23
Iglesia_Catolica    NP  1
Feria_Internacional_del_Turismo NP  4
38,5    Num 3
concertada  Adj 7
ríspida Adj 1
8.035   Num 1
José_Luis_Barbagelata   NP  1
lunes_tres  Data    1
misionero   NC  1
457.500 Num 1
El_Goloso   NP  1
suplente    NC  7
colocada    Adj 18
Frankfurter_Allgemeine  NP  2
reducía V   2
descendieron    V   21
escuela NC  113
.56 Num 9
curativos   Adj 1
Varios  Pron    5
delincuencia    NC  48
ratito  NC  1
conservamos V   1
dirigí  V   1
CECA    NP  6
formación   NC  317
experiencias    NC  48

Going by your comments, you wrote:

create a dictionary with *a word as a key* and a List as a value

So each key in your dictionary diccionario is a word. But in your second for loop you have this:

for keyword in diccionario:
    tagSugerido = diccionario.get(keyword[0]) 
    maximo = diccionario.get(keyword[1]) 

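This means you take the first letter of the actual keyword, i.e. keyword[0] (and according to your comments the keyword is a word), and then the second letter, i.e. keyword[1], and use those single letters to look up values in the dictionary. I don't think that is what you intended. It also explains the error: if the keyword on some line is only one letter long, keyword[1] is out of range.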
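One way the lookup could be corrected is sketched below. It is written for Python 3 and assumes each word should map to a list of (tag, count) pairs; the helper names `build_index` and `summarize` and the one-letter sample word are hypothetical, not taken from the question. The point is that the index is applied to the value returned by the dictionary, never to the key string:

```python
from collections import defaultdict


def build_index(lines):
    """Map each word to a list of (tag, count) pairs."""
    index = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row and any line that does not have exactly
        # three tab-separated fields (an assumption about the file format).
        if len(fields) != 3 or fields[:2] == ["Palabra", "Tag"]:
            continue
        word, tag, count = fields
        index[word].append((tag, int(count)))
    return dict(index)


def summarize(index):
    """Map each word to [suggested_tag, total_occurrences, probability]."""
    summary = {}
    for word, entries in index.items():
        # Index into the value (entries), not into the key string itself.
        best_tag, best_count = max(entries, key=lambda entry: entry[1])
        total = sum(count for _, count in entries)
        summary[word] = [best_tag, total, best_count / total]
    return summary


sample = [
    "Palabra\tTag\tApariciones\n",
    "LIma\tNP\t3\n",
    "LIma\tNC\t8\n",
    "LIma\tV\t65\n",
    "a\tPrep\t2\n",  # hypothetical one-letter word: "a"[1] would raise IndexError
]
result = summarize(build_index(sample))
print(result["LIma"])
print(result["a"])
```

To run this on the real file in Python 3, the lines could be read with `open("lexic.txt", encoding="latin_1")`, which replaces the manual `decode`/`encode` step from the question.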
