Error generating a model reading corpus from a big .txt file

I'm trying to read the file corpus.txt (the training set) and generate a model. The output must be called lexic.txt and contain each word, its tag, and the number of occurrences. For small training sets it works, but for the training set given by the university (a 30 MB .txt file with millions of lines) the code does not work. I imagine it is an efficiency problem and the system runs out of memory. Can anybody help me with the code, please?

Here I attach my code:

from collections import Counter

file=open('corpus.txt','r')
data=file.readlines()
file.close()

palabras = []
count_list = []

for linea in data:
   linea.decode('latin_1').encode('UTF-8') # for the accented characters
   palabra_tag = linea.split('\n')
   palabras.append(palabra_tag[0])

cuenta = Counter(palabras) # dictionary counting occurrences of each word + tag

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)


outfile = open('lexic.txt', 'w') 
outfile.write('Palabra\tTag\tApariciones\n')

for i in range(len(finalList)):
    outfile.write(finalList[i][0]+'\t'+finalList[i][1]+'\n') # finalList[i][0] is the word + tag and finalList[i][1] is the number of occurrences

outfile.close()

And here you can see a sample of the corpus.txt:

Al  Prep
menos   Adv
cinco   Det
reclusos    Adj
murieron    V
en  Prep
las Det
últimas Adj
24  Num
horas   NC
en  Prep
las Det
cárceles    NC
de  Prep
Valencia    NP
y   Conj
Barcelona   NP
en  Prep
incidentes  NC
en  Prep
los Det
que Pron
su  Det

Thanks in advance!

You may be able to reduce your memory usage if you combine these two chunks of code.

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i) 

You can check whether an item already exists in the count list and, by doing so, avoid adding duplicates in the first place. This should reduce your memory usage. See below:

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if (palabras[i] == palabraTag and
                [palabras[i], str(cuenta[palabraTag])] not in count_list):
            count_list.append([palabras[i], str(cuenta[palabraTag])])
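
Note that the question already builds a `Counter` over `palabras`, and a `Counter` stores exactly one entry per unique word + tag pair together with its count. As a minimal sketch (assuming the same Python 2 environment as the question), the deduplicated list can be produced straight from `cuenta`, with no nested scan at all:

from collections import Counter

cuenta = Counter(palabras)  # one entry per unique word + tag pair, mapped to its count

# Iterate over the unique pairs directly: no quadratic loop, no duplicate-removal pass
count_list = [[palabraTag, str(veces)] for palabraTag, veces in cuenta.iteritems()]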

Finally, I improved the code using a dictionary; here is the result, working 100% fine:

file = open('corpus.txt', 'r')
data = file.readlines()
file.close()

diccionario = {}

for linea in data:
    # for the accented characters: re-encode the line and keep the text before the newline
    cadena = linea.decode('latin_1').encode('UTF-8').split('\n')[0]
    if cadena in diccionario:       # 'in' replaces the deprecated has_key()
        diccionario[cadena] += 1
    else:
        diccionario[cadena] = 1

outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')

for key, value in diccionario.iteritems():
    outfile.write(key + '\t' + str(value) + '\n')  # tab-separated, matching the header
outfile.close()
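
For reference, here is a minimal Python 3 sketch of the same idea (assuming, as the `decode('latin_1')` call above suggests, that the corpus is Latin-1 encoded). It also streams the file line by line, so the whole 30 MB corpus never has to sit in memory at once:

from collections import Counter

diccionario = Counter()

# Stream the corpus; only the counts are kept in memory
with open('corpus.txt', 'r', encoding='latin_1') as infile:
    for linea in infile:
        cadena = linea.rstrip('\n')
        if cadena:  # skip blank separator lines, if any
            diccionario[cadena] += 1

with open('lexic.txt', 'w', encoding='utf-8') as outfile:
    outfile.write('Palabra\tTag\tApariciones\n')
    for key, value in diccionario.items():
        outfile.write(key + '\t' + str(value) + '\n')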
