Append in for-loop not working for storing the token lists
In the for loop below, I'm reading .dat files from a folder, parsing each file to extract its token lists, and storing them in a list. My code works, but only for individual files. I have 1187 files, yet ud_files.append() only adds the tokens from the latest file and drops the tokens it appended in earlier iterations. So the list contains only the latest tokens, not the tokens from all 1187 files. How should I fix this?
from io import open
from conllu import parse_incr
import os
import glob
import pandas as pd
#create a dict to store the results
word_lemma_dict = {}
ud_files = []
dat_files = []
#open the files and load the sentences to a list
datfolder = "Lemma/venv/Hindi corpus 2/CoNLL/utf" #Folder where all the .dat files are stored.
datfiles = glob.glob(os.path.join(datfolder, '*.dat'))
for file in datfiles:
    data_file = open(file, "r", encoding = "utf-8")
    for tokenlist in parse_incr(data_file):
        ud_files.append(tokenlist). #Only stores tokens from the latest file. Should ideally stores tokens from all the files it read in the for loop.
Here's a sample .dat file. I have 1187 such files:
# sent_id = dev-s1
# text = रामायण काल में भगवान राम के पुत्र कुश की राजधानी कुशावती को 483 ईसा पूर्व बुद्ध ने अपने अंतिम विश्राम के लिए चुना ।
1 रामायण रामायण PROPN NNPC Case=Nom|Gender=Masc|Number=Sing|Person=3 2 compound _ Vib=0|Tam=0|ChunkId=NP|ChunkType=child|Translit=rāmāyaṇa
2 काल काल PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 23 obl _ Vib=0_में|Tam=0|ChunkId=NP|ChunkType=head|Translit=kāla
3 में में ADP PSP AdpType=Post 2 case _ ChunkId=NP|ChunkType=child|Translit=meṁ
4 भगवान भगवान NOUN NNC Case=Nom|Gender=Masc|Number=Sing|Person=3 5 compound _ Vib=0|Tam=0|ChunkId=NP2|ChunkType=child|Translit=bhagavāna
5 राम राम PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 7 nmod _ Vib=0_का|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rāma
6 के का ADP PSP AdpType=Post|Case=Acc|Gender=Masc|Number=Sing 5 case _ ChunkId=NP2|ChunkType=child|Translit=ke
7 पुत्र पुत्र NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 8 nmod _ Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=putra
8 कुश कुश PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 10 nmod _ Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=kuśa
9 की का ADP PSP AdpType=Post|Case=Acc|Gender=Fem|Number=Sing 8 case _ ChunkId=NP4|ChunkType=child|Translit=kī
10 राजधानी राजधानी NOUN NN Case=Acc|Gender=Fem|Number=Sing|Person=3 11 nmod _ Vib=0|Tam=0|ChunkId=NP5|ChunkType=head|Translit=rājadhānī
11 कुशावती कुशावती PROPN NNP Case=Acc|Gender=Fem|Number=Sing|Person=3 23 obj _ Vib=0_को|Tam=0|ChunkId=NP6|ChunkType=head|Translit=kuśāvatī
12 को को ADP PSP AdpType=Post 11 case _ ChunkId=NP6|ChunkType=child|Translit=ko
13 483 483 PROPN NNPC Case=Nom|Gender=Masc|Number=Sing|Person=3 15 compound _ Vib=0|Tam=0|ChunkId=NP7|ChunkType=child|Translit=483
14 ईसा ईसा PROPN NNPC Case=Nom|Gender=Masc|Number=Sing|Person=3 15 compound _ Vib=0|Tam=0|ChunkId=NP7|ChunkType=child|Translit=īsā
15 पूर्व पूर्व PROPN NNP Case=Nom|Gender=Masc|Number=Sing|Person=3 23 obl _ Vib=0|Tam=0|ChunkId=NP7|ChunkType=head|Translit=pūrva
16 बुद्ध बुद्ध PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 23 nsubj _ Vib=0_ने|Tam=0|ChunkId=NP8|ChunkType=head|Translit=buddha
17 ने ने ADP PSP AdpType=Post 16 case _ ChunkId=NP8|ChunkType=child|Translit=ne
18 अपने अपना PRON PRP Case=Acc|Gender=Masc|PronType=Prs 20 nmod _ Vib=0|Tam=0|ChunkId=NP9|ChunkType=head|Translit=apane
19 अंतिम अंतिम ADJ JJ Case=Acc 20 amod _ ChunkId=NP10|ChunkType=child|Translit=aṁtima
20 विश्राम विश्राम NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 23 obl _ Vib=0_के_लिए|Tam=0|ChunkId=NP10|ChunkType=head|Translit=viśrāma
21 के के ADP PSP AdpType=Post 20 case _ ChunkId=NP10|ChunkType=child|Translit=ke
22 लिए लिए ADP PSP AdpType=Post 20 case _ ChunkId=NP10|ChunkType=child|Translit=lie
23 चुना चुन VERB VM Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act 0 root _ Vib=या|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=cunā
24 । । PUNCT SYM _ 23 punct _ ChunkId=BLK|ChunkType=head|Translit=.
# sent_id = dev-s2
# text = मल्लों की राजधानी होने के कारण प्राचीनकाल में इस स्थान का अत्यंत महत्व था ।
1 मल्लों मल्ला NOUN NN Case=Acc|Gender=Masc|Number=Plur|Person=3 3 nmod _ Vib=0_का|Tam=0|ChunkId=NP|ChunkType=head|Translit=malloṁ
2 की का ADP PSP AdpType=Post|Case=Nom|Gender=Fem|Number=Sing 1 case _ ChunkId=NP|ChunkType=child|Translit=kī
3 राजधानी राजधानी NOUN NN Case=Nom|Gender=Fem|Number=Sing|Person=3 4 nsubj _ Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rājadhānī
4 होने हो VERB VM Case=Acc|Gender=Masc|VerbForm=Inf 14 advcl _ Vib=ना_के_कारण|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=hone
5 के के ADP PSP AdpType=Post|Case=Acc|Gender=Masc 4 mark _ ChunkId=VGNN|ChunkType=child|Translit=ke
6 कारण कारण ADP PSP Case=Acc|Gender=Masc 4 mark _ ChunkId=VGNN|ChunkType=child|Translit=kāraṇa
7 प्राचीनकाल प्राचीनकाल NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 14 obl _ Vib=0_में|Tam=0|ChunkId=NP3|ChunkType=head|Translit=prācīnakāla
8 में में ADP PSP AdpType=Post 7 case _ ChunkId=NP3|ChunkType=child|Translit=meṁ
9 इस यह DET DEM Case=Acc|Number=Sing|Person=3|PronType=Dem 10 det _ ChunkId=NP4|ChunkType=child|Translit=isa
10 स्थान स्थान NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 13 nmod _ Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=sthāna
11 का का ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing 10 case _ ChunkId=NP4|ChunkType=child|Translit=kā
12 अत्यंत अत्यंत ADJ JJ Case=Nom 13 amod _ ChunkId=NP5|ChunkType=child|Translit=atyaṁta
13 महत्व महत्व NOUN NN Case=Nom|Gender=Masc|Number=Sing|Person=3 14 nsubj _ Vib=0|Tam=0|ChunkId=NP5|ChunkType=head|Translit=mahatva
14 था था VERB VM Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act 0 root _ Vib=था|Tam=WA|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=thā
15 । । PUNCT SYM _ 14 punct _ ChunkId=BLK|ChunkType=head|Translit=.
# sent_id = dev-s3
# text = बौद्ध धर्मावलंबियों के अनुसार लुंबनी, बोधगया और सारनाथ के साथ ही इस स्थान का विशद् महत्व है ।
1 बौद्ध बौद्ध PROPN NNP Case=Nom|Gender=Masc|Number=Sing|Person=3 2 nmod _ Vib=0|Tam=0|ChunkId=NP|ChunkType=child|Translit=bauddha
2 धर्मावलंबियों धर्मावलंबी NOUN NN Case=Acc|Gender=Masc|Number=Plur|Person=3 17 nmod _ Vib=0_के_अनुसार|Tam=0|ChunkId=NP|ChunkType=head|Translit=dharmāvalaṁbiyoṁ
3 के के ADP PSP AdpType=Post 2 case _ ChunkId=NP|ChunkType=child|Translit=ke
4 अनुसार अनुसार ADP PSP AdpType=Post 2 case _ ChunkId=NP|ChunkType=child|Translit=anusāra
5 लुंबनी लुंबनी PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 17 nmod _ SpaceAfter=No|Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=luṁbanī
6 , COMMA PUNCT SYM _ 7 punct _ ChunkId=NP2|ChunkType=child|Translit=,
7 बोधगया बोधगया PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 5 conj _ Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=bodhagayā
8 और और CCONJ CC _ 9 cc _ ChunkId=CCP|ChunkType=head|Translit=aura
9 सारनाथ सारनाथ PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 5 conj _ Vib=0_के_साथ|Tam=0|ChunkId=NP4|ChunkType=head|Translit=sāranātha
10 के के ADP PSP AdpType=Post 9 case _ ChunkId=NP4|ChunkType=child|Translit=ke
11 साथ साथ ADP NST AdpType=Post|Case=Nom|Gender=Masc|Number=Sing|Person=3 9 case _ AltTag=ADP-NOUN|ChunkId=NP4|ChunkType=child|Translit=sātha
12 ही ही PART RP _ 9 dep _ ChunkId=NP4|ChunkType=child|Translit=hī
13 इस यह DET DEM Case=Acc|Number=Sing|Person=3|PronType=Dem 14 det _ ChunkId=NP5|ChunkType=child|Translit=isa
14 स्थान स्थान NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 17 nmod _ Vib=0_का|Tam=0|ChunkId=NP5|ChunkType=head|Translit=sthāna
15 का का ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing 14 case _ ChunkId=NP5|ChunkType=child|Translit=kā
16 विशद् विशद् ADJ JJ Case=Nom 17 amod _ ChunkId=NP6|ChunkType=child|Translit=viśad
17 महत्व महत्व NOUN NN Case=Nom|Gender=Masc|Number=Sing|Person=3 0 root _ Vib=0|Tam=0|ChunkId=NP6|ChunkType=head|Translit=mahatva
18 है है AUX VM Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 17 cop _ Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
19 । । PUNCT SYM _ 17 punct _ ChunkId=BLK|ChunkType=head|Translit=.
Use the debugger and watch your datfiles variable. Are all the file paths really in there? glob.glob does not search recursively by default unless you explicitly ask it to. You may want to give this a shot:
datfiles = glob.glob(os.path.join(datfolder, '**/*.dat'), recursive=True)
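To see the difference between the two patterns without touching the real corpus, here is a minimal sketch that builds a throwaway folder with one .dat file at the top level and one in a subfolder. The helper name find_dat_files and the file names are made up for illustration; only glob.glob and its recursive flag come from the standard library.

```python
import glob
import os
import tempfile

def find_dat_files(folder, recursive=False):
    """Return sorted paths of .dat files in folder, optionally including subfolders."""
    if recursive:
        # '**' matches zero or more directories, but only when recursive=True.
        pattern = os.path.join(folder, "**", "*.dat")
    else:
        pattern = os.path.join(folder, "*.dat")
    return sorted(glob.glob(pattern, recursive=recursive))

# Build a throwaway tree: a.dat at the top level, b.dat inside sub/.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.dat", os.path.join("sub", "b.dat")):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as fh:
            fh.write("")
    flat = [os.path.relpath(p, root) for p in find_dat_files(root)]
    deep = [os.path.relpath(p, root) for p in find_dat_files(root, recursive=True)]

print(flat)  # only the top-level file
print(deep)  # the top-level file and the one in sub/
```

If len(datfiles) in your own script is smaller than the number of files you expect, the pattern (or the folder path) is the first thing to check.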
I set up a sample with only two text files in a test dir, and I got it to work. I'd recommend starting over with a new venv and putting your Python script and 2 test files beside it. Then run your code. It should work; mine did.
Just a note: check your indentation and the '.' on the last line (before the comment).
tst.txt:
1 sifasf ncadasfdv
2 asfdias askfnhoas
tst1.txt:
1 ddsds
2 asfdgasfg
3 asgas
The script:
#! /path/to/your/venv/python/interpreter
from io import open
from conllu import parse_incr
import os
import glob
#create a dict to store the results
word_lemma_dict = {}
ud_files = []
dat_files = []
#open the files and load the sentences to a list
datfolder = "./" #Folder where all the .txt files are stored.
datfiles = glob.glob(os.path.join(datfolder, '*.txt'))
print(datfiles)
for file in datfiles:
    data_file = open(file, "r", encoding = "utf-8")
    for tokenlist in parse_incr(data_file):
        ud_files.append(tokenlist)
print(ud_files)
And the output:
['./tst1.txt', './tst.txt']
[TokenList<ddsds, asfdgasfg, asgas>, TokenList<sifasf, asfdias>]
I bet you can add more files and it will still work...
I am guessing the issue is either in the path / join, or in the grammar the conllu parser expects.
You might post some contents of your different *.dat files so they can be parsed to your expectation.
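One way to narrow down which of those guesses applies is to record how many items each file contributes. The sketch below is only an illustration: collect_with_counts and split_on_blank_lines are hypothetical names, and the blank-line splitter is a stand-in parser so the example runs without conllu installed.

```python
import glob
import os

def collect_with_counts(folder, pattern, parse_file):
    """Parse every matching file; return (all_items, {path: number of items from that file})."""
    all_items = []
    counts = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # 'with' closes each file before the next one is opened.
        with open(path, "r", encoding="utf-8") as fh:
            items = list(parse_file(fh))
        counts[path] = len(items)
        all_items.extend(items)
    return all_items, counts

def split_on_blank_lines(fh):
    """Hypothetical stand-in parser: yields one list of lines per blank-line-separated block."""
    block = []
    for line in fh:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block
```

Assuming conllu is installed, passing parse_incr as the third argument, for example collect_with_counts(datfolder, "*.dat", parse_incr), should give the same kind of report for the real corpus. A file with a count of 0 points at a format problem in that file, while a missing path points at the glob pattern or folder.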