Append in for-loop not working for storing the token lists

In the for loop below, I read .dat files from a folder, parse each file to extract its token lists, and store them in a single list. This works for an individual file. However, I have 1187 files, and ud_files.append() seems to keep only the tokens from the most recent file and to discard what it appended in earlier iterations. As a result, the list contains only the latest tokens instead of the tokens from all 1187 files. How should I fix this?

from io import open
from conllu import parse_incr
import os
import glob
import pandas as pd

#create a dict to store the results
word_lemma_dict = {}
ud_files = []
dat_files = []

#open the files and load the sentences to a list

datfolder = "Lemma/venv/Hindi corpus 2/CoNLL/utf" #Folder where all the .dat files are stored.

datfiles = glob.glob(os.path.join(datfolder, '*.dat'))

for file in datfiles:
    data_file = open(file, "r", encoding = "utf-8")
    for tokenlist in parse_incr(data_file):
         ud_files.append(tokenlist). #Only stores tokens from the latest file. Ideally it should store tokens from all the files read in the for loop.

Here is a sample .dat file. I have 1187 such files:

# sent_id = dev-s1
# text = रामायण काल में भगवान राम के पुत्र कुश की राजधानी कुशावती को 483 ईसा पूर्व बुद्ध ने अपने अंतिम विश्राम के लिए चुना ।
1   रामायण  रामायण  PROPN   NNPC    Case=Nom|Gender=Masc|Number=Sing|Person=3   2   compound    _   Vib=0|Tam=0|ChunkId=NP|ChunkType=child|Translit=rāmāyaṇa
2   काल काल PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   23  obl _   Vib=0_में|Tam=0|ChunkId=NP|ChunkType=head|Translit=kāla
3   में में ADP PSP AdpType=Post    2   case    _   ChunkId=NP|ChunkType=child|Translit=meṁ
4   भगवान   भगवान   NOUN    NNC Case=Nom|Gender=Masc|Number=Sing|Person=3   5   compound    _   Vib=0|Tam=0|ChunkId=NP2|ChunkType=child|Translit=bhagavāna
5   राम राम PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   7   nmod    _   Vib=0_का|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rāma
6   के  का  ADP PSP AdpType=Post|Case=Acc|Gender=Masc|Number=Sing   5   case    _   ChunkId=NP2|ChunkType=child|Translit=ke
7   पुत्र   पुत्र   NOUN    NN  Case=Acc|Gender=Masc|Number=Sing|Person=3   8   nmod    _   Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=putra
8   कुश कुश PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   10  nmod    _   Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=kuśa
9   की  का  ADP PSP AdpType=Post|Case=Acc|Gender=Fem|Number=Sing    8   case    _   ChunkId=NP4|ChunkType=child|Translit=kī
10  राजधानी राजधानी NOUN    NN  Case=Acc|Gender=Fem|Number=Sing|Person=3    11  nmod    _   Vib=0|Tam=0|ChunkId=NP5|ChunkType=head|Translit=rājadhānī
11  कुशावती कुशावती PROPN   NNP Case=Acc|Gender=Fem|Number=Sing|Person=3    23  obj _   Vib=0_को|Tam=0|ChunkId=NP6|ChunkType=head|Translit=kuśāvatī
12  को  को  ADP PSP AdpType=Post    11  case    _   ChunkId=NP6|ChunkType=child|Translit=ko
13  483 483 PROPN   NNPC    Case=Nom|Gender=Masc|Number=Sing|Person=3   15  compound    _   Vib=0|Tam=0|ChunkId=NP7|ChunkType=child|Translit=483
14  ईसा ईसा PROPN   NNPC    Case=Nom|Gender=Masc|Number=Sing|Person=3   15  compound    _   Vib=0|Tam=0|ChunkId=NP7|ChunkType=child|Translit=īsā
15  पूर्व   पूर्व   PROPN   NNP Case=Nom|Gender=Masc|Number=Sing|Person=3   23  obl _   Vib=0|Tam=0|ChunkId=NP7|ChunkType=head|Translit=pūrva
16  बुद्ध   बुद्ध   PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   23  nsubj   _   Vib=0_ने|Tam=0|ChunkId=NP8|ChunkType=head|Translit=buddha
17  ने  ने  ADP PSP AdpType=Post    16  case    _   ChunkId=NP8|ChunkType=child|Translit=ne
18  अपने    अपना    PRON    PRP Case=Acc|Gender=Masc|PronType=Prs   20  nmod    _   Vib=0|Tam=0|ChunkId=NP9|ChunkType=head|Translit=apane
19  अंतिम   अंतिम   ADJ JJ  Case=Acc    20  amod    _   ChunkId=NP10|ChunkType=child|Translit=aṁtima
20  विश्राम विश्राम NOUN    NN  Case=Acc|Gender=Masc|Number=Sing|Person=3   23  obl _   Vib=0_के_लिए|Tam=0|ChunkId=NP10|ChunkType=head|Translit=viśrāma
21  के  के  ADP PSP AdpType=Post    20  case    _   ChunkId=NP10|ChunkType=child|Translit=ke
22  लिए लिए ADP PSP AdpType=Post    20  case    _   ChunkId=NP10|ChunkType=child|Translit=lie
23  चुना    चुन VERB    VM  Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act 0   root    _   Vib=या|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=cunā
24  ।   ।   PUNCT   SYM _   23  punct   _   ChunkId=BLK|ChunkType=head|Translit=.

# sent_id = dev-s2
# text = मल्‍लों की राजधानी होने के कारण प्राचीनकाल में इस स्‍थान का अत्‍यंत महत्‍व था ।
1   मल्‍लों मल्ला   NOUN    NN  Case=Acc|Gender=Masc|Number=Plur|Person=3   3   nmod    _   Vib=0_का|Tam=0|ChunkId=NP|ChunkType=head|Translit=malloṁ
2   की  का  ADP PSP AdpType=Post|Case=Nom|Gender=Fem|Number=Sing    1   case    _   ChunkId=NP|ChunkType=child|Translit=kī
3   राजधानी राजधानी NOUN    NN  Case=Nom|Gender=Fem|Number=Sing|Person=3    4   nsubj   _   Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rājadhānī
4   होने    हो  VERB    VM  Case=Acc|Gender=Masc|VerbForm=Inf   14  advcl   _   Vib=ना_के_कारण|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=hone
5   के  के  ADP PSP AdpType=Post|Case=Acc|Gender=Masc   4   mark    _   ChunkId=VGNN|ChunkType=child|Translit=ke
6   कारण    कारण    ADP PSP Case=Acc|Gender=Masc    4   mark    _   ChunkId=VGNN|ChunkType=child|Translit=kāraṇa
7   प्राचीनकाल  प्राचीनकाल  NOUN    NN  Case=Acc|Gender=Masc|Number=Sing|Person=3   14  obl _   Vib=0_में|Tam=0|ChunkId=NP3|ChunkType=head|Translit=prācīnakāla
8   में में ADP PSP AdpType=Post    7   case    _   ChunkId=NP3|ChunkType=child|Translit=meṁ
9   इस  यह  DET DEM Case=Acc|Number=Sing|Person=3|PronType=Dem  10  det _   ChunkId=NP4|ChunkType=child|Translit=isa
10  स्‍थान  स्थान   NOUN    NN  Case=Acc|Gender=Masc|Number=Sing|Person=3   13  nmod    _   Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=sthāna
11  का  का  ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   10  case    _   ChunkId=NP4|ChunkType=child|Translit=kā
12  अत्‍यंत अत्यंत  ADJ JJ  Case=Nom    13  amod    _   ChunkId=NP5|ChunkType=child|Translit=atyaṁta
13  महत्‍व  महत्व   NOUN    NN  Case=Nom|Gender=Masc|Number=Sing|Person=3   14  nsubj   _   Vib=0|Tam=0|ChunkId=NP5|ChunkType=head|Translit=mahatva
14  था  था  VERB    VM  Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act  0   root    _   Vib=था|Tam=WA|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=thā
15  ।   ।   PUNCT   SYM _   14  punct   _   ChunkId=BLK|ChunkType=head|Translit=.

# sent_id = dev-s3
# text = बौद्ध धर्मावलंबियों के अनुसार लुंबनी, बोधगया और सारनाथ के साथ ही इस स्‍थान का विशद् महत्‍व है ।
1   बौद्ध   बौद्ध   PROPN   NNP Case=Nom|Gender=Masc|Number=Sing|Person=3   2   nmod    _   Vib=0|Tam=0|ChunkId=NP|ChunkType=child|Translit=bauddha
2   धर्मावलंबियों   धर्मावलंबी  NOUN    NN  Case=Acc|Gender=Masc|Number=Plur|Person=3   17  nmod    _   Vib=0_के_अनुसार|Tam=0|ChunkId=NP|ChunkType=head|Translit=dharmāvalaṁbiyoṁ
3   के  के  ADP PSP AdpType=Post    2   case    _   ChunkId=NP|ChunkType=child|Translit=ke
4   अनुसार  अनुसार  ADP PSP AdpType=Post    2   case    _   ChunkId=NP|ChunkType=child|Translit=anusāra
5   लुंबनी  लुंबनी  PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   17  nmod    _   SpaceAfter=No|Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=luṁbanī
6   ,   COMMA   PUNCT   SYM _   7   punct   _   ChunkId=NP2|ChunkType=child|Translit=,
7   बोधगया  बोधगया  PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   5   conj    _   Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=bodhagayā
8   और  और  CCONJ   CC  _   9   cc  _   ChunkId=CCP|ChunkType=head|Translit=aura
9   सारनाथ  सारनाथ  PROPN   NNP Case=Acc|Gender=Masc|Number=Sing|Person=3   5   conj    _   Vib=0_के_साथ|Tam=0|ChunkId=NP4|ChunkType=head|Translit=sāranātha
10  के  के  ADP PSP AdpType=Post    9   case    _   ChunkId=NP4|ChunkType=child|Translit=ke
11  साथ साथ ADP NST AdpType=Post|Case=Nom|Gender=Masc|Number=Sing|Person=3  9   case    _   AltTag=ADP-NOUN|ChunkId=NP4|ChunkType=child|Translit=sātha
12  ही  ही  PART    RP  _   9   dep _   ChunkId=NP4|ChunkType=child|Translit=hī
13  इस  यह  DET DEM Case=Acc|Number=Sing|Person=3|PronType=Dem  14  det _   ChunkId=NP5|ChunkType=child|Translit=isa
14  स्‍थान  स्थान   NOUN    NN  Case=Acc|Gender=Masc|Number=Sing|Person=3   17  nmod    _   Vib=0_का|Tam=0|ChunkId=NP5|ChunkType=head|Translit=sthāna
15  का  का  ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   14  case    _   ChunkId=NP5|ChunkType=child|Translit=kā
16  विशद्   विशद्   ADJ JJ  Case=Nom    17  amod    _   ChunkId=NP6|ChunkType=child|Translit=viśad
17  महत्‍व  महत्व   NOUN    NN  Case=Nom|Gender=Masc|Number=Sing|Person=3   0   root    _   Vib=0|Tam=0|ChunkId=NP6|ChunkType=head|Translit=mahatva
18  है  है  AUX VM  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 17  cop _   Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
19  ।   ।   PUNCT   SYM _   17  punct   _   ChunkId=BLK|ChunkType=head|Translit=.

Use a debugger and watch your datfiles variable. Does it really contain all the file paths? glob.glob does not search recursively unless you explicitly ask it to. You may want to give this a shot:

datfiles = glob.glob(os.path.join(datfolder, '**/*.dat'), recursive=True)
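The difference between the two patterns can be checked on a throwaway directory. This is only a minimal sketch: the folder layout and the file names a.dat and sub/b.dat are made up for illustration, and the only real APIs used are glob.glob (with its recursive flag), os and tempfile from the standard library.

```python
import glob
import os
import tempfile

# Hypothetical layout: one .dat file at the top level, one in a subfolder.
with tempfile.TemporaryDirectory() as datfolder:
    os.makedirs(os.path.join(datfolder, "sub"))
    for rel in ("a.dat", os.path.join("sub", "b.dat")):
        with open(os.path.join(datfolder, rel), "w", encoding="utf-8") as fh:
            fh.write("")

    # Plain pattern: only files directly inside datfolder.
    flat = glob.glob(os.path.join(datfolder, "*.dat"))
    # '**' with recursive=True: datfolder itself plus every subfolder.
    deep = glob.glob(os.path.join(datfolder, "**", "*.dat"), recursive=True)

    flat_names = sorted(os.path.basename(p) for p in flat)
    deep_names = sorted(os.path.basename(p) for p in deep)

print(flat_names)
print(deep_names)
```

If the 1187 files are spread over subfolders, len(datfiles) with the plain pattern would be much smaller than expected, which is easy to confirm with a single print.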

I set up a sample with only two text files in a test directory, and I got it to work. I'd recommend starting over with a new venv, with your Python script and two test files placed alongside it. Then run your code. It should work; mine did.

Just a note: check your indentation and the '.' on the last line (before the comment).
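To illustrate the point about the '.': as far as I can tell, a statement that ends in a bare '.' before the comment is not valid Python, so the snippet as posted would fail before the loop ever runs. The sketch below checks this with ast.parse from the standard library; the helper name parses and the shortened comment text are made up for the example.

```python
import ast

# The last line of the loop body as posted, with the stray '.' after the call.
with_dot = "ud_files.append(tokenlist). #comment\n"
# The same line without the '.'.
without_dot = "ud_files.append(tokenlist) #comment\n"


def parses(source):
    """Return True if source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(parses(with_dot), parses(without_dot))
```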

tst.txt:

1  sifasf  ncadasfdv
2  asfdias  askfnhoas

tst1.txt:

1  ddsds
2  asfdgasfg
3  asgas

The script:

#! /path/to/your/venv/python/interpreter
from io import open
from conllu import parse_incr
import os
import glob

#create a dict to store the results
word_lemma_dict = {}
ud_files = []
dat_files = []

#open the files and load the sentences to a list

datfolder = "./" #Folder where all the .txt files are stored.

datfiles = glob.glob(os.path.join(datfolder, '*.txt'))
print(datfiles)

for file in datfiles:
    # the with block closes each file once it has been parsed
    with open(file, "r", encoding="utf-8") as data_file:
        for tokenlist in parse_incr(data_file):
            ud_files.append(tokenlist)

print(ud_files)

And the output:

['./tst1.txt', './tst.txt']
[TokenList<ddsds, asfdgasfg, asgas>, TokenList<sifasf, asfdias>]

I bet you can add more files and it will still work.

My guess is that the issue is either in the path / join step or in the format of the files being fed to the conllu parser.

You might post the contents of a few of your different *.dat files, along with what you expect them to be parsed into.
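One way to narrow this down is to record how many sentences each file contributes while accumulating them. The sketch below is an assumption-laden stand-in, not the conllu API: split_blocks is a made-up parser that simply yields blank-line-separated blocks, collect is a hypothetical helper, and the two sample files are invented.

```python
import os
import tempfile


def split_blocks(handle):
    """Stand-in parser (assumption): yield blank-line-separated blocks of lines."""
    block = []
    for line in handle:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:
        yield block


def collect(paths, parse=split_blocks):
    """Accumulate parsed sentences from every path and count them per file."""
    sentences = []
    per_file = {}
    for path in paths:
        before = len(sentences)
        with open(path, "r", encoding="utf-8") as handle:
            sentences.extend(parse(handle))
        per_file[path] = len(sentences) - before
    return sentences, per_file


# Two throwaway files standing in for the real .dat files.
with tempfile.TemporaryDirectory() as folder:
    contents = {"one.dat": "1\ta\n2\tb\n\n1\tc\n", "two.dat": "1\td\n"}
    paths = []
    for name, text in contents.items():
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    sentences, per_file = collect(paths)
    counts = {os.path.basename(p): n for p, n in per_file.items()}

print(len(sentences), counts)
```

With the real corpus, passing parse=parse_incr (from conllu, as in the question) should give the actual per-file counts. A file that reports 0 would be one the parser could not read in the expected format, and a short per_file dictionary would point back at the glob pattern.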
