Python: split a tab-separated bilingual txt file into two separate txt files (lists), with newlines separating the strings

I have a bilingual corpus (EN-JP) from Tatoeba and want to split it into two separate files. Each sentence and its translation have to stay on the same line number in their respective files.

I need this for training an NMT model in nmt-keras, where the training data has to be stored in a separate file for each language. I tried several approaches, but since I'm an absolute beginner with Python and coding in general, I feel like I'm running in circles.

So far the best I managed was the following:

Source txt:

Go. 行け。
Go. 行きなさい。
Hi. やっほー。
Hi. こんにちは!

Code:

with open('jpns.txt', encoding="utf8") as f:
    columns = zip(*(l.split("\t") for l in f))

list1= list(columns)
print(list1)

Result with my code:

[('Go.', 'Go.', 'Hi.', 'Hi.'), ('行け。\n', '行きなさい。\n', 'やっほー。\n', 'こんにちは!')]

English and Japanese get properly separated (into a tuple?), but I'm stuck figuring out how to export only the English to output.en and only the Japanese to output.jp.

Expected result:

output.en

Go.
Go.
Hi.
Hi.

output.jp

行け。
行きなさい。
やっほー。
こんにちは!

Each output string should be followed by a \n.

Please keep in mind that I'm a total beginner with coding, so I'm not exactly sure what I did after "zip", as I just found this here on Stack Overflow. I'd be really grateful for a fully commented suggestion.

The first thing to be aware of is that iterating over a file retains the newlines. That means that of your two columns, the first has no newlines, while the second already has a newline at the end of each string (except possibly the last).
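A quick way to see this behaviour, sketched with an in-memory file (io.StringIO stands in for jpns.txt here, and the two sample lines are an assumption for illustration):

```python
import io

# io.StringIO stands in for the real file; the text mimics two lines of jpns.txt.
sample = io.StringIO("Go.\t行け。\nHi.\tこんにちは!")

# Iterating over a file object yields each line with its trailing "\n" intact,
# so after splitting on the tab the newline ends up in the last field.
rows = [line.split("\t") for line in sample]

print(rows)
```

Only the Japanese field of the first row keeps its "\n"; the final line of the sample has none because the text does not end with a newline.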

Writing the second column is therefore trivial if you've already unpacked the generator columns into a list (list1 in your code):

# list1[-1] is the last column (Japanese); its strings already end in '\n'
with open('output.jp', 'w', encoding='utf8') as f:
    f.writelines(list1[-1])

But you still have to append newlines to the first column (and possibly to others if you go full-on multilingual). One way would be to append newlines to all the columns but the last. Another would be to strip the newlines from the last column and process all of the columns the same way.

You can achieve the result you want with a small loop and another call to zip:

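langs = ('en', 'jp')
# Pair each language code with its column; index tells us which column we are on
for index, (lang, data) in enumerate(zip(langs, columns)):
    with open('output.' + lang, 'w', encoding='utf8') as f:
        # Only the last column still carries the newlines read from the file
        if index < len(langs) - 1:
            data = (line + '\n' for line in data)
        f.writelines(data)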

This approach replaces the tuple data with a generator that appends newlines, unless we are at the last column.
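The alternative mentioned earlier, removing the newlines up front so that every column is processed the same way, could be sketched as follows. The in-memory lines and the temporary output directory are assumptions made to keep the example self-contained; with the real data you would iterate over the open jpns.txt instead.

```python
import os
import tempfile

# Stand-in for the lines read from jpns.txt (assumed sample data).
lines = ["Go.\t行け。\n", "Hi.\tこんにちは!"]

# Strip the trailing newline from each line *before* splitting on the tab,
# so no column carries a newline and all columns can be handled alike.
columns = zip(*(line.rstrip("\n").split("\t") for line in lines))

langs = ("en", "jp")
out_dir = tempfile.mkdtemp()  # hypothetical output location for this demo

for lang, data in zip(langs, columns):
    path = os.path.join(out_dir, "output." + lang)
    with open(path, "w", encoding="utf8") as f:
        # Append "\n" to every string, including the last one in the column.
        f.writelines(text + "\n" for text in data)
```

A side effect of this variant is that the last line of each output file always ends with a newline, even when the source file does not.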

There are a couple of ways to insert newlines between the lines in the output files. The one I show uses a lazy generator to append a newline to each line individually, which should save a little memory. If you don't care about memory savings, you can output the whole file as a single string:

joiner = '\n' if index < len(langs) - 1 else ''
# join() only puts the separator between items, so add it once more at the end
f.write(joiner.join(data) + joiner)

You can even write the loop yourself and print to the file:

for line in data:
    print(line, file=f, end='\n' if index < len(langs) - 1 else '')

Addendum

Let's also look at the line columns = zip(*(l.split("\t") for l in f)) in detail, since it is a very common Python idiom for transposing nested lists, and is the key to getting the result you want.

The generator expression l.split("\t") for l in f is pretty straightforward: it splits each line in the file around tabs, giving you two elements, one in English and one in Japanese. Adding a * in front of the generator expands it so that each two-element row becomes a separate argument to zip. The expansion reads the whole file right away, which is why the call has to happen while the file is still open. zip then re-combines the respective elements of each row, so you get a column of the English elements and a column of the Japanese elements, effectively transposing your original "matrix".
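The transposition can be tried out on its own, without a file. This is a minimal sketch using two assumed rows that have already been split on the tab:

```python
# Two rows as they would look after split("\t") (newlines left out for clarity).
rows = [["Go.", "行け。"], ["Hi.", "こんにちは!"]]

# zip(*rows) is the same call as zip(rows[0], rows[1]):
# each row becomes one positional argument.
columns = list(zip(*rows))
same_thing = list(zip(rows[0], rows[1]))

print(columns)
```

The first tuple collects the first element of every row (English), the second tuple collects the second element of every row (Japanese).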

The result is that columns is an iterator over the columns (a zip object). You can convert it to a list, but that is only necessary for viewing or for indexing, as in list1[-1]. The iterator works fine for the loop shown above, as long as it has not been consumed yet.
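One caveat is that a zip object can be iterated only once, so calling list(columns) just to look at the data leaves nothing for a later loop over columns. A small sketch with assumed in-memory rows:

```python
rows = [["Go.", "行け。"], ["Hi.", "こんにちは!"]]

columns = zip(*rows)

# The first pass drains the iterator...
first_pass = list(columns)
# ...so a second pass over the same object finds nothing left.
second_pass = list(columns)

print(first_pass)
print(second_pass)
```

If you need both to inspect the columns and to write them out, keep the list (as in list1) and loop over that instead of the zip object.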
