
How to split parallel corpora while keeping alignment?

Original post: 2019-11-13 15:15:20 6 1 python/ pandas/ unix/ scikit-learn/ dataset

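I have two text files containing parallel text in two languages (potentially millions of lines). I am trying to generate random train/validation/test files from them, the way train_test_split does in sklearn. However, when I try to import them into pandas with read_csv, I get errors on many lines because of bad data in them, and fixing the broken lines would be too much work. If I set error_bad_lines=False, it skips some lines in one file but possibly not in the other, which breaks the alignment. If I split the files manually with the unix split command, it works fine for my needs, so I am not concerned about cleaning the data, but the resulting sets are not random.
How should I go about splitting this dataset into train/validation/test sets?
I am using Python, but I can also use Linux commands if that is easier.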

1 answer

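I found that I can use the shuf command on the files with the --random-source option, e.g. shuf tgt-full.txt -o tgt-fullshuf.txt --random-source=tgt-full.txt. Running the same command on the other language's file with the same --random-source file should produce the same permutation, as long as both files have the same number of lines, so the two shuffled files stay aligned. The unix split command can then cut them into train/validation/test files.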
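For readers who would rather stay in Python, the same idea (one shared permutation applied to both files) can be sketched without pandas. This is a minimal sketch and not the answerer's method. The function names, the split fractions, the seed, and the .train/.valid/.test output suffixes are all assumptions made for illustration. It also assumes both files fit in memory.

```python
import random


def split_parallel(src_lines, tgt_lines, valid_frac=0.1, test_frac=0.1, seed=0):
    """Split two aligned lists of lines into train/valid/test with one shared shuffle."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("line counts differ, so the inputs are not aligned")
    # One permutation of line indices, applied to both languages.
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_frac)
    n_valid = int(len(order) * valid_frac)
    index_sets = {
        "test": order[:n_test],
        "valid": order[n_test:n_test + n_valid],
        "train": order[n_test + n_valid:],
    }
    return {
        name: ([src_lines[i] for i in idx], [tgt_lines[i] for i in idx])
        for name, idx in index_sets.items()
    }


def read_lines(path):
    # Binary mode: no decoding and no CSV parsing, so malformed rows cannot
    # be dropped; lines are split on b"\n" only, as the unix tools do.
    with open(path, "rb") as f:
        return [line if line.endswith(b"\n") else line + b"\n" for line in f]


def split_parallel_files(src_path, tgt_path, **kwargs):
    """Write <path>.train, <path>.valid and <path>.test next to each input file."""
    parts = split_parallel(read_lines(src_path), read_lines(tgt_path), **kwargs)
    for name, (src_part, tgt_part) in parts.items():
        for path, lines in ((src_path, src_part), (tgt_path, tgt_part)):
            with open(f"{path}.{name}", "wb") as out:
                out.writelines(lines)
```

A call such as split_parallel_files("src-full.txt", "tgt-full.txt", seed=42) would then write six files (src-full.txt is a hypothetical name for the other language's file). Because lines are never parsed, a malformed row is carried along as-is instead of being skipped in only one of the two files.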



