How to split a TSV file based on the number of rows

I need to split a TSV file with 400,000 rows into 4 files with 100,000 rows each.

My sample code:

csvfile = open('./world_formatted.tsv', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 100000 == 0:
        open(str(filename) + '.tsv', 'w+').writelines(csvfile[i:i+100000])
        filename += 1

I am getting this error:

'charmap' codec can't decode byte 0x8d in position 7316: character maps to <undefined>

You might try calling open with the encoding= named parameter, so that Python knows which encoding to use when reading the file.
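To see why the parameter matters: when encoding= is omitted, open falls back on a platform-dependent default, which on many Windows setups is cp1252. The following is a minimal sketch that reproduces the failure on a single byte. The byte value is taken from the error message above; using latin-1 as the alternative is only an illustration, not a claim about what the file actually contains.

```python
import locale

# The default that text-mode open() typically uses when encoding= is omitted
print(locale.getpreferredencoding(False))

raw = b'\x8d'  # the byte reported in the error message

try:
    raw.decode('cp1252')
    decoded_ok = True
except UnicodeDecodeError as exc:
    # cp1252 has no character assigned to 0x8d
    decoded_ok = False
    print(exc)

# latin-1 maps every byte to a code point, so this never raises
as_latin1 = raw.decode('latin-1')
```

Note that a decode that does not raise only shows the bytes are valid in that encoding, not that the encoding is the right one.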

Without knowing the encoding (it looks like a Windows-CP1252 file according to the hex code, but I might be wrong) you're basically out of luck. On *nix or macOS you can use the file command, which tries to make an educated guess of the encoding.
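If the file command is not available, a rough substitute is to try a few candidate encodings and see which ones decode the file without error. This is only a sketch: guess_encoding and its candidate list are hypothetical, and because latin-1 accepts every byte it will always "succeed" when it is reached, so it belongs at the end of the list and its result should be checked by eye.

```python
def guess_encoding(path, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate that decodes the whole file, or None."""
    for name in candidates:
        try:
            # Stream through the file so it is never held in memory at once
            with open(path, 'r', encoding=name) as f:
                for _ in f:
                    pass
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next
        return name
    return None
```

For example, guess_encoding('./world_formatted.tsv') would return the name of the first encoding in the list that reads the file cleanly.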

Second, you should probably not read everything into a list with readlines(). For really large files this is a memory hog. It is better to stream through the file by iterating over it, as shown below.

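MAXLINES = 100000

# Adjust the encoding to match the file, e.g. 'latin-1' or 'cp1252'
with open('./world_formatted.tsv', mode='r', encoding='utf-8') as csvfile:
    filename = 0
    outfile = None
    for rownum, line in enumerate(csvfile):
        if rownum % MAXLINES == 0:
            # Every MAXLINES rows, close the current output file and start a new one
            if outfile is not None:
                outfile.close()
            filename += 1
            outfile = open(str(filename) + '.tsv', mode='w', encoding='utf-8')
        outfile.write(line)
    # Close the last output file (None if the input was empty)
    if outfile is not None:
        outfile.close()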

I'm sure you close the files after running; I just added it to be sure. :-)
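Since splitting by row count never has to interpret the characters, another option is to copy raw bytes and sidestep decoding altogether. The following is a sketch under the assumption that the file uses an ASCII-compatible encoding (UTF-8, Latin-1, CP1252 and the like, but not UTF-16), so that every row ends with the byte b'\n'. The function name split_by_rows and its parameters are hypothetical.

```python
import os
from itertools import islice

def split_by_rows(path, max_lines=100000, out_dir='.', suffix='.tsv'):
    """Split path into numbered files of at most max_lines rows each.

    Returns the list of file paths written.
    """
    written = []
    with open(path, 'rb') as src:
        while True:
            # Take the next max_lines rows; only this chunk is held in memory
            chunk = list(islice(src, max_lines))
            if not chunk:
                break
            name = os.path.join(out_dir, str(len(written) + 1) + suffix)
            with open(name, 'wb') as dst:
                dst.writelines(chunk)
            written.append(name)
    return written
```

Calling split_by_rows('./world_formatted.tsv') would write 1.tsv, 2.tsv and so on into the current directory, with the bytes of each row copied unchanged.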

If you are on a *nix-like operating system (or macOS) you might want to check out the split command, which does exactly this (and more): How to split a large text file into smaller files with equal number of lines?

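# Read all rows at once, decoding them as ISO-8859-1
with open('./formatted.tsv', 'r', encoding="ISO-8859-1") as f:
    csvfile = f.readlines()

filename = 1
# Write each block of 100000 rows to its own numbered file
for i in range(0, len(csvfile), 100000):
    with open(str(filename) + '.tsv', 'w+', encoding="ISO-8859-1") as out:
        out.writelines(csvfile[i:i+100000])
    filename += 1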

This is what solved the problem for me. Thank you all for the help.
