How to read a large tsv file in python and convert it to csv
I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:
import pandas as pd
table = pd.read_table(path_of_tsv_file, sep='\t')
table.to_csv(path_and_name_of_csv_file, index=False)
However, this code does not work for my large file, and the kernel resets in the middle. Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?
I am using Windows 10.
Instead of loading all lines into memory at once, you can read line by line and process them one after another:
With Python 3.x:
fs = ","
table = str.maketrans('\t', fs)
fName = 'hrdata.tsv'
f = open(fName, 'r')
try:
    line = f.readline()
    while line:
        print(line.translate(table), end="")
        line = f.readline()
except IOError:
    print("Could not read file: " + fName)
finally:
    f.close()
Input (hrdata.tsv):
Name Hire Date Salary Sick Days remaining
Graham Chapman 03/15/14 50000.00 10
John Cleese 06/01/15 65000.00 8
Eric Idle 05/12/14 45000.00 10
Terry Jones 11/01/13 70000.00 3
Terry Gilliam 08/12/14 48000.00 7
Michael Palin 05/23/13 66000.00 8
Output:
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
Command:
python tsv_csv_convertor.py > new_csv_file.csv
Note:
If you use a Unix env, you can just run the command:
tr '\t' ',' <input.tsv >output.csv
Correct me if I'm wrong, but a TSV file is basically a CSV file that uses a tab character instead of a comma. To translate this efficiently in Python, you need to iterate through the lines of your source file, replace the tabs with commas, and write the new lines to the new file. You don't need any module to do this; writing the solution in Python is actually quite simple:
def tsv_to_csv(filename):
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'
    with open(filename) as original, open(new_filename, 'w') as new:
        for line in original:
            new.write(line.replace('\t', ','))
    return new_filename
Iterating through the lines like this loads each line into memory one at a time, instead of loading the whole file at once. It might take a while to process 12 GB of data, though.
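One caveat worth noting: a plain replace('\t', ',') produces a malformed CSV if any field itself contains a comma or a quote. A hedged sketch using the standard csv module, which quotes such fields automatically (the function name is illustrative):

```python
import csv

def tsv_to_csv_quoted(src, dst):
    # newline='' lets the csv module manage line endings itself
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        writer = csv.writer(fout)
        # csv.writer quotes any field containing commas or quotes,
        # so the output stays parseable as CSV
        writer.writerows(csv.reader(fin, delimiter='\t'))
```

This still streams row by row, so memory use stays flat; it is just slower than a raw byte translation.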
UPDATE: In fact, now that I think about it, it may be significantly faster to use binary I/O on such a large file and replace the tabs with commas in large chunks of the file at a time. This code follows that strategy:
from io import FileIO

# This chunk size loads 1MB at a time for conversion.
CHUNK_SIZE = 1 << 20

def tsv_to_csv_BIG(filename):
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'
    original = FileIO(filename, 'r')
    new = FileIO(new_filename, 'w')
    table = bytes.maketrans(b'\t', b',')
    while True:
        chunk = original.read(CHUNK_SIZE)
        if len(chunk) == 0:
            break
        new.write(chunk.translate(table))
    original.close()
    new.close()
    return new_filename
On my laptop, using a 1 GB TSV file, the first function takes 4 seconds to translate to CSV while the second function takes 1 second. Tuning the CHUNK_SIZE parameter might speed it up more if your storage can keep up, but 1 MB seems to be the sweet spot for me.
Using tr as mentioned in another answer took 3 seconds for me, so the chunked Python approach seems fastest.
You can use Python's built-in read and write to rewrite the file line by line. This may take some time depending on your file size, but it shouldn't run out of memory since you're working line by line.
# open the output file once, rather than reopening it for every line
with open("input.tsv", "r") as input_file, open("output.csv", "w") as output:
    for line in input_file:
        output.write(line.replace("\t", ","))
You can use chunksize to iterate over the entire file in pieces. Note that this uses .read_csv() instead of .read_table():
df = pd.DataFrame()
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
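Applied to the original TSV-to-CSV task, the chunks can be streamed straight to the output file rather than concatenated into one DataFrame, so memory use stays bounded by the chunk size (a sketch; filenames and chunksize are placeholders):

```python
import pandas as pd

def tsv_to_csv_chunked(src, dst, chunksize=100_000):
    first = True
    for chunk in pd.read_csv(src, sep='\t', chunksize=chunksize):
        # write the header only with the first chunk, then append
        chunk.to_csv(dst, mode='w' if first else 'a', header=first, index=False)
        first = False
```

Concatenating every chunk into a growing DataFrame, as above, re-copies the accumulated data on each iteration and eventually holds the whole file in memory, which is exactly what a 12 GB input cannot afford.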
You can also try the low_memory=False flag (source).
And then next would be memory_map (see the parameter list at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
memory_map : bool, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
Note that to_csv() has similar functionality.
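As a sketch of how those two options fit together (filenames and the chunksize value are illustrative; note that memory_map only reduces read I/O overhead, not peak memory, since the whole DataFrame is still materialized):

```python
import pandas as pd

def convert_mapped(src, dst):
    # memory_map=True maps the input file into memory for reading,
    # avoiding per-read I/O overhead; the parsed DataFrame still
    # occupies memory in full.
    df = pd.read_csv(src, sep='\t', memory_map=True)
    # chunksize here controls how many rows to_csv writes per batch.
    df.to_csv(dst, index=False, chunksize=100_000)
```

For a file larger than RAM, this still fails at the read step, so it is best combined with the chunksize-based reading shown above.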