
How to read a large tsv file in python and convert it to csv

I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:

import pandas as pd

table = pd.read_table('path/to/file.tsv', sep='\t')   # placeholder path
table.to_csv('path/to/file.csv', index=False)         # placeholder path

However, this code does not work for my large file, and the kernel resets in the middle.

Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?

I am using Windows 10.

Instead of loading all lines into memory at once, you can read line by line and process them one after another:

With Python 3.x:

fs=","
table = str.maketrans('\t', fs)
fName = 'hrdata.tsv'
f = open(fName,'r')

try:
  line = f.readline()
  while line:
    print(line.translate(table), end = "")
    line = f.readline()

except IOError:
  print("Could not read file: " + fName)

finally:
  f.close()

Input (hrdata.tsv):

Name    Hire Date       Salary  Sick Days remaining
Graham Chapman  03/15/14        50000.00        10
John Cleese     06/01/15        65000.00        8
Eric Idle       05/12/14        45000.00        10
Terry Jones     11/01/13        70000.00        3
Terry Gilliam   08/12/14        48000.00        7
Michael Palin   05/23/13        66000.00        8

Output:

Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8

Command:

python tsv_csv_convertor.py > new_csv_file.csv

Note:

If you use a Unix environment, just run the command:

tr '\t' ',' <input.tsv >output.csv

Correct me if I'm wrong, but a TSV file is basically a CSV file that uses a tab character instead of a comma. To translate this efficiently in Python, you need to iterate through the lines of your source file, replace the tabs with commas, and write each new line to the new file. You don't need any module to do this; writing the solution in Python is actually quite simple:

def tsv_to_csv(filename):
    # Derive the output name by swapping the .tsv extension for .csv.
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    # Stream the source file line by line so only one line
    # is in memory at a time.
    with open(filename) as original, open(new_filename, 'w') as new:
        for line in original:
            new.write(line.replace('\t', ','))

    return new_filename
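For example, converting the hrdata.tsv sample from the earlier answer (the function returns the output file's name):

converted = tsv_to_csv('hrdata.tsv')
print(converted)  # hrdata.csv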

Iterating through the lines like this loads only one line into memory at a time, instead of loading the whole file. It might take a while to process 12 GB of data, though.

UPDATE: In fact, now that I think about it, it may be significantly faster to use binary I/O on such a large file, replacing the tabs with commas on large chunks of the file at a time. This code follows that strategy:

from io import FileIO

# This chunk size loads 1MB at a time for conversion.
CHUNK_SIZE = 1 << 20


def tsv_to_csv_BIG(filename):
    # Derive the output name by swapping the .tsv extension for .csv.
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    original = FileIO(filename, 'r')      # raw (unbuffered) binary reads
    new = FileIO(new_filename, 'w')       # raw (unbuffered) binary writes
    table = bytes.maketrans(b'\t', b',')  # byte-level tab -> comma mapping

    while True:
        chunk = original.read(CHUNK_SIZE)
        if len(chunk) == 0:               # end of file reached
            break
        new.write(chunk.translate(table))

    original.close()
    new.close()
    return new_filename
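A quick way to compare the two functions on your own machine (this timing wrapper is an illustration, not part of the original answer):

import time

for fn in (tsv_to_csv, tsv_to_csv_BIG):
    start = time.time()
    fn('hrdata.tsv')  # placeholder: point this at a large file
    print(f"{fn.__name__}: {time.time() - start:.2f}s")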

On my laptop with a 1 GB TSV file, the first function takes 4 seconds to translate to CSV while the second takes 1 second. Tuning the CHUNK_SIZE parameter might speed it up more if your storage can keep up, but 1 MB seems to be the sweet spot for me.

Using tr as mentioned in another answer took 3 seconds for me, so the chunked Python approach seems fastest.

You can use Python's built-in read and write to rewrite the file line by line. This may take some time depending on your file size, but it shouldn't run out of memory since you're working line by line.

with open("input.tsv", "r") as input_file:
    for line in input_file:
        with open("output.csv", "a") as output:
            line = line.replace("\t", ",")
            output.write(line)

You can use chunksize to iterate over the entire file in pieces. Note that this uses .read_csv() instead of .read_table():

import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    # Each chunk arrives as a DataFrame of up to 1000 rows.
    df = pd.concat([df, chunk], ignore_index=True)

source
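Note that the snippet above still concatenates every chunk back into one DataFrame, so peak memory ends up near the full file size. For the 12 GB conversion in the question, a variant that stays chunked and writes each piece straight to the output holds only one chunk in memory at a time; a minimal sketch, with placeholder file names:

import pandas as pd

with open('output.csv', 'w', newline='') as out:
    for i, chunk in enumerate(pd.read_csv('input.tsv', sep='\t', chunksize=100_000)):
        # Write the header only for the first chunk, then append.
        chunk.to_csv(out, index=False, header=(i == 0))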


You can also try the low_memory=False flag (source).


And then next would be memory_map (scroll down at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

memory_map : bool, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
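Applied to the question's conversion, that would look roughly like this (paths are placeholders; note that memory_map only changes how the bytes are read, so the full DataFrame is still built in memory unless you combine it with chunksize):

import pandas as pd

# Memory-map the file instead of using buffered reads.
table = pd.read_csv('large_file.tsv', sep='\t', memory_map=True)
table.to_csv('large_file.csv', index=False)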

Note that to_csv() has similar functionality.
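As for the Dask route the question asks about: dask.dataframe exposes a pandas-like API that partitions the file and processes it lazily, so it never holds all 12 GB at once. A minimal sketch, assuming dask[dataframe] is installed and using placeholder paths:

import dask.dataframe as dd

# Read the TSV lazily; blocksize controls how large each partition is.
df = dd.read_csv('input.tsv', sep='\t', dtype=str, blocksize='64MB')

# Writes one CSV per partition (out/part-0.csv, ...); pass
# single_file=True to produce a single output file instead.
df.to_csv('out/part-*.csv', index=False)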
