
For basic maths calculations on very large CSV files, how can I do this faster when I have mixed datatypes in my CSV - with Python

I have some very large CSV files (15 GB+) that contain 4 initial rows of metadata/header info, followed by the data. The first 3 columns are 3D Cartesian coordinates (float values) and are the ones I need to change with basic maths operations - e.g. add, subtract, multiply, divide - applied en masse to each coordinate column.

The rest of the columns in the CSV could be of any type, e.g. string, int, etc.

I currently use a script that reads each row of the CSV, makes the modification, then writes to a new file, and it works fine. The problem is that it takes days on a large file. The machine I'm running on has plenty of memory (120 GB), but my current method doesn't utilise it.

I know I can update a column en masse using a numpy 2D array if I skip the 4 metadata rows, e.g.

arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
arr[:,0]=np.add(arr[:,0],300)

This will update the first column by adding 300 to each value. But the issues I have with trying to use numpy are:

  1. Numpy arrays don't support mixed data types for the rest of the imported columns. I don't know in advance what the other columns will hold, so I can't use structured arrays - or rather, I want this to be a universal tool, so I shouldn't have to know.

  2. I can export the numpy array to CSV (provided it's not mixed types), and using regular text functions I can create a separate CSV for the 4 rows of metadata. But then I need to somehow concatenate them, and I don't want to read through every line of the data CSV just to append it to the bottom of the metadata CSV.
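For the concatenation in point 2, the data file can be appended onto the metadata file in large binary chunks without parsing any lines. A sketch (the helper name and chunk size here are my own, not from the question):

```python
import shutil

def append_file(data_path, metadata_path, chunk_bytes=16 * 1024 * 1024):
    """Append the contents of data_path onto the end of metadata_path,
    copying in large binary chunks so no line is ever parsed."""
    with open(metadata_path, 'ab') as dst, open(data_path, 'rb') as src:
        shutil.copyfileobj(src, dst, length=chunk_bytes)
```

Because the copy is byte-for-byte, the only requirement is that the metadata file already ends with a newline.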

I know that if I can make this work with numpy, it will greatly increase the speed by holding the entire CSV in memory while I do the operations, utilising the machine's large amount of memory. I've never used pandas but would also consider it for a solution. I've had a look into pandas, thinking I may be able to do it with dataframes, but I still need to figure out how to use 4 rows as my column header instead of one. Additionally, I haven't seen a way to apply a mass update to a whole column (like I can with numpy) without using a Python loop - though I'm not sure whether that would be slow if it's already in memory.

[Image of the sample data]

The metadata can be empty for rows 2, 3 and 4, but in most cases row 4 will record the data type. There could be up to 200 data columns in addition to the initial 3 coordinate columns.

My current (slow) code looks like this:

import csv


def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):

    # create new empty output file
    open(output_file_path, 'a').close()

    with open(input_file_path, newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            if idx < 4:
                append_row(output_file_path, row)
            else:
                new_x = round(float(row[0]) + move_by_coords['x'], 3)
                new_y = round(float(row[1]) + move_by_coords['y'], 3)
                new_z = round(float(row[2]) + move_by_coords['z'], 3)
                row[0] = new_x
                row[1] = new_y
                row[2] = new_z
                append_row(output_file_path, row)


def append_row(output_file, row):
    f = open(output_file, 'a', newline='')
    writer = csv.writer(f, delimiter=',')
    writer.writerow(row)
    f.close()


if __name__ == '__main__':
    move_by_coords = {
        'x': -338802.5,
        'y': -1714752.5,
        'z': 0
    }

    input_file_path = r'D:\incoming_data\large_data_set1.csv'
    output_file_path = r'D:\outgoing_data\large_data_set_relocated.csv'
    move_txt_coords_to(move_by_coords, input_file_path, output_file_path)
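One likely bottleneck in the code above is that append_row reopens and closes the output file for every single row. A sketch of the same row-by-row approach with both files held open for the whole run (behaviour otherwise unchanged):

```python
import csv

def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):
    # Open input and output once for the whole run instead of once per row.
    with open(input_file_path, newline='') as fin, \
         open(output_file_path, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for idx, row in enumerate(reader):
            if idx >= 4:  # rows 0-3 are metadata, copied through unchanged
                row[0] = round(float(row[0]) + move_by_coords['x'], 3)
                row[1] = round(float(row[1]) + move_by_coords['y'], 3)
                row[2] = round(float(row[2]) + move_by_coords['z'], 3)
            writer.writerow(row)
```

This keeps the constant-memory streaming behaviour; it only removes the per-row open/close overhead.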

Okay, so I've got an almost complete answer, and it was so much easier than trying to use numpy.

import pandas as pd

input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'

move_by_coords = {
    'x': -338802.5,
    'y': -1714752.5,
    'z': 0
}

df = pd.read_csv(input_file_path, header=[0, 1, 2, 3])
df.centroid_x += move_by_coords['x']
df.centroid_y += move_by_coords['y']
df.centroid_z += move_by_coords['z']

df.to_csv(output_file_path, sep=',')

But I have one remaining issue (possibly two). The blank cells in my header are being populated with "Unnamed". I somehow need to substitute a blank string for those in the header rows.
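For reference, one way to blank out those placeholders is to rebuild the MultiIndex, replacing any level that starts with pandas' "Unnamed:" prefix. A sketch (the helper name is mine, and it assumes the default placeholder format pandas generates for blank header cells):

```python
import pandas as pd

def blank_unnamed(columns):
    """Replace pandas' 'Unnamed: ...' placeholders in a MultiIndex
    header with empty strings, leaving real labels untouched."""
    cleaned = [
        tuple('' if str(level).startswith('Unnamed:') else level
              for level in col)
        for col in columns
    ]
    return pd.MultiIndex.from_tuples(cleaned)

# usage: df.columns = blank_unnamed(df.columns)
```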


Also, @FBruzzesi has warned me I may need to use a batch size to make it more efficient, which I'll need to check out.
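If memory ever does become a constraint, pandas can stream the data rows in fixed-size batches via the chunksize parameter of read_csv. A sketch, assuming the 4 metadata rows are skipped here and re-attached separately (the function name and chunk size are my own):

```python
import pandas as pd

def shift_coords_chunked(input_path, output_path, move_by_coords,
                         chunk_rows=1_000_000):
    """Shift the first three coordinate columns, processing the data
    rows in chunks so memory use stays bounded. The 4 metadata rows
    are skipped and must be re-attached separately."""
    first = True
    for chunk in pd.read_csv(input_path, skiprows=4, header=None,
                             chunksize=chunk_rows):
        chunk.iloc[:, 0] += move_by_coords['x']
        chunk.iloc[:, 1] += move_by_coords['y']
        chunk.iloc[:, 2] += move_by_coords['z']
        # Write the first chunk fresh, then append the rest.
        chunk.to_csv(output_path, mode='w' if first else 'a',
                     header=False, index=False)
        first = False
```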

---------------------Update------------- Okay, I resolved the multi-line header issue. I just use the regular csv reader module to read the first 4 rows into a list of rows, then transpose this into a list of columns, converting each column list to a tuple at the same time. Once I have a list of column-header tuples (where each tuple consists of the rows within that column header), I can use the list to name the header. I therefore skip the header rows when reading the csv into the dataframe, and then update each column by its index. I also drop the index column when exporting back to csv. It seems to work very well.

import csv
import itertools
import pandas as pd


def make_first_4rows_list_of_tuples(input_csv_file_path):
    # Read the 4 metadata rows, closing the file when done.
    with open(input_csv_file_path, newline='') as f:
        reader = csv.reader(f)
        header_rows = list(itertools.islice(reader, 0, 4))

    # Transpose rows to columns, converting each column to a tuple.
    header_col_tuples = list(map(tuple, zip(*header_rows)))
    print("Header columns: \n", header_col_tuples)
    return header_col_tuples


if __name__ == '__main__':
    move_by_coords = {
        'x': 1695381.5,
        'y': 5376792.5,
        'z': 100
    }

    input_file_path = r'D:\temp\mydata.csv'
    output_file_path = r'D:\temp\my_updated_data.csv'

    column_headers = make_first_4rows_list_of_tuples(input_file_path)
    df = pd.read_csv(input_file_path, skiprows=4, names=column_headers)
    df.iloc[:, 0] += move_by_coords['x']
    df.iloc[:, 1] += move_by_coords['y']
    df.iloc[:, 2] += move_by_coords['z']
    df.to_csv(output_file_path, sep=',', index=False)

[Image: the updated and exported csv]
