
Efficient way to read CSV with numeric data in Python

I'm trying to convert code written in Matlab into Python. I'm trying to read a .dat file (it's a CSV file). That file has about 30 columns and thousands of rows containing (only!) decimal number data (in Matlab it was read into a double matrix). I'm asking for the fastest way to read the .dat file and the most suitable object/array/... to store the data in.

I tried to read the file in both of the following ways:

import numpy
import pandas as pd

my_data1 = numpy.genfromtxt('FileName.dat', delimiter=',')
my_data2 = pd.read_csv('FileName.dat', delimiter=',')
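
Since the file is purely numeric, declaring the dtype up front should let the parser skip per-column type inference. A minimal sketch, assuming all 30 columns are floats and the file has no header row:

import pandas as pd

# Assumption: the file is all-numeric with no header row.
# A fixed dtype lets the C parser skip type inference.
my_data3 = pd.read_csv('FileName.dat', delimiter=',', header=None, dtype=float)
# .values yields a plain float64 ndarray, the closest analogue of a Matlab double matrix
matrix = my_data3.values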

Is there any better option?

pd.read_csv is pretty efficient as it is. To make it faster, you can try using multiple cores to load your data in parallel. Here is a code example where I used joblib when I needed to speed up loading data with pd.read_csv and processing it.

from os import listdir
from os.path import isfile, join
import pandas as pd
import time
# Parallel processing across CPU cores
from joblib import Parallel, delayed
import multiprocessing
# Garbage collector
import gc

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path to the folder containing the data files
DATA_PATH = 'D:\\'
# Path to save the processed files
TARGET_PATH = 'C:\\'

def read_and_convert(f, num_files):
    # Read the file
    dataframe = pd.read_csv(DATA_PATH + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Process the data (process_data stands in for whatever processing you need)
    data_ask_bid = process_data(dataframe)
    # Store the processed data in the target folder
    data_ask_bid.to_csv(TARGET_PATH + f)
    print(f)
    # Garbage collector. I needed this, otherwise my memory would fill up after a few files, but you might not need it.
    gc.collect()

def main():
    start_time = time.time()
    # Get the paths for all the data files
    files_names = [f for f in listdir(DATA_PATH) if isfile(join(DATA_PATH, f))]

    # Load and process files in parallel
    Parallel(n_jobs=TOTAL_NUM_CORES)(delayed(read_and_convert)(f,len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f,len(files_names)) # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()
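
As a side note, Parallel also collects the return values of the delayed calls, so if you would rather get the loaded frames back in memory than write them to disk, a variant like this should work. A sketch, assuming all the frames fit in RAM and reusing DATA_PATH, TOTAL_NUM_CORES, and files_names from the snippet above:

def read_one(f):
    # Just load and return the frame instead of writing it out.
    # Note: shipping large frames back from worker processes has some pickling overhead.
    return pd.read_csv(DATA_PATH + f, low_memory=False, header=None,
                       names=['Symbol', 'Date_Time', 'Bid', 'Ask'],
                       index_col=1, parse_dates=True)

# Parallel returns the per-file results as a list, in input order
frames = Parallel(n_jobs=TOTAL_NUM_CORES)(delayed(read_one)(f) for f in files_names)
all_data = pd.concat(frames)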
