Efficient way to read CSV with numeric data in Python
I am trying to convert code written in Matlab into Python. I need to read a .dat file (it's a CSV file). The file has about 30 columns and thousands of rows containing (only!) decimal-number data (in Matlab it was read into a double matrix). I'm asking for the fastest way to read the .dat file, and the most similar object/array/... to save the data into.

I tried to read the file in both of the following ways:
my_data1 = numpy.genfromtxt('FileName.dat', delimiter=',' )
my_data2 = pd.read_csv('FileName.dat',delimiter=',')
Is there any better option?
pd.read_csv is pretty efficient as it is. To make it faster, you can try to use multiple cores to load your data in parallel. Here is a code example where I used joblib when I needed to make data loading with pd.read_csv, and the processing of that data, faster.
from os import listdir
from os.path import isfile, join
import pandas as pd
import time
# Parallel processing
from joblib import Parallel, delayed
import multiprocessing
# Garbage collector
import gc

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of the raw data files
DATA_PATH = 'D:\\'
# Path to save the processed files
TARGET_PATH = 'C:\\'

def read_and_convert(f, num_files):
    # Read the file
    dataframe = pd.read_csv(DATA_PATH + f, low_memory=False, header=None,
                            names=['Symbol', 'Date_Time', 'Bid', 'Ask'],
                            index_col=1, parse_dates=True)
    # Process the data (process_data is my own processing function, not shown here)
    data_ask_bid = process_data(dataframe)
    # Store processed data in target folder
    data_ask_bid.to_csv(TARGET_PATH + f)
    print(f)
    # Garbage collector. I needed to use this, otherwise my memory would get
    # full after a few files, but you might not need it.
    gc.collect()

def main():
    start_time = time.time()
    # Get the paths for all the data files
    files_names = [f for f in listdir(DATA_PATH) if isfile(join(DATA_PATH, f))]
    # Load and process files in parallel
    Parallel(n_jobs=TOTAL_NUM_CORES)(delayed(read_and_convert)(f, len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f, len(files_names))  # non-parallel version
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()
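For the original question (a purely numeric ~30-column file, read into the equivalent of a Matlab double matrix), a single pd.read_csv call followed by .to_numpy() is usually the simplest fast path, since pandas' C parser is generally much faster than the pure-Python numpy.genfromtxt. A minimal sketch, assuming a headerless comma-separated file of floats (the file is generated here just for illustration):

```python
import numpy as np
import pandas as pd

# Create a small numeric CSV to stand in for the real FileName.dat
data = np.random.rand(1000, 30)
np.savetxt('FileName.dat', data, delimiter=',')

# Read it with pandas' fast C parser; passing dtype=np.float64
# skips per-column type inference on a known all-numeric file
df = pd.read_csv('FileName.dat', delimiter=',', header=None, dtype=np.float64)

# A 2-D float64 ndarray -- the closest Python equivalent
# of a Matlab double matrix
matrix = df.to_numpy()
print(matrix.shape)   # (1000, 30)
print(matrix.dtype)   # float64
```

If you only ever need the raw matrix and never the DataFrame, numpy.loadtxt also works, but for large files pd.read_csv is typically the faster choice.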