Improving the time efficiency of code on a large dataset - Python

Question

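I am trying to find a way to "optimize" my code and (ideally) reduce the time it takes to run over my whole dataset.

I am working with a simple .csv file that has 3 columns: time_UTC, vmag2D and vdir. The dataset has about 1,420,000 rows (1 million 420 thousand). I wrote this simple loop, and it takes roughly 15 to 20 minutes to run. I am running it on a Mac with an M1 processor, so I suspect I have overcomplicated something for it to take this long (meaning, I don't believe the processor is that "bad"; it should be able to run this small piece of code much faster). If anyone has suggestions on how I can improve it, please let me know!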

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data + ' *name of the .csv file* '

data = pd.read_csv(file)

time_UTC = []
vmag2D = []
vdir = []

for i in range(len(data)):
    x = data.iloc[i][0]
    x1 = x.split(' ')
    x2 = x1[1].split(';')
    date = x.split(' ')[0]
    time_UTC.append(x2[0])
    vmag2D.append(x2[1])
    vdir.append(x2[2])

The code parses each line of the .csv file, and every line follows the same "template": '1994-01-01 00:05:00;0.52;193'

Thanks for any help!

Cheers!

Answer 1

You can split the entire column at once:

import pandas as pd

df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"]*1000})

# split at space " "
df[["date", "time vmag vdir"]] = df["all"].str.split(" ", expand=True)

# split at ";"
df[["time", "vmag2D", "vdir"]] = df['time vmag vdir'].str.split(';', expand=True)

date = pd.to_datetime(df["date"]).to_list()
time_UTC = pd.to_datetime(df["time"]).to_list()
vmag2D = pd.to_numeric(df["vmag2D"]).to_list()
vdir = pd.to_numeric(df["vdir"]).to_list()
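
A variation on the same idea, as a minimal sketch: since the date and time together form a single timestamp, the column can be split once at ";" and the first piece parsed as a full datetime. The column name "all" and the two sample rows are illustrative assumptions, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample rows following the template from the question.
df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193",
                           "1994-01-01 00:10:00;0.48;201"]})

# One vectorized split at ";" yields three string columns.
parts = df["all"].str.split(";", expand=True)
parts.columns = ["time_UTC", "vmag2D", "vdir"]

# Convert each column to a proper dtype with one call per column.
parts["time_UTC"] = pd.to_datetime(parts["time_UTC"], format="%Y-%m-%d %H:%M:%S")
parts["vmag2D"] = pd.to_numeric(parts["vmag2D"])
parts["vdir"] = pd.to_numeric(parts["vdir"])
```

Passing an explicit format to pd.to_datetime spares pandas from having to infer it, which may help on a column with more than a million rows.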

Answer 2

There is no need for any kind of for loop in your code. You are reading the CSV with pandas, but you don't seem to be specifying the right parameters.

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data + ' *name of the .csv file* '

df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)

time_UTC = df.iloc[:, 0]
vmag2D = df.iloc[:, 1]
vdir = df.iloc[:, 2]

If you run this, your result variables (time_UTC, ...) will be of type pandas.Series. You can convert them to a list with .to_list(), or access the underlying array with .values.
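
To illustrate the conversions, here is a small sketch that reads from an in-memory string instead of the real file. The two sample rows and the assumption that the file has no header row are mine; the variable names vmag2D_list and vmag2D_array are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the real file (no header row assumed).
sample = io.StringIO(
    "1994-01-01 00:05:00;0.52;193\n"
    "1994-01-01 00:10:00;0.48;201\n"
)
df = pd.read_csv(sample, sep=";", header=None, parse_dates=[0])

vmag2D = df.iloc[:, 1]            # pandas.Series
vmag2D_list = vmag2D.to_list()    # plain Python list of floats
vmag2D_array = vmag2D.to_numpy()  # NumPy array holding the same data as .values
```

For further numeric work it is usually cheaper to stay with the Series or the NumPy array than to convert to a list.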

Note that I am specifying engine='c' here in the pandas CSV parser, which is using a native C parser that is faster than its python equivalent, as you are processing a large file here.
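
If you want to check the difference on your own machine, a rough timing sketch could look like the following. It uses synthetic data (the question's sample row repeated 100,000 times) and a hypothetical helper read_with; the absolute numbers will vary between systems. As far as I know, pandas already selects the C engine by default for a single-character separator, so passing engine='c' mainly makes the choice explicit.

```python
import io
import time
import pandas as pd

# Hypothetical synthetic data: the question's sample row repeated many times.
text = "1994-01-01 00:05:00;0.52;193\n" * 100_000

def read_with(engine):
    """Parse the synthetic text with the given engine; return (frame, seconds)."""
    start = time.perf_counter()
    frame = pd.read_csv(io.StringIO(text), sep=";", header=None,
                        names=["time_UTC", "vmag2D", "vdir"], engine=engine)
    return frame, time.perf_counter() - start

df_c, seconds_c = read_with("c")
df_py, seconds_py = read_with("python")

# Both engines should yield the same table; only the elapsed time differs.
print(f"c: {seconds_c:.3f}s  python: {seconds_py:.3f}s  equal: {df_c.equals(df_py)}")
```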

Answer 1
1 2022-08-04 17:15:57

Answer 2
1 Accepted 2022-08-04 17:22:59