
Loading data for machine learning

I have a dataset with more than 100,000 data points. I build an ML model and create plots for a subset of the data each time a certain condition is met.

Is it better to load the data once before the for loop, or to load it inside the for loop on every iteration?

In the first case the for loop takes less time to run because I am not loading the data every time, but memory stays allocated for the entire dataset the whole time.

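import pandas as pd

data = pd.read_csv("sample.csv")
data = data.drop(columns=['column2', 'column3'])  # drop returns a new DataFrame

for i in range(0, 10):
    subset = data[data['column1'] == i]  # keep only rows where column1 equals i
    # fit the machine learning model and create plots on subset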

In the second case I load the dataset on every iteration, but only a subset of the data remains in memory after I drop the columns and filter the rows.

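for i in range(0, 10):
    data = pd.read_csv("sample.csv")  # the file is re-read on every iteration
    data = data.drop(columns=['column2', 'column3'])
    subset = data[data['column1'] == i]  # keep only rows where column1 equals i
    # fit the machine learning model and create plots on subset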

Which is the better approach?

I have tried both, but I want to know which one is correct.

In the first approach, you load the data once and then loop over it according to the condition.

In the second approach, every iteration has to load the file and drop the same columns again, which takes a lot of time.

My suggestion is to go with the first approach because its run time is lower, and it is the right way to handle this case.
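If memory is the main worry with the first approach, one option is to read only the columns you need and then split the data by the condition column. Below is a minimal sketch of that idea, not a definitive implementation. The helper names `load_once` and `iter_subsets` are hypothetical, the column name `column4` is made up, and the small in-memory CSV only stands in for "sample.csv". It relies on the `usecols` parameter of `pd.read_csv` and on `DataFrame.groupby`.

```python
import io

import pandas as pd


def load_once(csv_source, keep_columns):
    """Read the CSV a single time, keeping only the needed columns."""
    return pd.read_csv(csv_source, usecols=keep_columns)


def iter_subsets(data, key_column):
    """Yield (value, subset) pairs, one per distinct value in key_column."""
    for value, subset in data.groupby(key_column):
        yield value, subset


# Hypothetical stand-in for "sample.csv"; a file path works the same way.
csv_text = (
    "column1,column2,column3,column4\n"
    "0,a,b,1.0\n"
    "0,c,d,2.0\n"
    "1,e,f,3.0\n"
)

# column2 and column3 are never loaded, so they never occupy memory.
data = load_once(io.StringIO(csv_text), keep_columns=["column1", "column4"])

sizes = {}
for value, subset in iter_subsets(data, "column1"):
    # Fit the model and create the plots on `subset` here.
    sizes[int(value)] = len(subset)
```

With this layout the file is parsed once, the unwanted columns are skipped at read time instead of being dropped afterwards, and each iteration only works on the rows for one value of `column1`.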

Hope this helps.
