Creating an editable copy of a dataset while keeping the original intact using Python
I have recently started studying machine learning and am focusing on the pre-processing stage. I'm creating a Jupyter Notebook which lays out the stages of pre-processing step by step. The original dataset has some missing values in it, which I will replace with the mean value. Because the notebook is step by step, I would like to keep the original dataset intact while having a copy of the dataset that is updated at different steps of the process, i.e. missing cells will be replaced with the mean values. See the code below for what I have done so far. It does what I want so far; it's just missing the copied dataset part.
Any tips or links to tutorials would be appreciated. Thanks.
#libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
#Importing dataset
dataset = pd.read_csv('example.csv')
# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values # attributes to determine dependent variable
Y = dataset.iloc[:, 4].values # dependent variable / Class variable, final column
#printing and displaying dataset
print(dataset)
display(dataset.describe())
#check how many null values in dataset and output value
print('Number of null/NaN values in dataset: ',dataset.isnull().values.sum())
#show how many null values per column
print(dataset.isnull().sum())
#DEALING WITH MISSING VALUES USING MEAN
from sklearn.impute import SimpleImputer
#creating SimpleImputer object, specifying to change missing values to mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#MEAN
imputer = imputer.fit(X[:, 3:4])
X[:,3:4] = imputer.transform(X[:,3:4])
print(X)
I would prefer to use Python's copy module and call deepcopy:
import copy
df_edit = copy.deepcopy(df_original)
Now you can play with df_edit and make your changes without disturbing df_original.
Or you can use the pandas copy method directly, like df_edit = df_original.copy(deep=True)
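Applied to the mean-imputation step from the question, this could look roughly like the sketch below. The DataFrame contents and column names are made up for illustration; only copy(deep=True) and SimpleImputer come from the posts above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the original dataset (column names are made up)
df_original = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 45.0],
    "salary": [50000.0, np.nan, 62000.0, 58000.0],
    "label": [0, 1, 0, 1],
})

# Independent copy: edits to df_edit do not reach df_original
df_edit = df_original.copy(deep=True)

# Replace missing values in the numeric feature columns with the column mean
feature_cols = ["age", "salary"]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df_edit[feature_cols] = imputer.fit_transform(df_edit[feature_cols])

print(df_original.isnull().sum().sum())  # missing values still present in the original
print(df_edit.isnull().sum().sum())      # none left in the working copy
```

Working on the DataFrame copy rather than on an array taken from .values keeps the column names available at every step of the notebook.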
So you just want to create a copy of dataset to revert to later?
You could create a cell in your notebook to back it up, e.g.
dataset_backup = dataset.copy()  # .copy() is needed: plain assignment would only add a second name for the same object
then another to overwrite dataset with the backup:
dataset = dataset_backup.copy()  # copy again so the backup itself is never modified
Then run each cell as and when you want to back up the dataset or revert to the backup.
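One thing to watch when backing up a DataFrame: plain assignment (backup = dataset) only binds a second name to the same object, so in-place changes show up under both names. A small sketch of the difference, using made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
dataset = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

alias = dataset            # second name for the same DataFrame object
backup = dataset.copy()    # independent copy (deep by default)

# Modify the working dataset in place
dataset.loc[1, "a"] = 2.0

print(alias is dataset)             # both names point to one object
print(alias.isnull().sum().sum())   # the alias sees the change
print(backup.isnull().sum().sum())  # the copy still has the missing value

# Revert: take a fresh copy so the backup itself stays untouched
dataset = backup.copy()
```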
Alternatively, if you want to have a record of dataset at each step in the process, just create a new variable for it each time, e.g. dataset_means. This is usually a good idea for debugging purposes.
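A minimal sketch of that one-variable-per-step idea, assuming pandas and made-up columns; the second step (rescaling) is only a placeholder for whatever comes next in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
dataset = pd.DataFrame({
    "height": [1.6, np.nan, 1.8],
    "weight": [60.0, 75.0, np.nan],
})

# Step 1: fill missing values with each column's mean, stored under a new name.
# fillna returns a new DataFrame, so `dataset` itself is left unchanged.
dataset_means = dataset.fillna(dataset.mean())

# Step 2 (made-up later step): rescale each column to the 0-1 range,
# again stored under its own variable
col_min = dataset_means.min()
col_max = dataset_means.max()
dataset_scaled = (dataset_means - col_min) / (col_max - col_min)

print(dataset.isnull().sum())        # the original still reports its missing values
print(dataset_means.isnull().sum())  # the imputed snapshot has none
```

Each intermediate result stays available for inspection, at the cost of holding several copies in memory.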
Is this what you were asking?