Creating an editable copy of a dataset in Python while keeping the original data unchanged

Question

I have recently started studying machine learning and am focusing on the pre-processing stage. I'm creating a Jupyter Notebook that lays out the stages of pre-processing step by step. The original dataset has some missing values, which I will replace with the mean value. Because the notebook is step by step, I would like to keep the original dataset intact while having a copy of the dataset that is updated at different steps of the process, i.e. missing cells will be replaced with the mean values. See the code below for what I have done so far. It does what I want so far; it's just missing the copied-dataset part.

Any tips or links to tutorials would be appreciated. Thanks.

#libraries
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

from sklearn.impute import SimpleImputer

#Importing dataset
dataset = pd.read_csv('example.csv')

# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values # attributes to determine dependent variable
Y = dataset.iloc[:, 4].values # dependent variable / Class variable, final column


#printing and displaying dataset
print(dataset)
display(dataset.describe())

#check how many null values in dataset and output value
print('Number of null/NaN values in dataset: ',dataset.isnull().values.sum())

#show how many null values per column
print(dataset.isnull().sum())

#DEALING WITH MISSING VALUES USING MEAN
#(SimpleImputer is already imported at the top of the notebook)

#creating SimpleImputer object, specifying to change missing values to mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#MEAN
imputer = imputer.fit(X[:, 3:4])
X[:,3:4] = imputer.transform(X[:,3:4])

print(X)

Answer 1

I would prefer to use Python's copy module and call deepcopy:

import copy
df_edit = copy.deepcopy(df_original)

Now you can play with df_edit and make your changes without disturbing df_original.

or

you can directly use the pandas copy method, e.g. df_edit = df_original.copy(deep=True)
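As a minimal sketch of the approach above, assuming a small made-up DataFrame in place of example.csv (the column names and values are hypothetical), the following fills the missing values in the copy and checks that the original still contains them. It uses pandas' fillna for the mean rather than SimpleImputer, purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset loaded from example.csv
df_original = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "salary": [50000.0, 60000.0, np.nan],
})

# Deep copy: df_edit gets its own data, independent of df_original
df_edit = df_original.copy(deep=True)

# Replace missing values in the copy, column by column, with the column mean
for col in df_edit.columns:
    df_edit[col] = df_edit[col].fillna(df_edit[col].mean())

# Count missing values in each DataFrame
print("missing in original:", df_original.isnull().sum().sum())
print("missing in copy:", df_edit.isnull().sum().sum())
```

The original keeps its NaN cells, so later notebook cells can still display or re-process the raw data.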

Answer 2

So you just want to create a copy of dataset to revert to later?

You could create a cell in your notebook to back it up, e.g.

dataset_backup = dataset.copy()

then another to overwrite dataset with the backup

dataset = dataset_backup.copy()

Then run each as and when you want to back up the dataset or revert to the backup. Note that .copy() is needed here: a plain assignment such as dataset_backup = dataset only gives the same DataFrame a second name, so in-place changes would show up in both.
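One caveat about backups like this: in Python, plain assignment binds a second name to the same object and does not copy it. The sketch below, using a made-up one-column DataFrame (the names alias, backup and score are hypothetical), contrasts a plain assignment with an explicit .copy() after an in-place change.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
dataset = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

alias = dataset          # plain assignment: the same object under a second name
backup = dataset.copy()  # independent copy (DataFrame.copy is deep by default)

# Modify dataset in place
dataset.loc[1, "score"] = 2.0

print("alias is dataset:", alias is dataset)
print("missing in alias:", alias["score"].isnull().sum())
print("missing in backup:", backup["score"].isnull().sum())

# Revert: rebind dataset to a fresh copy so the backup itself stays pristine
dataset = backup.copy()
```

Only the explicit copy preserves the missing value; the alias reflects every in-place change made through dataset.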

Alternatively, if you want a record of dataset at each step in the process, just create a new variable for it each time, e.g. dataset_means. This is usually a good idea for debugging purposes.
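A rough sketch of the one-variable-per-step idea, assuming hypothetical columns and a hypothetical second step (standardisation) that is not part of the original question. Each step starts from a copy of the previous one, so every intermediate result stays available for inspection.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; "city" is non-numeric and is left alone
dataset = pd.DataFrame({
    "city": ["A", "B", "C"],
    "height": [150.0, np.nan, 170.0],
    "weight": [50.0, 60.0, np.nan],
})

# Step 1: mean imputation on the numeric columns, stored under its own name
dataset_means = dataset.copy()
numeric_cols = dataset_means.select_dtypes(include="number").columns
dataset_means[numeric_cols] = dataset_means[numeric_cols].fillna(
    dataset_means[numeric_cols].mean()
)

# Step 2 (assumed for illustration): standardise the imputed numeric columns
dataset_scaled = dataset_means.copy()
dataset_scaled[numeric_cols] = (
    dataset_means[numeric_cols] - dataset_means[numeric_cols].mean()
) / dataset_means[numeric_cols].std()

# All three stages can be displayed side by side in the notebook
print(dataset)
print(dataset_means)
print(dataset_scaled)
```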

Is this what you were asking?

Answer 1: 2, 2020-04-23 10:26:03

Answer 2: 0, 2020-04-23 10:15:29
