Creating an editable copy of a dataset while keeping the original intact using Python

I have recently started studying machine learning and am focusing on the pre-processing stage. I'm creating a Jupyter Notebook that lays out the stages of pre-processing step by step. The original dataset has some missing values, which I will replace with the mean value. Because the notebook goes step by step, I would like to keep the original dataset intact while having a copy of the dataset that is updated at different steps of the process, i.e. missing cells are replaced with the mean values. See the code below for what I have done so far. It does what I want so far; it's just missing the copied-dataset part.

Any tips or links to tutorials would be appreciated. Thanks.

#libraries
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

from sklearn.impute import SimpleImputer

#Importing dataset
dataset = pd.read_csv('example.csv')

# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values # attributes to determine dependent variable
Y = dataset.iloc[:, 4].values # dependent variable / Class variable, final column


#printing and displaying dataset
print(dataset)
display(dataset.describe())

#check how many null values in dataset and output value
print('Number of null/NaN values in dataset: ',dataset.isnull().values.sum())

#show how many null values per column
print(dataset.isnull().sum())

#DEALING WITH MISSING VALUES USING MEAN

#creating SimpleImputer object, specifying to change missing values to mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#MEAN
imputer = imputer.fit(X[:, 3:4])
X[:,3:4] = imputer.transform(X[:,3:4])

print(X)

I would prefer to use Python's copy module and call deepcopy:

import copy
df_edit = copy.deepcopy(df_original)

Now you can work with df_edit and make your changes without disturbing df_original.

or

you can use the pandas copy method directly: df_edit = df_original.copy(deep=True)
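As a minimal sketch of how this could fit the mean-imputation step from the question: the small DataFrame and its column names below are hypothetical stand-ins for the data in example.csv, and `fillna` with the column means is used here in place of `SimpleImputer` to keep the example short. For a DataFrame, `copy.deepcopy` and `.copy(deep=True)` should behave the same way.

```python
import copy

import numpy as np
import pandas as pd

# Hypothetical stand-in for the original dataset (the real one comes from example.csv).
df_original = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "salary": [50000.0, 60000.0, np.nan],
    "purchased": ["no", "yes", "no"],
})

# Two ways to get an independent copy of the DataFrame.
df_edit = df_original.copy(deep=True)
df_edit_alt = copy.deepcopy(df_original)

# Replace missing values in the numeric columns of the copy with the column mean.
numeric_cols = df_edit.select_dtypes(include="number").columns
df_edit[numeric_cols] = df_edit[numeric_cols].fillna(df_edit[numeric_cols].mean())

print(df_original.isnull().sum().sum())  # missing values are still present in the original
print(df_edit.isnull().sum().sum())      # none are left in the edited copy
```

Only df_edit is modified here, so any later notebook cell can still display df_original with its missing values.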

So you just want to create a copy of dataset to revert to later?

You could create a cell in your notebook to back it up, e.g. (note the .copy(): plain assignment would only give the same DataFrame a second name)

dataset_backup = dataset.copy()

then another to overwrite dataset with the backup (copying again so the backup itself stays untouched)

dataset = dataset_backup.copy()

Then run each as and when you want to back up the dataset or revert to the backup.
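One thing to keep in mind with a backup cell like this: in Python, plain assignment does not copy a DataFrame, it only binds another name to the same object, so in-place edits show up under both names. The sketch below illustrates the difference on a hypothetical one-column frame (the `score` column and the name `alias` are made up for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `dataset`.
dataset = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

alias = dataset          # no copy: a second name for the same object
backup = dataset.copy()  # independent copy

# Fill the missing value in place on the working dataset.
dataset.loc[1, "score"] = 2.0

print(alias is dataset)                 # True, both names point to one object
print(alias["score"].isnull().sum())    # 0, the alias sees the edit
print(backup["score"].isnull().sum())   # 1, the backup is unaffected

# Revert: take a fresh copy so the backup itself stays pristine.
dataset = backup.copy()
print(dataset["score"].isnull().sum())  # 1 again
```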

Alternatively, if you want a record of the dataset at each step in the process, just create a new variable for it each time, e.g. dataset_means. This is usually a good idea for debugging purposes.
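A rough sketch of the one-variable-per-step idea, assuming each step uses an operation that returns a new DataFrame rather than modifying in place. The toy columns and the second step (min-max scaling, stored in `dataset_scaled`) are hypothetical additions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook this would be the frame read from example.csv.
dataset = pd.DataFrame({
    "height": [150.0, np.nan, 170.0],
    "weight": [50.0, 60.0, np.nan],
})

# Step 1: mean imputation. fillna returns a new DataFrame, so `dataset` is untouched.
dataset_means = dataset.fillna(dataset.mean())

# Step 2 (hypothetical extra step): min-max scaling, again producing a new object.
col_min = dataset_means.min()
col_range = dataset_means.max() - col_min
dataset_scaled = (dataset_means - col_min) / col_range

# Every stage stays available for inspection in later cells.
print(dataset)
print(dataset_means)
print(dataset_scaled)
```

Keeping each stage under its own name costs some memory, but it lets every notebook section display exactly the state it describes.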

Is this what you were asking?
