Converting a dataframe to a csv file
I am working with a dataset, Adult, that I have changed and would like to save as a csv. However, after saving it as a csv and re-loading the data to work with it again, the data is not converted properly. The headers are not preserved and some columns are now combined. I have searched this site and elsewhere online, but what I have tried is not working. I load the data in with the following code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
When running df.head the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and when running print(df.loc[:,"age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have 2 columns.
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will subsequently be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
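As a rough illustration of why pickling sidesteps the problem, the sketch below round-trips a small made-up frame and checks that it comes back unchanged. The three columns, their values, and the temporary file location are placeholders for illustration, not the real Adult data.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the Adult data; the values are made up for illustration.
adult = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
        "capitalgain": [2174.0, None, 0.0],  # includes a missing value
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "adult.pickle")
    adult.to_pickle(path)            # binary dump of the whole DataFrame
    restored = pd.read_pickle(path)  # no header or index options needed

# Column names, dtypes, index, and missing values all survive the round trip.
same_frame = restored.equals(adult)
same_dtypes = bool((restored.dtypes == adult.dtypes).all())
print(same_frame, same_dtypes)
```

Keep in mind that a pickle file can only be read from Python and should not be loaded from an untrusted source, so CSV is still the better choice when the file has to be shared.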
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can add the parameter index=False to the to_csv('file_name.csv', index=False) call if you are not interested in saving the DataFrame row index. Otherwise, when reading the csv file again you need to specify the index_col parameter.
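A minimal sketch of the two options follows. It uses a tiny made-up frame and an in-memory buffer instead of a real file; both are assumptions made only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

# Option 1: do not write the row index at all.
csv_without_index = df.to_csv(index=False)
option_1 = pd.read_csv(io.StringIO(csv_without_index))

# Option 2: write the index (the default) and tell read_csv where it is.
csv_with_index = df.to_csv()
option_2 = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# What happens if the index is written but index_col is not given:
# the index comes back as an extra data column named "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_with_index))

print(list(option_1.columns))
print(list(option_2.columns))
print(list(naive.columns))
```

Either option returns the original two columns; only the third read picks up the stray `Unnamed: 0` column.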
According to the documentation, value_counts() returns a Series object. You see two columns because the first one is the index, the age values (36, 31, ...), and the second is the count (898, 888, ...).
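To make the index/value split concrete, here is a small sketch on a made-up Series of ages (not the Adult data). The column names `age` and `count` in the last step are arbitrary choices for this example.

```python
import pandas as pd

ages = pd.Series([36, 36, 36, 31, 31, 34], name="age")
counts = ages.value_counts()

print(type(counts).__name__)   # the result is a Series, not a DataFrame
print(counts.index.tolist())   # distinct ages: the left column when printed
print(counts.tolist())         # occurrences of each age: the right column

# If a real two-column table is wanted, move the index into a column.
table = counts.rename_axis("age").reset_index(name="count")
print(list(table.columns))
```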
I replicated your code and it works for me. The order of the columns is preserved. Let me show what I tried. I ran this batch of code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns =["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked. Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv',header = True)
Load this csv file into a new_df. Reading it back generates a new column that holds the saved index. It is unnecessary and you can drop it as follows:
new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)
When I compare the columns of new_df with those of the original df using this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you might have been saving the file twice, as here. You only need to save it once.
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
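The `Unnamed: 0` and `Unnamed: 0.1` columns in the question's output are consistent with the frame having gone through the save/load cycle twice with the index written each time. That is a guess about what happened, not something the question confirms. The sketch below reproduces the effect on a one-column made-up frame; `round_trip` is a hypothetical helper written for this example, and an in-memory buffer stands in for the file.

```python
import io

import pandas as pd

df = pd.DataFrame({"age": [39, 50, 38]})


def round_trip(frame):
    """Write the frame with its index, then read it back without index_col."""
    return pd.read_csv(io.StringIO(frame.to_csv()))


once = round_trip(df)     # gains one extra column
twice = round_trip(once)  # gains a second extra column

print(list(once.columns))
# Sorted, because the order of the two extra names may differ between
# pandas versions.
print(sorted(twice.columns))
```

Each cycle adds one more leftover index column, so the fix is to pass `index=False` when saving or `index_col=0` when loading, whichever fits the workflow.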
When you save a dataframe, the first column written is the index by default, so you should load it as the index when reading the file back. Also, whenever you assign a dataframe to a new variable and want an independent object, make sure to copy it:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
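The difference between plain assignment and copy() can be seen in a short sketch. The frame and the `age_plus_one` column are made up for illustration.

```python
import pandas as pd

adult = pd.DataFrame({"age": [39, 50, 38]})

alias = adult               # same object under a second name
independent = adult.copy()  # separate object with its own data

# Adding a column through the alias also changes `adult`.
alias["age_plus_one"] = alias["age"] + 1

print(alias is adult)
print("age_plus_one" in adult.columns)
print("age_plus_one" in independent.columns)
```

Only the copy is unaffected by the new column, which is why `df = Adult.copy()` is the safer habit when the original frame should stay untouched.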
The first column in the output of print(df.loc[:,"age"].value_counts()) is the index (the age values), which is always shown when a Series is printed. To get only the counts as a plain list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())