
Converting a dataframe to a CSV file

I am working with a dataset, Adult, that I have modified and would like to save as a CSV. However, after saving it as a CSV and re-loading it to work with again, the data is not read back properly: the headers are not preserved and some columns are now combined. I have searched this site and elsewhere online, but nothing I have tried works. I load the data with the following code:

import numpy as np  # Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
# Read the data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# skipinitialspace=True removes the extra spaces that follow each delimiter
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",  
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]

After inserting missing values and changing the data frame as desired, I have tried:

df = Adult

df.to_csv('file_name.csv',header = True)

df.to_csv('file_name.csv')

and a few other variations. How can I save the data to a CSV and preserve the correct format for the next time I read the file in?

When re-loading the data I use this code:

import pandas as pd
df = pd.read_csv('file_name.csv')

When running df.head, the output is:

<bound method NDFrame.head of        Unnamed: 0  Unnamed: 0.1  age  ... Black  Asian-Pac-Islander Other
0               0             0   39  ...     0                   0     0
1               1             1   50  ...     0                   0     0
2               2             2   38  ...     0                   0     0
3               3             3   53  ...     1                   0     0

and for print(df.loc[:,"age"].value_counts()) the output is:

36    898
31    888
34    886
23    877
35    876

which should not have two columns.

If you pickle it like so:

Adult.to_pickle('adult.pickle')

You will then be able to read it back in with read_pickle, as follows:

original_adult = pd.read_pickle('adult.pickle')
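
As a minimal sketch of the pickle round trip described above, the snippet below uses a small hypothetical frame (adult_small, with made-up rows standing in for the modified Adult data) and a temporary directory instead of the real file. It only illustrates that column names and values come back unchanged.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the modified Adult frame (not the real data).
adult_small = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
        "hoursperweek": [40.0, 13.0, 40.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "adult.pickle")
    adult_small.to_pickle(pickle_path)      # write the frame as a pickle
    restored = pd.read_pickle(pickle_path)  # read it back unchanged

# Column names and values survive the round trip.
print(list(restored.columns))
print(restored.equals(adult_small))
```

Pickle files are Python-specific and should only be loaded from sources you trust, so CSV may still be the better choice if the file has to be shared.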

Hope that helps.

If you want to preserve the output column order, you can specify the columns directly while saving the DataFrame:

import pandas as pd

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" 
df = pd.read_csv(url2, header=None, skipinitialspace=True)

my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
             "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
             "less50kmoreeq50kn"]
df.columns = my_columns

# do the computation ...

df[my_columns].to_csv('file_name.csv') 

You can pass index=False, as in to_csv('file_name.csv', index=False), if you are not interested in saving the DataFrame row index. Otherwise, when reading the CSV file back you need to specify the index_col parameter.
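
To make the effect of index=False concrete, here is a small sketch that writes the same frame with and without the row index and then compares the column names after reading each file back. The two-row frame and the file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame; the values are made up.
frame = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Male"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    frame.to_csv(with_index)                  # row index is written as a first, unnamed column
    frame.to_csv(without_index, index=False)  # row index is left out of the file

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # includes an extra "Unnamed: 0" column
print(cols_without_index)  # only the original columns
```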


According to the documentation, value_counts() returns a Series object. You see two columns because the first one is the index, the ages (36, 31, ...), and the second is the counts (898, 888, ...).
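
A short sketch of that point, using a handful of made-up ages rather than the real column: the printed "first column" is the Series index and the "second column" is its values.

```python
import pandas as pd

# Made-up ages, chosen so that no two ages share the same count.
ages = pd.Series([36, 31, 36, 23, 36, 31], name="age")
counts = ages.value_counts()

print(type(counts).__name__)   # the result is a Series, not a two-column frame
print(counts.index.tolist())   # the index holds the distinct ages
print(counts.tolist())         # the values hold how often each age occurs
```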

I replicated your code and it works for me. The order of the columns is preserved.

Let me show what I tried. I ran this batch of code:

import numpy as np  # Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Read the data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# skipinitialspace=True removes the extra spaces that follow each delimiter
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)

##Assigning reasonable column names to the dataframe
Adult.columns =["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",  
             "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
             "less50kmoreeq50kn"]

This worked perfectly. Then:

df = Adult 

This also worked. Then I saved this data frame to a CSV file. Make sure you provide the absolute path to the file, even if it is being saved in the same folder as this script, because a relative path is resolved against the current working directory, which is not necessarily the script's folder.

df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv',header = True)

Load this CSV file into new_df. Because the index was written to the file, reading it back produces an extra column, 'Unnamed: 0', that holds the old index. It is unnecessary, and you can drop it as follows:

new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)

When I compare the columns of new_df with those of the original df, using this line of code

new_df.columns == df.columns

I get

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True])

You might not have been providing the absolute path to the file, or you might have been saving the file twice, as here. You only need to save it once.

df.to_csv('file_name.csv',header = True)

df.to_csv('file_name.csv')
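
One way the 'Unnamed: 0' and 'Unnamed: 0.1' columns in the question's output can arise is a repeated save and reload cycle in which the index is written each time but never read back as the index. The sketch below reproduces that with a tiny made-up frame and a temporary file. Whether this is what actually happened in the question is an assumption.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame; the values are made up.
frame = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Male"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "file_name.csv")

    frame.to_csv(csv_path)              # first save: index written as an unnamed column
    first_load = pd.read_csv(csv_path)  # the old index comes back as a regular column

    first_load.to_csv(csv_path)          # second save: a new index column is written in front
    second_load = pd.read_csv(csv_path)  # now two leftover index columns are present

print(list(first_load.columns))
print(list(second_load.columns))
```

Each cycle adds one more leftover index column, so saving with index=False or reading with index_col=0 stops the growth.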

In general, when you save a dataframe, the first column of the file is the index, so you should load it as the index when reading the file back. Also, whenever you assign a dataframe to a new variable, make sure to copy it, since plain assignment only creates a second name for the same object:

df = Adult.copy()
df.to_csv('file_name.csv',header = True)

And to read:

df = pd.read_csv('file_name.csv', index_col=0)
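
A minimal sketch of this save and read pair, assuming a small made-up frame with row labels (row_a, row_b are hypothetical) so that the restored index is easy to spot:

```python
import os
import tempfile

import pandas as pd

# Made-up frame with explicit row labels so the index is visible after reloading.
frame = pd.DataFrame(
    {"age": [39, 50], "sex": ["Male", "Female"]},
    index=["row_a", "row_b"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "file_name.csv")
    frame.to_csv(csv_path, header=True)           # index goes into the first column
    restored = pd.read_csv(csv_path, index_col=0)  # first column becomes the index again

print(list(restored.columns))  # no extra "Unnamed: 0" column
print(restored.index.tolist())
```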

The first column in the output of print(df.loc[:,"age"].value_counts()) is the index, which is shown whenever you print the result. To get just the counts as a list, use the to_list method:

print(df.loc[:,"age"].value_counts().to_list())
