How to import .dta via pandas and describe data?

I am new to Python and have a simple problem. As a first step, I want to load some sample data I created in Stata. As a second step, I would like to describe the data in Python; that is, I'd like a list of the imported variable names. So far I've done this:

from pandas.io.stata import StataReader

reader = StataReader('sample_data.dta')
data = reader.data()

dir()

I get the following error:

anaconda/lib/python3.5/site-packages/pandas/io/stata.py:1375: UserWarning: 'data' is deprecated, use 'read' instead
  warnings.warn("'data' is deprecated, use 'read' instead")

What does it mean and how can I resolve the issue? And is dir() the right way to find out which variables I have in the data?

Reading a Stata file with pandas.io.stata.StataReader.data is deprecated in pandas 0.18.1, which is why you are getting that warning.

Instead, use pandas.read_stata to read the file, as shown:

import pandas as pd
df = pd.read_stata('sample_data.dta')
df.dtypes   # dtype of each column (variable) in the DataFrame
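To get the list of variable names the question asks for, df.columns is probably the more direct tool: dir() called without arguments lists the names defined in the current Python scope, not the columns of a DataFrame. Below is a minimal sketch of that idea. The sample DataFrame and the temporary file are made up for illustration, so that the snippet runs without the original sample_data.dta.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the Stata file from the question.
sample = pd.DataFrame({"id": [1, 2, 3], "income": [10.5, 20.0, 30.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_data.dta")
    # Write a .dta file so there is something to read back.
    sample.to_stata(path, write_index=False)

    # read_stata loads the whole file into a DataFrame.
    df = pd.read_stata(path)

# The imported variable names are the DataFrame's column labels.
variable_names = list(df.columns)
print(variable_names)

# dtypes and summary statistics, roughly comparable to describing the data.
print(df.dtypes)
print(df.describe())
```

With a real file, only the pd.read_stata call and the lines after it are needed.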

Sometimes this did not work for me, especially when the dataset was large. So what I propose here is two steps (Stata, then Python).

In Stata, run the following command:

export excel Cevdet.xlsx, firstrow(variables)

To also export the variable labels, run the following:

preserve
describe, replace
list
export excel using myfile.xlsx, replace first(var)
restore

This generates two files, Cevdet.xlsx and myfile.xlsx.

Now go to your Jupyter notebook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('Cevdet.xlsx')

This reads the data into Jupyter (Python 3); myfile.xlsx can be read the same way.
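If the only reason for the second Excel file is the variable labels, pandas may be able to read them from the .dta file directly. The sketch below assumes a reasonably recent pandas, where StataReader works as a context manager and exposes read() and variable_labels(). The sample data and labels are hypothetical.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

# Hypothetical data and labels, only here to make the sketch runnable.
sample = pd.DataFrame({"id": [1, 2], "income": [10.5, 20.0]})
labels_in = {"id": "Respondent ID", "income": "Monthly income"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_data.dta")
    sample.to_stata(path, write_index=False, variable_labels=labels_in)

    with StataReader(path) as reader:
        # read() is the replacement for the deprecated data() method.
        df = reader.read()
        # Maps each variable name to its Stata variable label.
        labels = reader.variable_labels()

for name, label in labels.items():
    print(f"{name}: {label}")
```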

My advice is to save the DataFrame as a pickle (especially if it is big):

df.to_pickle('Cevdet')

The next time you open Jupyter you can simply run:

df=pd.read_pickle("Cevdet")
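If the large-file problem is memory related (an assumption; the answer does not say what went wrong), another option might be to skip Excel and read the .dta file in chunks via the chunksize argument of pd.read_stata, then cache the result as a pickle as described above. A rough sketch with made-up data and a hypothetical Cevdet.pkl file name:

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataset; a real one would be far larger.
sample = pd.DataFrame({"id": range(10), "value": [i * 0.5 for i in range(10)]})

with tempfile.TemporaryDirectory() as tmp:
    dta_path = os.path.join(tmp, "sample_data.dta")
    pkl_path = os.path.join(tmp, "Cevdet.pkl")
    sample.to_stata(dta_path, write_index=False)

    # With chunksize set, read_stata returns an iterator of DataFrames
    # instead of loading the whole file at once.
    with pd.read_stata(dta_path, chunksize=4) as reader:
        chunks = [chunk for chunk in reader]

    # Reassemble the pieces into one DataFrame.
    df = pd.concat(chunks, ignore_index=True)

    # Cache the result so later sessions can skip the Stata import.
    df.to_pickle(pkl_path)
    cached = pd.read_pickle(pkl_path)

print(cached.shape)
```

Concatenating all chunks still needs enough memory for the full DataFrame; when that is the bottleneck, each chunk could instead be filtered or aggregated inside the loop before anything is kept.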
