I understand it is highly unlikely, but I can't figure out why Python outputs a slightly different dataset after simple manipulations that I think are identical to the ones I do in Stata. So, Stata:
use "filename", clear
drop if varname < 1500
sum
Obs: 610
Mean: 1339.482
Std: 17.27477
Min: 1304
max: 1368
mdesc varname
Missing: 10953
Total: 11563
Percent missing: 94.72
drop if varname < 1500
varname | Obs: 389 Mean: 1350.599 Std. Dev.: 9.564949 Min: 1333 Max: 1368
Type: float
Meanwhile, Python:
import pandas as pd
df = pd.read_stata("filename.dta", convert_missing = False)
df = df[df.varname<1500]
df.describe()
Python (raw data: df = pd.read_stata("filename.dta")): varname
Count: 610
Mean: 1339.481934
Std: 17.274755
Min: 1304.000000
25%: 1326.000000
50%: 1341.000000
75%: 1353.000000
max: 1368.000000
df.isnull().sum()
varname 10953
So the number of missing values in the raw data is the same in Stata and Python, but after dropping I get two different datasets.
df = df[df.varname<1500]
Count: 288.000000
Mean: 1325.760376
Std: 13.369122
Min: 1304.000000
25%: 1316.000000
50%: 1325.000000
75%: 1332.000000
max: 1365.000000
In particular, the differences are in the observation counts. For some variables there is a patterned difference, e.g. Stata: 11 342 obs, Python: 5064 obs (roughly half as many). For other variables the difference is not patterned; the values are simply different. The summary statistics are close, but not the same. I am new to Python, so can you please tell me whether it is really possible that it operates on data differently from Stata?
I figured out that I dropped incorrectly: instead of df = df[df.varname<1500], I should have typed df_new = df.drop(df[df.varname<1500].index). I don't know the difference, but now I have the dataset that I need. Thanks everyone for spending time here!
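For anyone wondering what the difference between the two statements is, here is a minimal sketch on made-up data. The values below are illustrative only and are not taken from the original file; only the column name varname comes from the question.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): two values below 1500,
# one value above it, and two missing values.
df = pd.DataFrame({"varname": [1304.0, 1368.0, 1600.0, np.nan, np.nan]})

# Boolean indexing KEEPS the rows where the condition is True.
# NaN < 1500 evaluates to False, so missing rows are left out as well.
kept = df[df.varname < 1500]

# drop() REMOVES the rows whose index labels are passed in,
# so everything else (including the missing rows) stays.
dropped = df.drop(df[df.varname < 1500].index)

print(kept.varname.tolist())     # [1304.0, 1368.0]
print(dropped.varname.tolist())  # [1600.0, nan, nan]
```

So the first statement keeps exactly the rows that the second one throws away, which would explain why the observation counts came out so different.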
I guess you misinterpreted how a boolean condition behaves inside df[]. In pandas, df[condition] selects the rows for which the condition is True. In your example, df = df[df.varname<1500] returns the rows where df.varname<1500 is True. So you get the rows satisfying df.varname<1500, instead of dropping them.
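If you want to stay with the df[] syntax and still drop those rows, one option is to negate the condition with ~. A small sketch on toy data (the values are made up for illustration): because a comparison with NaN is False, the negated mask keeps the missing rows, which as far as I know matches Stata, where missing values count as larger than any number and so survive drop if varname < 1500.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration, not the original dataset).
df = pd.DataFrame({"varname": [1304.0, 1600.0, np.nan]})

# The condition is False for the NaN row, not missing.
mask = df.varname < 1500
print(mask.tolist())  # [True, False, False]

# ~mask flips the condition: keep every row where it is NOT True.
df_new = df[~mask]

# With a unique index this gives the same result as the drop() version.
same = df_new.equals(df.drop(df[mask].index))
print(same)  # True
```

Note that the two forms are only interchangeable when the index labels are unique; drop() works on labels, so duplicated labels would remove more rows than the mask selects.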