Python vs Stata in data preparation

Question

I know it is unlikely, but I can't figure out why Python produces a slightly different dataset after simple manipulations that I believe are identical to what I do in Stata. First, Stata:

use "filename", clear  
drop if varname < 1500  
sum

STATA (raw data)

Obs: 610
Mean: 1339.482
Std: 17.27477
Min: 1304
Max: 1368

Checking for missing ( `mdesc varname` )

Missing: 10953
Total: 11563
Percent missing: 94.72

STATA (after `drop if varname < 1500` ):

Obs: 389
Mean: 1350.599
Std: 9.564949
Min: 1333
Max: 1368
Type: float

Meanwhile, Python:

import pandas as pd  
df = pd.read_stata("filename.dta", convert_missing = False)  
df = df[df.varname<1500]  
df.describe()

PYTHON (raw data: df=pd.read_stata("filename.dta") ) : varname
Count: 610
Mean: 1339.481934
Std: 17.274755
Min: 1304.000000
25%: 1326.000000
50%: 1341.000000
75%: 1353.000000
max: 1368.000000

df.isnull().sum()
varname 10953

So the number of missing values in the raw data is the same in Stata and Python, but after dropping I get two different datasets.

PYTHON, after `df = df[df.varname<1500]`:

Count: 288.000000
Mean: 1325.760376
Std: 13.369122
Min: 1304.000000
25%: 1316.000000
50%: 1325.000000
75%: 1332.000000
max: 1365.000000

In particular, the differences are in the counts of observations. For some variables there is a patterned difference, e.g. Stata: 11 342 obs, Python: 5064 obs (roughly half as many). For other variables the difference is not patterned, just different values. The summary statistics are not too different, but they are different. I am new to Python, so can you please tell me whether it is indeed possible that it operates on data differently from Stata?

Edit:

I figured out that I was dropping incorrectly: instead of `df = df[df.varname<1500]`, I should have typed `df_new = df.drop(df[df.varname<1500].index)`. I don't know the difference, but now I have the dataset that I need. Thanks everyone for spending time here!

Answer 1

I think you are misinterpreting how a boolean expression inside `df[]` behaves.

In pandas, `df[condition]` selects the rows for which the condition evaluates to True.

In your example, `df = df[df.varname<1500]` returns the rows for which `df.varname<1500` is True. So you keep the rows satisfying `df.varname<1500` instead of dropping them.
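
To make the difference concrete, here is a minimal sketch on toy data. The values are made up for illustration and are not from your file; `NaN` stands in for a Stata missing value. It shows that the boolean mask keeps the matching rows, that negating the mask with `~` drops them, and that this is equivalent to the `df.drop(...)` call from your edit. It also shows one thing worth watching: a comparison with `NaN` is False in pandas, so rows with missing values survive `~mask` but not `>= 1500`. As far as I know, Stata treats missing as larger than any number, so `drop if varname < 1500` keeps the missing rows as well.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN plays the role of a Stata missing value.
df = pd.DataFrame({"varname": [1304.0, 1490.0, 1500.0, 1620.0, np.nan]})

mask = df["varname"] < 1500             # NaN < 1500 evaluates to False

kept = df[mask]                         # selects the rows where mask is True
dropped = df[~mask]                     # drops those rows instead (NaN row stays)
dropped_alt = df.drop(df[mask].index)   # same result via drop()
ge = df[df["varname"] >= 1500]          # NaN >= 1500 is also False: NaN row is lost

print(len(kept), len(dropped), len(ge))
print(dropped.equals(dropped_alt))
```

So if the goal is to mirror `drop if varname < 1500`, either `df[~mask]` or `df.drop(df[mask].index)` should do; `df[df.varname >= 1500]` would additionally discard the rows with missing values.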

solution1
2 ACCEPTED 2020-05-25 15:24:09
