Data subsetting in Stata with integers and strings vs. R

Question

I've never used Stata before and know very little about it. I've been trying to collapse a dataset of bilateral information by year, country1, and country2, taking the means of all other variables. In R, I tried running:

aggregate(dataset,by=list(dataset$year,dataset$country1,dataset$country2),FUN=mean,na.rm=TRUE)

The dataset is too large for my computer's RAM to handle the collapse in R (another issue I can't solve), and when a colleague ran the code, some variables were not returned as means (in some cases, only the data from one row of a particular dyad-year was selected; in others, I'm not even sure what happened). Smaller subsets of the dataset gave correct results.

Because of the issue in R, I want to try doing this in Stata, but when I previously attempted using

collapse (mean) <every variable I wanted a "mean" of, or otherwise wanted to remove from the dataset>, by(year country1 country2)

Stata did not know how to handle strings. I understand so little of Stata that I can't figure out how to resolve this. Could someone please provide the code I would need to use the collapse command on a large number of variables, many of which are strings (for the strings, I want NA returned)?

Answer 1

You can select numeric variables automatically with ds. ds is an official command; findname (Stata Journal) is a user-written successor to ds with more functionality (fact) and a friendlier syntax (author's opinion, although the same author was the last author of ds).

. sysuse auto
(1978 Automobile Data)

. ds, has(type numeric)
price         rep78         trunk         length        displacement  foreign
mpg           headroom      weight        turn          gear_ratio

. findname, type(numeric)
price         rep78         trunk         length        displacement  foreign
mpg           headroom      weight        turn          gear_ratio

In both cases, you will find that the names of numeric variables are returned in r(varlist):

. di "`r(varlist)'"
price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign

so you can feed that list to collapse:

. collapse `r(varlist)', by(year country1 country2)

In general, there is no substitute for reading the help and manual entry for collapse.
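Putting the pieces above together, here is a minimal sketch on toy data. The data and the variable names trade, gdp, and note are made up for illustration; only year, country1, and country2 come from the question. One assumption worth flagging: if a grouping variable such as year is numeric, it will also show up in r(varlist), so it seems safer to remove the by() variables from the list before averaging.

```stata
* Sketch only: toy data; trade, gdp and note are hypothetical variable names
clear all
input year str3 country1 str3 country2 trade gdp str5 note
2000 "USA" "CAN" 10 1 "a"
2000 "USA" "CAN" 20 3 "b"
2001 "USA" "MEX"  5 2 "c"
end

local byvars year country1 country2

* names of all numeric variables (this includes year here)
ds, has(type numeric)
local numvars `r(varlist)'

* remove the grouping variables from the list to be averaged
local tomean : list numvars - byvars

* note is a string outside by(), so collapse does not keep it
collapse (mean) `tomean', by(`byvars')
list
```

String variables are allowed in by(), so country1 and country2 can stay as strings; it is only in the list of variables to be averaged that strings cause trouble.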

Answer 2

If the string variables you are trying to compute a mean for are numbers treated as strings, e.g. "1", "2", etc., then you can convert the variable to numeric type using real() or destring. String variables not in this form, e.g. "alligator", "lizard", "snake", etc., for which you want no mean, will be dropped if they are not included in the collapse.

Example:

clear all
set more off

* some example data
input ///
str4 numstr num str11 reptiles
"234" 234 "alligator"
"2135" 2135 "lizard"
"324" 324 "snake"
end

list

* create numeric variable num2 from the string variable numstr
destring numstr, generate(num2)

* the collapse
collapse (mean) num num2

list
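With many string variables, converting them one by one gets tedious. As a rough sketch (toy data; tradestr and regime are hypothetical names, not from the question), destring can be called without a variable list, in which case it tries every variable and leaves alone those that contain nonnumeric characters:

```stata
* Sketch only: toy data; tradestr and regime are hypothetical variable names
clear all
input year str3 country1 str3 country2 str4 tradestr str9 regime
2000 "USA" "CAN" "10" "democracy"
2000 "USA" "CAN" "20" "democracy"
2001 "USA" "MEX" "5"  "democracy"
end

* try to convert all variables in place; strings with nonnumeric
* content (country1, country2, regime) are reported and left as strings
destring, replace

* list what is still a string after the conversion
ds, has(type string)

* tradestr is numeric now, so it can be averaged;
* regime is not listed and is therefore not kept
collapse (mean) tradestr, by(year country1 country2)
list
```

Using replace overwrites the original strings, so on real data it may be worth working on a copy or using generate() for the variables you care about.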
