A collection of Stata commands and tricks that I use often enough to need but not often enough to remember, and which I figured other folks might also find useful.
Creating Dummy Variables in One Line
This one comes from In Case of Econ Struggles. Instead of (in the auto dataset)
sysuse auto, clear
gen good_car = 1 if mpg >= 25
replace good_car = 0 if good_car == .
instead write one line of code
gen good_car2 = (mpg >= 25)
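One major disadvantage is that this is vulnerable to missing data issues. Specifically, as is well known, Stata (bafflingly) treats missing values as larger than any number, in effect positive infinity. If there are missing values in the mpg variable, then the code will nevertheless treat them as having an mpg equal to or over 25. (The two-line version above has the same problem, since its if condition is the same comparison.) Simply changing to (mpg >= 25 & !missing(mpg)) is better, as it now returns a 0 rather than a 1 for this indicator, and missing() also catches the extended missing values .a through .z, which mpg != . does not. But that may still give false confidence that we know what we are doing: a car whose mpg is unknown is now recorded as definitely not a good car.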
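To see the problem concretely, here is a minimal sketch. Since mpg has no missing values in the auto data, I blank out the first observation purely for illustration; the variable names naive and guarded are made up for this example.

```stata
sysuse auto, clear

* Introduce one missing value purely for illustration (mpg is complete in auto)
replace mpg = . in 1

* Missing counts as larger than 25, so the naive indicator is 1 for that car
gen byte naive = (mpg >= 25)

* Guarding against missing turns that 1 into a 0 (not into a missing value)
gen byte guarded = (mpg >= 25 & !missing(mpg))

list make mpg naive guarded in 1
```

Neither version leaves the indicator missing for that car, which is the motivation for the alternatives that follow.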
mpg is a bad example because it’s complete, so I’ll switch to rep78 (note: does anyone know what rep78 refers to?). If we want to preserve missing values as missing, then, one alternative would be
gen good_rep = cond(missing(rep78), ., cond(rep78 > 2, 1, 0))
using the little-known (at least to me; fifteen years of using the program and I’d never encountered it) cond() function, which works pretty much like the Excel IF function. The syntax is just cond(x, a, b [, c]), where x is an expression (could be a compound expression), a is the value returned if x is true, and b is the value returned if x is false. Usefully, the optional c is the value returned when x itself evaluates to missing; less than usefully, “If the first argument to cond() is a logical expression, that is, cond(x>2,50,70,.), the fourth argument is never reached.” A logical expression such as x>2 always evaluates to 0 or 1, never to missing, which is why the missing check has to be written out explicitly.
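A quick sketch of that caveat, using rep78 (which does have missing values in the auto data). The variable names four_arg and explicit are placeholders of mine, not anything from the Stata documentation.

```stata
sysuse auto, clear

* The first argument is a logical expression, so the fourth argument (.) is
* never used; cars with missing rep78 come out as 1 because missing counts
* as larger than 2
gen byte four_arg = cond(rep78 > 2, 1, 0, .)

* Checking for missing explicitly keeps those cars missing
gen byte explicit = cond(missing(rep78), ., rep78 > 2)

* Cross-tabulate the two, showing the missing cells
tabulate four_arg explicit, missing
```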
At this point, the one-line solution starts to look, to me at least, a little clunky and hard to remember compared to a two-line solution in which we simply blank out the new variable if the original one is missing:
gen good_rep2 = (rep78 > 2)
replace good_rep2 = . if rep78 == .
but you could imagine this being more useful if you had a more complicated setup.
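As a sketch of what such a more complicated setup might look like, nested cond() calls can build a three-level variable while keeping missing values missing. The cutoffs, the variable name rep_cat, and the labels Poor/Average/Good are all invented for illustration.

```stata
sysuse auto, clear

* Three categories from rep78; cars with missing rep78 stay missing
gen byte rep_cat = cond(missing(rep78), ., ///
                   cond(rep78 <= 2, 1,     ///
                   cond(rep78 == 3, 2, 3)))

* Hypothetical labels, just to make the tabulation readable
label define rep_cat_lbl 1 "Poor" 2 "Average" 3 "Good"
label values rep_cat rep_cat_lbl

tabulate rep_cat, missing
```

Doing the same with gen and replace would take one line per category plus one for the missing values, so this is roughly where the one-liner starts to pay for itself.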
Extracting Variable and Value Labels
Stata encourages good labeling practices, but it can be difficult to extract the labels that analysts have so carefully put together. The extended macro functions variable label and label (the latter for value labels) can be useful in getting that information back into a usable form (note that the plot at the end requires the coefplot package):
sysuse auto, clear
* ssc install coefplot
* (uncomment the line above if you need to add the most useful extension...)
local xlab : variable label foreign
local ylab : variable label mpg
local x1lab `: label (foreign) 1'
local x0lab `: label (foreign) 0'
qui: reg mpg foreign
coefplot, title("Effect of `xlab' on `ylab'") drop(_cons) xtitle("Difference between `x1lab' and `x0lab'")
(This is the bit of code I found myself looking up sufficiently often to write this page up, by the way.)
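If you want every value label rather than just two of them, one possible approach (a sketch, not the only way) is to look up the name of the value label attached to the variable and then loop over the variable's levels. The macro names vlname, levels, and txt are my own choices.

```stata
sysuse auto, clear

* Name of the value label attached to foreign
local vlname : value label foreign

* Collect the distinct values of foreign, then look up the text for each
levelsof foreign, local(levels)
foreach lev of local levels {
    local txt : label `vlname' `lev'
    display "`lev' = `txt'"
}
```

This assumes the variable actually has a value label attached; if it does not, vlname comes back empty and the lookup inside the loop will fail.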
Creating Composite Categorical Variables
This is one of many useful tips by Nicholas J. Cox, the British geographer who deserves the Order of Lenin for his efforts to keep the knowledge proletariat going.
As Cox writes: “If you have two or more categorical variables, you may want to create one composite categorical variable that can take on all possible joint values.”
He suggests (again using the auto dataset)
sysuse auto, clear
egen both = group(foreign rep78), label
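One thing to watch, tying back to the missing-data theme above: by default group() leaves the new variable missing for any observation where one of the inputs is missing. Adding the missing option treats missing as its own category instead. A small sketch, with both_m as a made-up variable name:

```stata
sysuse auto, clear

* Default: cars with missing rep78 get a missing group
egen both = group(foreign rep78), label

* With the missing option, missing rep78 forms its own group(s)
egen both_m = group(foreign rep78), missing

* Compare how many observations end up without a group
count if missing(both)
count if missing(both_m)
```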
inlist() and inrange()
I can’t believe it took me so long to learn about inlist() and inrange(). Via Todd R. Jones’s online booklet:
keep if inlist(state, "AL", "AK", "AZ")
* is equivalent to
keep if state=="AL" | state=="AK" | state=="AZ"
keep if inrange(distance, 10, 91)
* is equivalent to
keep if distance>=10 & distance<=91
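The state and distance variables above are not in any shipped dataset, so here is a runnable sketch on the auto data instead. It also shows a handy side effect: both functions return 0 when the variable being tested is missing, so there is no positive-infinity surprise. The variable names mid_rep and odd_rep are made up.

```stata
sysuse auto, clear

* 1 if rep78 is 2, 3, or 4; 0 otherwise, including when rep78 is missing
gen byte mid_rep = inrange(rep78, 2, 4)

* 1 if rep78 is 1, 3, or 5; 0 otherwise, including when rep78 is missing
gen byte odd_rep = inlist(rep78, 1, 3, 5)

tabulate rep78 mid_rep, missing
```

As with the guarded comparison earlier, a 0 for a missing rep78 is not the same as a missing indicator, so blank it out afterwards if that distinction matters.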
strpos() to Create New Variables
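When creating new indicator variables from string variables, strpos() allows much greater flexibility and avoids copying errors. strpos(s1, s2) returns the position at which s2 first appears in s1, or 0 if it does not appear, so it can be used directly as a true/false condition. Danger: you have to make sure the substring you search for distinguishes exactly the categories you want.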
replace mobile = 1 if strpos(meta_operatingsystem,"Android") | ///
strpos(meta_operatingsystem,"iPad") | ///
strpos(meta_operatingsystem,"iOS") | ///
strpos(meta_operatingsystem,"iPhone")
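To make the danger concrete, here is a self-contained sketch on a four-row toy dataset. The operating system strings are invented for illustration, as is the variable name mobile_loose; only meta_operatingsystem and mobile come from the snippet above.

```stata
clear
input str20 meta_operatingsystem
"Android 13"
"iOS 17"
"Windows 11"
"macOS 14"
end

* Too broad: "OS" also matches the desktop "macOS" row
* (strpos() is case-sensitive, so "Windows" does not match)
gen byte mobile_loose = strpos(meta_operatingsystem, "OS") > 0

* Narrower substrings pick out only the two mobile rows
gen byte mobile = strpos(meta_operatingsystem, "Android") > 0 | ///
                  strpos(meta_operatingsystem, "iOS") > 0

list, noobs
```

Generating the indicator in one step, rather than with replace, also avoids the need to create mobile beforehand.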
