This message is mainly for folks in the ACSM biostats interest group. It represents a clarification and resolution of an issue I brought up in a message on June 29 (Subject: Means of raw vs back-transformed variables).
I was concerned about the way stats packages like the Statistical Analysis System (SAS) use log transformation by default as part of so-called generalized linear modeling when the dependent variable is a frequency. I thought that log transformations might not be the most appropriate. Dwight The replied on July 1 (Subject: Will's questions/comments on GLM & transformations), basically telling me not to worry and giving a list of various transformations that might be appropriate for certain kinds of data.
I still wasn't convinced, and I couldn't really understand the documentation for Proc Genmod in SAS (can anyone?), so I decided to do some simulating to compare the performance of various Genmod approaches for analysis of two groups differing in the frequency of something (e.g., incidence of injury in two sports). With this sort of problem, the aim is to determine how accurately the stats procedure works out the confidence limits for a particular outcome statistic. So I analyzed several thousand data sets, each set consisting of counts of something in two groups. Each set had the same true proportions of injured players for the two groups (e.g. 0.3 in one sport, 0.2 in the other) and the same sample size (e.g. 30 in one sport, 100 in the other). In the simulation the observed counts of injury differed from set to set because of sampling variation. So, for each sample there was a different answer for the comparison of the proportions. A stats procedure gives you the answer for the SAMPLE and the confidence limits for the TRUE answer. If the stats procedure is doing its job well, the 90% (say) confidence limits will enclose the true value in 90% of the simulations. Or to put it another way, the Type 0 error--the frequency of occasions when the true value falls outside the confidence limits--should be 10%.
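For anyone who wants to try this kind of simulation without SAS, here is a minimal sketch in Python. It is not the program I ran: the function names are made up for illustration, and the interval is a simple hand-rolled Wald (normal-approximation) interval for the difference in proportions, not anything Genmod produces. It only shows the logic of generating many samples and counting how often the confidence limits miss the true value.

```python
import random
from statistics import NormalDist


def simulate_count(rng, n, p):
    """Number of injured players out of n, each injured with true probability p."""
    return sum(rng.random() < p for _ in range(n))


def wald_diff_ci(x1, n1, x2, n2, level=0.90):
    """Normal-approximation confidence limits for the difference in proportions.

    Illustrative stand-in only; this is an assumption, not a Genmod method.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    diff = p1 - p2
    return diff - z * se, diff + z * se


def miss_rate(p1, n1, p2, n2, n_sims=2000, level=0.90, seed=1):
    """Fraction of simulated data sets whose limits fail to enclose the true difference."""
    rng = random.Random(seed)
    true_diff = p1 - p2
    misses = 0
    for _ in range(n_sims):
        x1 = simulate_count(rng, n1, p1)
        x2 = simulate_count(rng, n2, p2)
        lo, hi = wald_diff_ci(x1, n1, x2, n2, level)
        if not (lo <= true_diff <= hi):
            misses += 1
    return misses / n_sims


if __name__ == "__main__":
    # True proportions 0.3 and 0.2, sample sizes 30 and 100, as in the example above.
    print(miss_rate(0.3, 30, 0.2, 100))
```

If the interval method is doing its job, the printed rate should sit near 0.10 for 90% limits. Swapping in a different interval function lets you compare methods on the same footing.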
OK, I've done it for various proportions of injury and various sample sizes in the two groups, and for various kinds of Genmod analysis. To use Genmod, you specify the usual kind of linear statistical model (in this case, a model that predicts the proportion of injuries in each group). You also specify the distribution for the dependent variable, the proportion of injuries. In this case it's binomial, but Poisson also works in one extreme. The transformation in Genmod is specified as a "link function". The linear model applies to the transformed proportion, so once you have derived an estimate for the difference between the groups, you back-transform the estimate to a relative risk or an odds ratio, depending on whether you used the log or the logit link function. Or you can just keep the estimate as a difference in proportions, if you use the "identity" link function, which means no transformation at all. Why can't the people who write the documentation state it all in simple terms like this?
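To make the link-function idea concrete, here is a small Python sketch of the arithmetic only (no model fitting, and nothing to do with Genmod's internals). The function names are mine, chosen for illustration. Each link transforms the two proportions, the group difference is taken on the transformed scale, and the log and logit differences are back-transformed by exponentiating.

```python
import math


def identity(p):
    """Identity link: no transformation."""
    return p


def log_link(p):
    """Log link: differences back-transform to a relative risk."""
    return math.log(p)


def logit(p):
    """Logit link: differences back-transform to an odds ratio."""
    return math.log(p / (1 - p))


def effect_estimates(p1, p2):
    """Compare two proportions on each scale (both must lie strictly between 0 and 1)."""
    return {
        "difference": identity(p1) - identity(p2),
        "relative_risk": math.exp(log_link(p1) - log_link(p2)),
        "odds_ratio": math.exp(logit(p1) - logit(p2)),
    }


if __name__ == "__main__":
    # Proportions of 0.3 and 0.2, as in the two-sports example.
    for name, value in effect_estimates(0.3, 0.2).items():
        print(name, round(value, 3))
```

For proportions of 0.3 and 0.2 the three answers are a difference of 0.1, a relative risk of 1.5 and an odds ratio of about 1.71: the same data, expressed three ways.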
I also used the unequal-variances t statistic directly on the dependent variable scored as 0 for no injury and 1 for injury, an approach that works when there are enough counts for the central limit theorem to kick in. I threw in the equal-variances t statistic, because I knew it would fail sometimes.
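Here is a rough Python sketch of the unequal-variances t statistic applied to 0/1 scores, using only the standard library. The function name and return values are my own choices for illustration. It returns the difference in proportions, the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into confidence limits needs a t-distribution quantile, which the standard library does not provide, so that step is left out.

```python
from statistics import mean, variance


def welch_t(x1, n1, x2, n2):
    """Unequal-variances t statistic for injury scored 0/1 in two groups.

    x1, x2 are counts of injured players; n1, n2 are group sizes.
    Assumes at least one group has a mix of 0s and 1s, otherwise the
    standard error is zero and the statistic is undefined.
    """
    g1 = [1] * x1 + [0] * (n1 - x1)
    g2 = [1] * x2 + [0] * (n2 - x2)
    # Squared standard error of each group mean.
    v1 = variance(g1) / n1
    v2 = variance(g2) / n2
    diff = mean(g1) - mean(g2)
    t = diff / (v1 + v2) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, t, df


if __name__ == "__main__":
    # 9 injured out of 30 versus 20 injured out of 100.
    print(welch_t(9, 30, 20, 100))
```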
Here are the findings.
The equal-variances t statistic was OK when injury frequencies were similar, but the Type 0 error was very wrong for some unequal rates and sample sizes. The unequal-variances t statistic performed incredibly well, except when the true value of at least one of the frequencies in the two groups was really low, ~4 or less.
I was hoping Genmod would work perfectly with a binomial distribution and any of the link functions. I tried the identity link function (to model the difference in proportions of injury in the two groups), the logit link function (to model the odds ratio for the comparison of the injuries) and the log link function (to model the relative risk). All did indeed work well when the sample sizes were large (~100) and the proportions of injury were high (~0.5). But the identity link came to grief with low frequencies, partly because some of the analyses "failed to converge". I haven't yet checked if there is a way to tweak Genmod to make it behave better in such circumstances. If anyone knows, please tell me. Surprisingly, for some low frequencies of injury, the log and logit links both produced Type 0 errors of 8-9% instead of the expected 10%, indicating that the confidence limits were a bit wider than they needed to be. Slightly larger P values for these analyses than for the unequal-variances t statistic fit with this finding. Again, there might be a tweak for Genmod that makes it give more exact confidence intervals.
Choosing the Poisson distribution for Genmod (and the log link function) produced Type 0 errors that were way off, except for large sample sizes and low proportions of injury. This finding fits with the fact that the Poisson distribution is the limiting case of the binomial distribution when sample size tends to infinity and the proportion of injury (or whatever) tends to zero, with the expected count held fixed. You then get a finite count of injuries with no upper limit.
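The binomial-to-Poisson relationship is easy to check numerically. The sketch below (plain Python, with function names I made up for illustration) compares the two probability distributions for the same expected count of 9 injuries: once with a small sample and a high proportion, once with a large sample and a low proportion. Probabilities are computed on the log scale to avoid overflow with large n.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k injuries in n players (0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def poisson_pmf(k, lam):
    """Poisson probability of k injuries when the expected count is lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def max_pmf_gap(n, p):
    """Largest absolute difference between binomial(n, p) and Poisson(n * p)."""
    lam = n * p
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1))


if __name__ == "__main__":
    print(max_pmf_gap(30, 0.3))      # small sample, high proportion
    print(max_pmf_gap(1000, 0.009))  # large sample, low proportion
```

The gap is much smaller in the second case, which is the situation where the Poisson analysis behaved itself in my simulations.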
In summary, it would appear that the choice of link function depends on how you want to express your outcome: as a difference in proportions, as an odds ratio, or as a relative risk. Choice of link function clearly doesn't bias the effect statistic or give false confidence about its precision for a simple model like the comparison of injury in two groups. But choice of the link function WILL matter when you include covariates like age, because the link function will dictate whether it is the proportion, the odds ratio, or the relative risk that changes per unit of the linear or polynomial covariate. Sounds right?
Contact me if you want a copy of the program and the listing of the simulations I saved.
Will