## Two articles on understanding statistical error

Today I want to share two articles that call on the public to try to understand scientific error at a deeper level than we do now.

First, an academic journal called *Basic and Applied Social Psychology* (*BASP*) has decided to ban articles using p-values. This was written up in *Nature* News (hat tip Nikki Leger) with an excellent discussion of the good and bad things that might result. On the one hand, p-values are thoroughly gamed and too easy to achieve through repeated testing, which skews the published corpus toward results produced that way. On the other hand, if you get rid of p-values you have to replace them with something that gives you an idea of whether a statistical result is interesting. Of course there are plenty of other measures out there, but they too may quickly become gamed.
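A toy simulation (mine, not from the *Nature* piece) makes the repeated-testing problem concrete. Under a true null hypothesis a p-value is uniform on [0, 1], so re-testing the same dead hypothesis twenty times gives better-than-even odds of at least one "significant" hit:

```python
import random

random.seed(42)

def chance_of_false_positive(n_tests, alpha=0.05, trials=10_000):
    """Under a true null, each test's p-value is uniform on [0, 1].
    Estimate how often at least one of n_tests tests comes out
    'significant' purely by chance."""
    hits = 0
    for _ in range(trials):
        if any(random.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / trials

# One test: about a 5% false-positive rate.
# Twenty tries at the same null: roughly 1 - 0.95**20, i.e. about 64%.
print(chance_of_false_positive(1))
print(chance_of_false_positive(20))
```

The closed-form answer, 1 − (1 − α)^n, is what the simulation is estimating; the point is just how fast it climbs with n.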

Second, The Big Story has an in-depth article on evidence-based sentencing and paroling models and what can go wrong there (hat tip Auros Harman). They focus on the fact that the people filling out the questionnaires can and do lie in order to game their scores and leave jail earlier. They also point out that the scores are highly sensitive to small changes in input, specifically age. Finally, the models punish people for being poor, for "hanging out with the wrong crowd," or even for having parents who went to jail.

I don’t see what sense this makes as a policy. If BASP wants to require some other statistical measure, that might be reasonable. But there has to be some method for distinguishing legitimate effects from effects due purely to chance; otherwise, you should expect to end up publishing even more chance results than at present. Also, apparently authors can still submit papers with p-values, and reviewers can use them in deciding whether to accept a paper; it’s just the readers who are protected from that dangerous knowledge.

A famous particle physicist, John Bahcall, said around 2000, “Half of all three sigma results are wrong. Look at history.” He was talking about experiments in particle physics, where the models are more rigorous and the understanding of statistical analysis deeper than I think is true in almost any area of psychological or biomedical science. And in those other areas there are basically no results with three sigma confidence levels. (That translates to p a whole hell of a lot less than .05 or even .005.) As the observation caught on in physics, the field moved to demanding 5 sigma confidence levels, though it seems rather unclear why that should solve the problem. Reports from friends in physics indicate that the standards for statistical analysis remain in a bit of flux and that there isn’t really a consensus on the fundamental nature of the problem. I presume there is some gaming, and more models that are not quite right. I think the real problem for psychology is that very few people in the field understand that a p value only measures the probability of the observed result provided some mathematical model of the experiment is correct. If the model is wrong, the p value is just garbage. Since that caveat is lost on these folks, banning them from using p values doesn’t seem absurd.
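To put the comment's sigma levels in p-value terms: assuming a normal (Gaussian) model, the two-sided p-value corresponding to a given sigma level comes straight from the complementary error function. A minimal sketch:

```python
import math

def two_sided_p(sigma):
    """Two-sided p-value for a result `sigma` standard deviations
    from the null, under a normal (Gaussian) model."""
    return math.erfc(sigma / math.sqrt(2))

# 2 sigma is roughly the familiar p < .05 bar; 3 sigma is
# p ~ .003, and 5 sigma is smaller by several more orders
# of magnitude.
for s in (2, 3, 5):
    print(f"{s} sigma -> p ~ {two_sided_p(s):.2g}")
```

This is exactly the sense in which three sigma is "a whole hell of a lot less than .05": about .003 two-sided, and yet, per Bahcall, half such results were still wrong.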

I know the p-value debate will probably be around for quite a while. You see p-values manipulated all the time; recently I wondered what UnitedHealthcare did with the p-values for the morcellator tool used in hysterectomies to substantiate its new rule that all those procedures will need prior approval unless it’s a vaginal procedure done on an outpatient basis. Yeah, insurers know how to fiddle with p-values for sure. A year or two ago there was a PLOS ONE article about the signs to look for that a p-value has been fiddled with.

The level of confidence in Gaussian-based statistical analysis is surely orders of magnitude higher in well-understood physical modeling than it is in speculative arenas polluted by human behavior, such as corporately corrupted biomedical psychoactive-drug testing and hedge-funded contingent-claim financial derivatives.

Even in the arena of highly deterministic aerospace sciences, however, there are instances where uncertainty in physical modeling requires statistical analysis. One such case was orbiting the Moon, whose unearthly gravitational field is dominated by the variability of buried meteorites. The highly successful Kalman-Bucy solution of the ’60s was based on both frequentist and Bayesian concepts, and it has been a mainstay, often unnoticed, of valid physical analysis ever since.
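For readers who haven't met it, here is a purely illustrative sketch of the kind of blending the comment describes: a scalar (discrete-time) Kalman filter that fuses a model's prediction with each noisy measurement, weighting by their relative uncertainties. This is the textbook toy, not the actual continuous-time Kalman-Bucy formulation, and all the numbers are made up:

```python
import random

random.seed(0)

def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Minimal scalar Kalman filter: track a roughly constant quantity
    from noisy measurements, blending the model's prediction with each
    new observation via the Kalman gain."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the model says the state persists; uncertainty grows.
        p += process_var
        # Update: the gain k weighs the measurement against the prediction.
        k = p / (p + meas_var)
        x += k * (z - x)
        p *= (1 - k)
        estimates.append(x)
    return estimates

true_value = 10.0
noisy = [true_value + random.gauss(0, 2.0) for _ in range(200)]
est = kalman_1d(noisy, meas_var=4.0, process_var=0.01)
print(round(est[-1], 2))  # should settle near 10.0
```

The gain `k` is where the frequentist and Bayesian ingredients meet: it is exactly the posterior weighting of prior prediction against new data.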

You may have already noticed that Gelman has a new post touching a bit on this subject:

http://andrewgelman.com/2015/03/02/what-hypothesis-testing-is-all-about-hint-its-not-what-you-think/

Anyway, more generally, you raise the whole issue of “gaming” a system, which seems ubiquitous these days — I especially think of gaming standardized tests taken by primary/secondary students (or ‘teaching to the test’). ‘Gaming,’ or manipulating things, seems to have drifted over from the business world to now infiltrate everything. And I wonder if “gaming” isn’t a necessary component of long-term success these days 😦

Taking trickster folk tales and myths as evidence, I guess that gaming a system is universal and as old as humanity.

I also like this spin, presented as principle 4: Principles of Economics, Translated

It seems to me that what you really want is documentation of the entire process that led to a result: if your analysis sifted through complex data in a way that permitted up to a thousand possible relationships, and it turned up a handful of things that look like relationships, we can point out that this is exactly what you should expect, even if the relationships are entirely phantasmagorical. The problem is that nobody keeps track of all of their negative results, especially not the stuff that never even rises to the level of conscious rejection but instead gets screened out algorithmically.
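A toy simulation (purely illustrative, with made-up numbers) shows both halves of this comment: sifting a thousand candidate relationships that are all pure noise yields a few dozen phantom "findings" at p < .05, while documenting the full search lets you apply, say, a Bonferroni correction that kills nearly all of them:

```python
import random

random.seed(7)

N_CANDIDATES = 1000   # relationships sifted, all pure noise
ALPHA = 0.05

# Under the null, each candidate's p-value is uniform on [0, 1].
p_values = [random.random() for _ in range(N_CANDIDATES)]

naive_hits = sum(p < ALPHA for p in p_values)

# Documenting the entire search (all 1000 tests) lets you demand
# p < ALPHA / N_CANDIDATES instead (Bonferroni correction).
corrected_hits = sum(p < ALPHA / N_CANDIDATES for p in p_values)

print(naive_hits)      # roughly ALPHA * N_CANDIDATES, i.e. about 50
print(corrected_hits)  # usually 0
```

The correction only works because the full process was recorded, which is the commenter's point: the screened-out negatives are part of the statistics whether or not anyone wrote them down.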
