## Correlation between chocolate consumption and Nobel Prize winners per capita

From the inbox:

I honestly didn’t understand very much of what you told us on Tuesday, but now from this article I understand what a p-value is. You told us that we shouldn’t trust p-values, and this article is probably a good example of why we shouldn’t. I appreciated the relevance anyway!

My response:

I actually believe that the relationship between chocolate consumption and Nobel Prize winners per capita is real, i.e. the p-value is accurate. But the key is that correlation does not imply causation. Instead, I think there is a perfectly rational explanation for why chocolate and Nobel Prize winners per capita are correlated: countries that can afford to buy chocolate can also afford to employ individuals who eventually become Nobel Prize winners. In contrast, a country that is too poor to buy much chocolate has its workforce performing much more immediately beneficial tasks, e.g. farming and manufacturing.

Reply:

I understand what you’re saying. I knew of course that this correlation didn’t imply causation, but didn’t take the time to think about why the two would be related. However, I didn’t even bat an eye at my friend’s comment that Jews have won the most Nobel Prizes, as the correlation of Jewish culture with money and hard work immediately clicked in my head. I still think it is interesting that Switzerland (most chocolate) had the most Nobel Prize winners while a country like China (much less chocolate) has many fewer prize winners. I feel like your explanation is very rational, but I wouldn’t have guessed that the correlation would be quite so close.

My reply:

Two points: 1) the research compared chocolate consumption (per capita?) vs. Nobel Prize winners PER CAPITA. China has had 8 Nobel Prize winners, which is the same number as Spain, but per capita China is much lower. 2) Notice that the author didn’t say anything about how high the correlation is, only that it is significant. You can have a correlation of 0.0001 and still have it be significant (if you have a lot of data). I found the actual article, and you can see the data. The correlation is about 0.80 and the p-value is <0.0001.
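The point that a tiny correlation can be significant with enough data is easy to demonstrate. A minimal sketch with simulated (hypothetical) data, where the effect size 0.01 is an arbitrary choice for illustration:

```r
# With enough observations, even a negligible correlation is
# statistically significant. Simulated data, not the chocolate data.
set.seed(1)
n <- 10^6
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)   # true correlation is only about 0.01
ct <- cor.test(x, y)
ct$estimate                # roughly 0.01
ct$p.value                 # far below 0.05 despite the tiny correlation
```

With a million observations the standard error of the sample correlation is about 0.001, so even a correlation of 0.01 sits roughly ten standard errors from zero.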

## Twitter Weekly Updates for 2012-10-06

- This is the command I always forget when setting up a new git repository: git remote add origin git@github.com:jarad/repositoryname.git #
- Let the passion follow you: http://t.co/zz4GJQIW #

## Detecting trends in low/high abundance species

From the inbox:

What is the best answer to this question posed below as a comment on the technical report we are writing regarding our forest bird trend data? If we have a bias in detecting trends for abundant or common species vs. uncommon or rare species, then I need to state this. I suspect that it is easier to detect a trend for a common species because there are more observations to work with … hence, more difficult for a rare species?

General statistical question – is detection of significant increase more likely than detection of significant decrease due in part to issues of sample size? Given that declining species probably are less common to begin with, wouldn’t it be more difficult to detect significant trends for those species?

My response:

It sounds like two different questions to me:

- Is it easier to detect an increase than to detect a decrease?

No, since the problem is symmetric. Take the exact same dataset and reverse the years: if there is a significant increase in one direction, it will be a significant decrease in the opposite direction.

- Is it easier to detect trends in more abundant species?

The answer is complicated. On one hand, an average increase of one individual per year is easier to detect in a rare species than in a common species. On the other hand, an average increase of 10% in individuals per year is easier to detect in a common species than in a rare species. This is due to the signal-to-noise ratio, which is high for the rare species in the former case but high for the common species in the latter. It is not clear to me that either of these is a fair comparison. I'm sure we could determine the break-even point, which will depend on how rare is rare and how common is common, as well as how many observations are taken for each mean (if I recall correctly, the data point for each year is the mean of all surveys in that season).
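The symmetry argument for the first question can be checked directly. A small sketch with simulated counts (the starting abundance and slope are arbitrary choices):

```r
# Reversing the years flips the sign of the estimated trend
# but leaves the p-value unchanged.
set.seed(1)
year  <- 1:20
count <- rpois(20, 5 + 0.5 * year)            # simulated upward trend
fwd <- summary(lm(count ~ year))$coefficients
bwd <- summary(lm(count ~ rev(year)))$coefficients
fwd[2, c("Estimate", "Pr(>|t|)")]             # positive slope
bwd[2, c("Estimate", "Pr(>|t|)")]             # same p-value, negative slope
```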

A reply:

Does the abundance matter, or is it only the signal-to-noise ratio, as it would seem? I.e. a greater ratio gives greater power regardless of abundance.

My response:

My answer was based on assuming a Poisson model for each survey with a mean that changes from year to year. This mean would effectively be the abundance. Since the Poisson distribution has a variance equal to its mean, the noise (let's define it as the square root of the abundance) is directly related to the abundance. So yes, abundance matters through its effect on the noise.

If abundance starts at 1 and increases by 1 per year, then over 9 years the signal is 9 while the noise ranges from 1 to about 3. In contrast, if abundance starts at 100 and increases by 1 per year, then over 9 years the signal is 9 while the noise is ~10. The former has a signal-to-noise ratio of about 3 while the latter is about 1. But this makes sense, since a one-individual increase is much more meaningful if you started with only 1 than if you started with 100. So to try to make the comparison reasonable, let abundance in the latter case increase by 100 each year (so that the relative increase in the two scenarios is the same). Now over 9 years the signal is 900 while the noise ranges from 10 to about 30, so the signal-to-noise ratio is ~30. The break-even point here is if the high-abundance scenario increases by 30 over the 9 years, because then the signal would be 30, the noise would be ~10, and thus the signal-to-noise ratio about 3.
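The arithmetic above can be spelled out explicitly; these numbers simply restate the scenarios in the text:

```r
# Rare species: abundance 1 -> 10 over 9 years.
signal_rare   <- 9
noise_rare    <- sqrt(c(1, 10))       # Poisson sd, about 1 to 3.2
# Common species with the same relative increase: 100 -> 1000.
signal_common <- 900
noise_common  <- sqrt(c(100, 1000))   # Poisson sd, about 10 to 31.6
signal_rare / noise_rare              # SNR roughly 3 to 9
signal_common / noise_common          # SNR roughly 28 to 90
```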

Let's try a simulation and see what happens. My goal here is to create a situation where the p-value associated with a linear increase in Poisson observations is approximately the same when you start at a Poisson mean of 1 and at a Poisson mean of 100.

Consider data over 10 years.

```
n = 10     # number of years
x = 1:n    # year index
```

Simulate counts for a rare species and perform a regression on year (x).

```
set.seed(1)
lambda = 1:n          # mean increases by 1 each year, starting at 1
y = rpois(n, lambda)  # one simulated count per year
summary(lm(y ~ x))
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.982 -1.136 -0.173 1.395 2.454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.200 1.276 -0.94 0.37435
## x 1.436 0.206 6.99 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.87 on 8 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.842
## F-statistic: 48.8 on 1 and 8 DF, p-value: 0.000114
```

Simulate counts for a common species (using the same random number seed to make the comparison more direct) and perform a regression on year (x).

```
set.seed(1)
lambda = seq(100, 190, length = n)  # mean increases by 10 each year, starting at 100
y = rpois(n, lambda)                # one simulated count per year
summary(lm(y ~ x))
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.29 -6.88 3.21 8.74 9.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.20 7.65 12.58 1.5e-06 ***
## x 9.02 1.23 7.32 8.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 8 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.854
## F-statistic: 53.6 on 1 and 8 DF, p-value: 8.24e-05
```

I needed an increase of 90 rather than 30 for the high-abundance case to match the p-values, but this seems in the ballpark.

The bottom line of this analysis is that detecting an increase of one individual per year when you started at 1 individual is about as hard as detecting an increase of roughly 3 to 10 individuals per year when you started at 100 individuals.
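The trial and error above can be automated by averaging p-values over many simulated datasets rather than comparing single realizations. A sketch, where the candidate slopes of 3, 5, and 10 per year are arbitrary choices:

```r
# Average p-value for a linear Poisson trend over many simulations.
set.seed(1)
n <- 10; x <- 1:n; reps <- 1000
mean_p <- function(lambda) {
  mean(replicate(reps, summary(lm(rpois(n, lambda) ~ x))$coefficients[2, 4]))
}
p_rare   <- mean_p(1:n)                  # +1 per year starting at 1
p_common <- sapply(c(3, 5, 10),          # +3, +5, +10 per year starting at 100
                   function(s) mean_p(100 + s * (x - 1)))
p_rare
p_common   # find the slope whose mean p-value matches p_rare
```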

All of this was based on a single Poisson observation each year. You, however, take multiple surveys per year and use their mean. Taking the mean decreases the noise by a factor of the square root of the number of surveys taken each year. In addition, there is probably more noise than a Poisson model would suggest, due to weather, time of day, time of year, etc. It is not clear how these would ultimately impact your ability to detect a trend.
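The square-root effect of averaging surveys is easy to check by simulation (the abundance and number of surveys here are arbitrary choices):

```r
# The mean of k Poisson(lambda) surveys has sd sqrt(lambda / k):
# averaging k surveys shrinks the noise by a factor of sqrt(k).
set.seed(1)
lambda <- 10; k <- 4
survey_means <- replicate(10^4, mean(rpois(k, lambda)))
sd(survey_means)    # close to the theoretical value
sqrt(lambda / k)    # sqrt(2.5), about 1.58
```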

Another comment from the reply:

I believe the length of the series on the x-axis also matters a great deal, so that more years gives you a lot more power notwithstanding signal-to-noise. But that's a third question.

Agreed. If the increase is consistent, then having more years will give you more power.
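This is also easy to see by simulation: hold the per-year increase fixed and vary the number of years (the starting abundance and slope here are arbitrary choices):

```r
# Mean p-value for a fixed +1-per-year Poisson trend,
# as the length of the series grows.
set.seed(1)
mean_p_years <- function(n, reps = 1000) {
  x <- 1:n
  mean(replicate(reps, summary(lm(rpois(n, 5 + x) ~ x))$coefficients[2, 4]))
}
ps <- sapply(c(5, 10, 20), mean_p_years)
ps   # mean p-value shrinks as the number of years grows
```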

## Twitter Weekly Updates for 2012-09-29

- I've just updated my professional profile on LinkedIn. Connect with me and view my profile. http://t.co/yB2QzW4s #in #

## Twitter Weekly Updates for 2012-09-01

- some thoughts on Bayesian analysis of Big Data http://t.co/OJwWDhqf #

## Twitter Weekly Updates for 2012-08-18

- @hadleywickham becoming a fan of the #testthat R package http://t.co/5AGCRfW8 #

## Twitter Weekly Updates for 2012-08-04

- benchmarking BLAS implementations in R http://t.co/LFo9NPW1 #

## Twitter Weekly Updates for 2012-07-21

- just signed the petition to form the BayesComp section of ISBA http://t.co/kX7cbPAZ #

## Twitter Weekly Updates for 2012-07-14

- vastly different results for "counting the dead" due to infectious diseases and other causes http://t.co/LOedcyyd #
- per 1000 healthy men screened for PSA: 1 death prev, but 1 blood clot, two heart attacks, and 40 impot/incon prod http://t.co/X9R52q4Y #
- Amstat News suggestions for getting the most out of #jsm2012 http://t.co/sp15nzWw #
- whew…finally got scipy to install. used macports http://t.co/uigKIFuI #