R Language Basic Statistics Cheatsheet

I’m not the biggest fan of the documentation for R’s stats package. It’s a little unclear and not worded very kindly for the layperson, or at least for the computer science side of the house.

So I put together a little cheatsheet for some simple statistics, computed with the R language.

Just don’t tell Taleb.

How Distribution Functions in R are Named

While I won’t be covering all of them, it’s useful to know how the naming convention works. Every distribution in R has four functions, each named by adding one of four prefixes to the base name of the distribution.

  • p (“probability”): cumulative distribution function (“what is the probability of landing at or below a cutoff, or above it with lower.tail=FALSE?”)
  • q (“quantile”): inverse CDF (“what value sits at, say, the 80th percentile, with 80% of the distribution at or below it?”)
  • d (“density”): density function (gives us the “height” or y-value of the distribution at a particular value - mainly useful in plotting)
  • r (“random”): random draws from the base distribution (for random sampling)

So for the normal distribution, the base name is norm, thus:

  • pnorm (“probability”): the normal cumulative distribution function (CDF)
  • qnorm (“quantile”): the inverse normal CDF
  • dnorm (“density”): the normal density function (PDF)
  • rnorm (“random”): random draws from a normal distribution
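Here is a minimal sketch of the four prefixes in action on the standard normal distribution. The specific inputs (1.96, 0.975, the seed) are arbitrary choices for illustration, not anything special to R.

```r
# Standard normal distribution: mean = 0, sd = 1 (R's defaults).

# p: cumulative probability of observing a value <= 1.96
p <- pnorm(1.96)     # roughly 0.975

# q: inverse of p; the value below which 97.5% of the distribution lies
q <- qnorm(0.975)    # roughly 1.96

# d: height of the density curve at 0 (the peak of the bell curve)
d <- dnorm(0)        # equals 1 / sqrt(2 * pi)

# r: five random draws; set.seed makes the draws reproducible
set.seed(42)
draws <- rnorm(5)

print(c(p = p, q = q, d = d))
print(draws)
```

Note how p and q undo each other: qnorm(pnorm(x)) gives back x.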

With that said, here are a few quick shortcuts.

Gaussian Normal Distribution (norm)

Below, μ refers to the mean and σ to the standard deviation, and we assume a normal (Gaussian) distribution.

Remember also that the z-score of a value drawn from this normal distribution is simply that value minus the mean, all divided by the standard deviation ($z = \frac{x - \mu}{\sigma}$). This gives us a unit-free measure of how far the value deviates from the mean (and therefore of how likely that observation was).
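To make the z-score concrete, here is a small sketch. The values of mu, sigma, and x are hypothetical numbers I picked for the example.

```r
# Hypothetical example values (chosen only for illustration)
mu    <- 100   # mean
sigma <- 15    # standard deviation
x     <- 130   # observed value

# z-score: how many standard deviations x lies from the mean
z <- (x - mu) / sigma   # (130 - 100) / 15 = 2

# The probability of "x or less" is the same whether we work on the
# original scale or on the z-score scale.
p_raw <- pnorm(x, mean = mu, sd = sigma)
p_z   <- pnorm(z)

print(z)
print(all.equal(p_raw, p_z))
```

This is why the z-score rows of the table need no mean or sd arguments: the standardization has already been done by hand.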

| How to compute | Code |
| --- | --- |
| "Probability of this value or less" | pnorm(value, mean=μ, sd=σ) |
| "Probability of this value or greater" | pnorm(value, mean=μ, sd=σ, lower.tail=FALSE) |
| "Probability of more extreme (greater or less than) observation than value" | 2 * pnorm(abs(value - μ) / σ, lower.tail=FALSE) |
| "Probability of z-score or lower" | pnorm(z) |
| "Probability of z-score or higher" | pnorm(z, lower.tail=FALSE) |
| "Highest value associated with a given percentile (as a fraction, e.g. 0.8)" | qnorm(percentile, mean=μ, sd=σ) |
| "Highest Z-score given a percentile (as a fraction)" | qnorm(percentile) |
| "Random sample of N values, drawn from normal distribution" | rnorm(N, mean=μ, sd=σ) |

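Here is a rough worked example of those shortcuts with concrete numbers. The parameters (mu, sigma, value) are made up for illustration. One thing worth noticing: the two-sided calculation measures distance from the mean, which matters whenever μ is not 0.

```r
# Hypothetical parameters, chosen only for illustration
mu    <- 70
sigma <- 3
value <- 76    # two standard deviations above the mean

lower <- pnorm(value, mean = mu, sd = sigma)                      # P(X <= 76)
upper <- pnorm(value, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X > 76)

# Two-sided: probability of landing at least as far from the mean as
# `value`, in either direction
two_sided <- 2 * pnorm(abs(value - mu) / sigma, lower.tail = FALSE)

# qnorm undoes pnorm: feeding the lower-tail probability back in
# recovers the original value
recovered <- qnorm(lower, mean = mu, sd = sigma)

# Random sample of 1000 draws; the seed makes the run reproducible
set.seed(1)
draws <- rnorm(1000, mean = mu, sd = sigma)

print(c(lower = lower, upper = upper, two_sided = two_sided))
print(recovered)
print(mean(draws))
```

The lower and upper tails sum to 1, and the sample mean of the draws should land close to mu.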
You should note of course that the following three calls all return the same value:

pnorm(1)
pnorm(1, mean=0, sd=1)
pnorm(1, 0, 1)

That is because mean=0 and sd=1 are the defaults, and with the distribution centered at 0 with unit (1) standard deviation, 1 (the value here) is both the raw value and its z-score.
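If you want to check this in an R session, here is one way to do it. R does not accept a chained comparison like a == b == c as a single expression, so this sketch compares the calls pairwise. The variable names are just mine.

```r
# mean = 0 and sd = 1 are the defaults, so all three calls agree
a     <- pnorm(1)
b     <- pnorm(1, mean = 0, sd = 1)
c_pos <- pnorm(1, 0, 1)   # arguments matched by position

# Compare pairwise rather than chaining ==
all_same <- identical(a, b) && identical(b, c_pos)
print(all_same)
```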

Binomial Distribution

And here are a couple of quick ones for the binomial distribution, which, given a constant probability of success p and a number of trials n, gives the probability of exactly k successes.

| How to compute | Code |
| --- | --- |
| "Exactly k successes in n trials given success probability p" | dbinom(k, size=n, prob=p) |
| "k or more successes in n trials given success probability p" | sum(dbinom(k:n, size=n, prob=p)) |

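As a sanity check on the binomial shortcuts, here is a sketch with a hypothetical scenario (10 flips of a fair coin, looking at 7 heads). It compares dbinom against the textbook formula $\binom{n}{k} p^k (1-p)^{n-k}$, and shows that the "k or more" sum can also be written with pbinom.

```r
# Hypothetical scenario: 10 flips of a fair coin
n <- 10
p <- 0.5
k <- 7

# Exactly k successes, via dbinom and via the textbook formula
exact_dbinom  <- dbinom(k, size = n, prob = p)
exact_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# k or more successes: sum the individual probabilities ...
at_least_sum <- sum(dbinom(k:n, size = n, prob = p))
# ... or take the upper tail of the CDF. pbinom's upper tail is
# P(X > q), so pass k - 1 to include k itself.
at_least_cdf <- pbinom(k - 1, size = n, prob = p, lower.tail = FALSE)

print(c(exact = exact_dbinom, at_least = at_least_sum))
```

The k - 1 is the easy thing to get wrong here: unlike the normal case, the binomial is discrete, so "greater than k" and "k or greater" are different questions.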
Other distributions

You can also type help(Distributions) in R to get the full list of supported distributions. R handles many of them, all with the aforementioned four function types.
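To show the same prefix pattern on a few other base names, here is a quick sketch using t (Student’s t), unif (uniform), pois (Poisson), and exp (exponential). The parameter values are arbitrary picks for the example.

```r
# Student's t distribution (base name "t"), with 10 degrees of freedom
t_crit <- qt(0.975, df = 10)        # two-sided 95% critical value
t_back <- pt(t_crit, df = 10)       # round trip back to 0.975

# Uniform distribution on [0, 1] (base name "unif")
u_prob <- punif(0.25)               # P(U <= 0.25) = 0.25

# Poisson distribution (base name "pois") with rate 3
pois_zero <- dpois(0, lambda = 3)   # P(X = 0) = exp(-3)

# Exponential distribution (base name "exp"); seed for reproducibility
set.seed(7)
exp_draws <- rexp(3, rate = 2)

print(c(t_crit = t_crit, u_prob = u_prob, pois_zero = pois_zero))
print(exp_draws)
```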

Fun fact: the Student’s t distribution was conceived by William Sealy Gosset while he was running many brewing experiments for Guinness in Ireland. Sadly, Guinness wouldn’t let him publish under his real name, so he went by the pen name “Student”.

Happy computing!