Statistics

Notes

Statistics is the practice of summarizing data and drawing conclusions from it while measuring how reliable those conclusions are. This helps us understand not only the traits of our data (for example, the average height in a sample), but also how well our sample represents the population it was drawn from.

Common Problems

Common issues we might face while analyzing our data, specifically those that challenge the validity of our findings:

  1. Correlation is not causation - A correlation between variables doesn't by itself establish a causal link.
  2. Garbage in garbage out - A model is only as good as the inputs provided.
  3. Skewness - The data is distributed asymmetrically around its center, with a longer tail on one side, which pulls the mean away from the median.
  4. Bias and Variance - Our results might be biased (systematically off target), have high variance (change a lot from sample to sample), or both.
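To make the skewness point concrete, here is a minimal sketch that computes the moment-based skewness coefficient for two small made-up datasets. The function name `moment_skewness` and both datasets are illustrative assumptions, not part of the original notes.

```python
import statistics


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form).

    Roughly 0 for symmetric data, positive for a long right tail,
    negative for a long left tail.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(moment_skewness(symmetric))     # symmetric data: 0.0
print(moment_skewness(right_skewed))  # positive: long right tail

# In right-skewed data the mean is pulled above the median.
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```

In the skewed dataset a single large value drags the mean well above the median, which is why the median is often the safer summary for skewed data.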

Similarity

Most statistical tests require a hypothesis (Hypothesis Testing), meaning that we make an assumption about our data (for example, that group x and group y have similar distributions) and then check whether the observed results are consistent with it or whether it should be rejected. Note that a hypothesis can only be rejected or fail to be rejected; it can never be proven.

We relate sample estimates to the population by using the P Value, which measures how rare our results would be if the hypothesis were true: it is the probability of seeing results at least as extreme as ours under that hypothesis. The lower the P value, the less plausible it is that the difference we observed arose by chance alone. We can also create a Confidence Interval to show a range of plausible values for the actual value. For example, a 90% confidence interval of 1 to 1.5 comes from a procedure that captures the actual value in about 90% of repeated samples.
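As a rough illustration of both ideas, the sketch below estimates a P value with a permutation test and a 90% confidence interval with a percentile bootstrap. These are just two of several possible techniques, picked because they need only the standard library. The function names, the measurements, and the resample counts are all assumptions made for the example.

```python
import random
import statistics


def permutation_p_value(group_x, group_y, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_x) - statistics.fmean(group_y))
    pooled = list(group_x) + list(group_y)
    n_x = len(group_x)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 counts the observed labelling itself and avoids a P value of 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def bootstrap_mean_ci(sample, confidence=0.90, n_resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements for two groups.
group_x = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_y = [6.1, 6.3, 5.9, 6.2, 6.0, 6.4]

p_value = permutation_p_value(group_x, group_y)
ci_low, ci_high = bootstrap_mean_ci(group_x)
print(f"p value: {p_value:.4f}")
print(f"90% CI for the mean of group_x: ({ci_low:.2f}, {ci_high:.2f})")
```

Because the two groups here do not overlap at all, very few random relabellings produce a difference as large as the observed one, so the P value comes out small.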

Distribution

A good way to see the distribution of a variable across a sample is to use Histograms. The shape can also suggest which Statistical Distribution the variable follows. Whatever that distribution is (as long as its variance is finite), if we repeatedly draw samples and compute their means, then according to the central limit theorem the sample means will be approximately normally distributed once the samples are large enough. Bootstrapping imitates this repeated sampling by resampling with replacement from the one sample we have. Together these let us quantify how far a random sample's mean is likely to be from the population mean. A related idea is Regression towards the mean: an extreme observation tends to be followed by one closer to the mean. Separately, the law of large numbers says that the mean of a large enough sample settles near the population mean.
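The central limit theorem can be checked with a small simulation. The sketch below draws repeated samples from an exponential distribution (strongly skewed, with mean 1) and looks at the distribution of the sample means. The choice of distribution, the sample size of 50, and the helper names are assumptions made for illustration.

```python
import random
import statistics


def draw_exponential(rng):
    """One draw from a skewed distribution with population mean 1."""
    return rng.expovariate(1.0)


def sample_means(draw, sample_size, n_samples, seed=0):
    """Draw n_samples independent samples and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean([draw(rng) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


means = sample_means(draw_exponential, sample_size=50, n_samples=2000)
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")     # close to the population mean, 1
print(f"spread of sample means: {spread:.3f}")   # close to 1 / sqrt(50)
print(f"share within one sd: {within_one_sd:.3f}")
```

Even though single draws are heavily skewed, the sample means cluster symmetrically around 1, and their spread shrinks as the sample size grows.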

Probability

While the world is full of Randomness, we can still calculate the probabilities of events happening. It is worth distinguishing between Probability vs Likelihood: probability describes how plausible an outcome is under a fixed model, while likelihood describes how well a model explains outcomes we have already observed. We should also consider how a probability changes once we acquire new information or context (conditional probability).
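A small worked example of conditional probability, using Bayes' rule to update a probability after new information arrives. The scenario (a screening test) and all three input numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from fractions import Fraction


def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' rule.

    P(C | +) = P(+ | C) * P(C) / P(+), where
    P(+) = P(+ | C) * P(C) + P(+ | not C) * P(not C).
    """
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


# Hypothetical numbers: 1% of people have the condition, the test detects
# 90% of true cases, and it wrongly flags 5% of healthy people.
prior = Fraction(1, 100)
updated = posterior(prior, Fraction(9, 10), Fraction(5, 100))

print(updated)         # 2/13
print(float(updated))  # roughly 0.154
```

A positive result raises the probability from 1% to about 15%, not to 90%: because the condition is rare, most positive results still come from the much larger healthy group.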

Books

Calling Bullshit (book)

YouTube

StatQuest

Courses

Statistics for Data Science (course)

Other MOC

Overview
