Explaining Linear Regression with Hypothesis Testing and Confidence Interval

Khem Sok
5 min read · Jun 13, 2020
“A close examination of random samples can be one of the most effective means of making sense of something too complex to be comprehended directly.”

Hypothesis Testing

Hypothesis testing is used to assess how likely or unlikely an observed result is under an initial assumption about the population.

For example, let’s survey a sample of 100 high school students on whether they prefer vanilla or chocolate ice cream. From that sample, we can calculate sample statistics. Let’s say 60%, or 0.6, of the students in the sample prefer chocolate over vanilla. Now we want to see how likely a sample like ours is to occur given some initial assumption. If our initial assumption is that 55%, or 0.55, of students prefer chocolate over vanilla, then hypothesis testing helps us answer the following question:

How likely is a sample proportion of 0.6 to occur, given that the true proportion of all students who prefer chocolate over vanilla is 0.55?

If the likelihood of observing our sample proportion (the p-value) is lower than the preconfigured significance level (alpha), then we can reject our initial assumption, the null hypothesis. This is the basis of hypothesis testing.
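The numbers from the example above can be plugged into a one-sample z-test for a proportion. This is a minimal sketch in plain Python; the specific test (normal approximation, one-sided) is my assumption, since the article doesn’t name one.

```python
import math

# Worked numbers from the ice-cream example above:
# n = 100 students, sample proportion 0.60, null hypothesis p0 = 0.55.
n = 100
p_hat = 0.60
p0 = 0.55

# Standard error of the sample proportion under the null hypothesis.
se = math.sqrt(p0 * (1 - p0) / n)

# z-score: how many standard errors the observed proportion lies
# from the hypothesized value.
z = (p_hat - p0) / se

# One-sided p-value, P(Z >= z), via the complementary error function
# (no SciPy dependency needed).
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

With alpha = 0.05, the p-value of roughly 0.16 is not below the significance level, so for these particular numbers we would fail to reject the null hypothesis.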

Confidence Interval

A confidence interval is used to estimate a range for the true population parameter.

For example, suppose an election is going on and we would like to know the proportion of people who will vote for candidate A versus candidate B. We can’t possibly survey the entire population, so we take a random sample of 100 people. Let’s say 55%, or 0.55, of the people we randomly sampled like candidate A. We can’t conclude that 0.55 of the entire population likes candidate A over candidate B, because this is just our sample proportion.

A confidence interval helps us calculate a range where the true population proportion might lie, given some confidence level. Let’s say we calculated a 99% confidence interval on the sample proportion and it gave us a range of 0.45–0.65. How can we interpret this?

With 99% confidence, 55% of the population likes candidate A over candidate B with a margin of error of 10%.
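For concreteness, here is how the textbook normal-approximation interval is computed for these numbers. (The 0.45–0.65 range above appears to be rounded for illustration; the standard formula gives a slightly wider interval.)

```python
import math

# Election example from above: n = 100 voters, 55% favour candidate A.
n = 100
p_hat = 0.55

# Two-sided critical value for a 99% confidence level (standard normal).
z_crit = 2.576

# Standard error of the sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Margin of error and the resulting interval.
margin = z_crit * se
lower, upper = p_hat - margin, p_hat + margin

print(f"99% CI: ({lower:.3f}, {upper:.3f})")
```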

A confidence interval gives us the ability to estimate the true population parameter with some confidence based upon our sample data. I want to make it clear that a confidence interval cannot be interpreted as the following:

There is a 99% probability that the true population proportion falls within 0.45–0.65.

What the 99% is saying:

If you were to draw a sample 100 times and use the same technique to calculate a confidence interval each time, about 99 of those confidence intervals would contain the true population proportion.

I want to emphasize that the two statements are very different.

How are we able to estimate the population proportion like this? The law of large numbers is the reason why. If we were to take 10,000 samples of size n, calculate the proportion of each sample, and plot them, we would get the sampling distribution of the sample proportion. And the law of large numbers tells us that as the samples grow, the sample statistics converge to the population parameter, so this sampling distribution centers on the true proportion.

The figure above illustrates this concept well: both the population distribution and the sampling distribution are centered on the same mean. We can use this characteristic to estimate the true population parameter, which in our case is the proportion. Since our sample proportion falls somewhere in the sampling distribution, we can construct a range, with a certain confidence level, for where it lies relative to the population proportion using the t-statistic.
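A quick simulation makes this concrete: repeatedly draw samples, compute each sample proportion, and check that the resulting sampling distribution centres on the true proportion. The “true” proportion of 0.55 below is just an assumed value for illustration.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

TRUE_P = 0.55      # assumed true population proportion (illustrative)
N = 100            # size of each sample
NUM_SAMPLES = 10_000

# Draw 10,000 samples of size N and record each sample proportion.
props = [
    sum(random.random() < TRUE_P for _ in range(N)) / N
    for _ in range(NUM_SAMPLES)
]

# The mean of the sampling distribution sits on top of TRUE_P.
mean_of_props = statistics.mean(props)
print(f"mean of sample proportions: {mean_of_props:.4f}")
```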

How Does It Relate To Linear Regression?

Linear regression is a method of fitting a line to the data that minimizes the error. We can then use that line to predict future data.

For example, suppose we are trying to predict an employee’s salary based upon years of experience, and we have historical data of employees with their years of experience and salaries. We can use either gradient descent or the normal equation to find the line that gives us the lowest error. Essentially, we are finding the y-intercept and slope that fit best, giving us the line y = mx + b.

What is this line really telling us? In a simple, generic way, we can interpret it as the line that gives us the lowest error when plotted against our sample data. We can expand on this idea: according to the sample data, this line is our best estimate of the true population slope and y-intercept. Does this sound similar to confidence intervals and hypothesis testing? The idea is essentially the same. We have some samples we’ve collected, and we would like to estimate the true population parameters by computing statistics on those samples.
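As a sketch, here is the normal-equation route for a toy version of the salary example. The data values below are made up purely for illustration.

```python
import numpy as np

# Made-up historical data: years of experience vs. salary (in $1000s).
years = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 10.0])
salary = np.array([40.0, 45.0, 52.0, 60.0, 72.0, 90.0])

# Design matrix with an intercept column: each row is [1, x].
X = np.column_stack([np.ones_like(years), years])

# Normal equation: beta = (X^T X)^(-1) X^T y, solved as a linear system.
# beta[0] is the y-intercept b, beta[1] is the slope m in y = mx + b.
beta = np.linalg.solve(X.T @ X, X.T @ salary)
b, m = beta

print(f"y = {m:.2f}x + {b:.2f}")
```

For this toy data the fit comes out to roughly y = 5.49x + 34.22; gradient descent on the same data would converge to the same line.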

Hypothesis Testing/Confidence Interval: We are trying to estimate the true population proportion/mean given data from the samples.

Linear Regression: We are trying to estimate the true population regression slope/y-intercept given data from the samples.
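To make the parallel explicit, we can put a confidence interval around the regression slope itself, just as we did for the proportion. This is a sketch with made-up data; the 95% t critical value for 4 degrees of freedom (about 2.776) is hard-coded to keep it dependency-free.

```python
import math

# Made-up data: years of experience vs. salary (in $1000s).
xs = [1.0, 2.0, 3.0, 5.0, 7.0, 10.0]
ys = [40.0, 45.0, 52.0, 60.0, 72.0, 90.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

m = sxy / sxx            # least-squares slope
b = y_bar - m * x_bar    # least-squares y-intercept

# Residual standard error, with n - 2 degrees of freedom.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Standard error of the slope, and a 95% CI using t(0.975, df = 4).
se_m = s / math.sqrt(sxx)
t_crit = 2.776
print(f"slope = {m:.3f} +/- {t_crit * se_m:.3f}")
```

Just as with the proportion, the interval estimates where the true population slope plausibly lies given only the sample.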

This illustrates the idea of population and sample really well. While we often don’t know the true mean, proportion, or regression slope of a population, we can estimate it relatively well by taking samples and using statistics on those samples to make inferences about the population. This basic idea is the backbone of what makes statistics so powerful.

Conclusion

I hope you were able to gain a deeper intuition for how random samples are used to make effective inferences about a population, and why this is so powerful.

Thanks for reading and have a nice day! 🎯
