In this post, we take a detailed look at the Z-score and its applications with respect to the Gaussian (normal) distribution. We will also discuss quantiles and implement them to see how a particular distribution is divided into different quantiles.
In layman's terms, a Z-score shows how far a data point is from the mean.
More technically, it states how many standard deviations above or below the mean a particular value lies.
The curve shown above is a Gaussian, or normal, distribution curve. The centre of the curve is the mean.
The region within one standard deviation of the mean, on both the left and the right, covers about 68.27% of the area under the curve. Similarly, the region within two standard deviations covers about 95.45%, and the region within three standard deviations covers about 99.73%. This is the empirical rule (also known as the 68-95-99.7 rule) for the normal distribution.
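The coverage figures of the empirical rule can be checked numerically. The sketch below uses only the Python standard library and the identity P(|Z| < k) = erf(k / √2) for a standard normal variable; the function name `coverage_within` is just an illustrative choice. The exact values come out at roughly 68.27%, 95.45%, and 99.73%.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k):.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```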
Now let’s take a standard normal distribution, as shown above, which has a mean of zero and a standard deviation of 1. In that case, a Z-score of +1 says that we are 1 standard deviation above the mean. If it is +2, then we are 2 standard deviations above the mean, and so on.
Similarly, a Z-score of -1 says that we are 1 standard deviation below the mean.
A Z-score of -2 says that we are 2 standard deviations below the mean, and so on.
The Z-score formula for a single value, using the population parameters, is as follows:
z = (x - µ) / σ
where:
- x = score,
- µ = mean of the population,
- σ = Population Standard deviation
Now, let’s take an example to understand this concept better. Suppose we are considering the heights of the students in a class. Let’s say the mean height of the students is 150 cm and the standard deviation is 10 cm, and we have to find the probability that a student is taller than 165 cm, P(height > 165 cm).
Let’s see how the distribution looks in the normal distribution curve below:
If you take the above curve and map it onto a standard normal distribution curve, the value of 165 cm falls 1.5 standard deviations above the mean.
The reason is that 150 is the mean, so 1 SD above the mean is 160 and 2 SD above the mean is 170, which puts 165 at 1.5 SD above the mean. The formula gives the same result: z = (165 - 150) / 10 = 1.5.
Looking up z = 1.5 in the Z-score table below, the corresponding value is 0.9332 (circled in red), which means that the region of the curve below 165 cm is 93.32% of the whole curve, as shown below.
As the complete standard normal distribution covers 100% of the area, P(height > 165 cm) is 100 - 93.32 = 6.68%.
So the probability that a student in the class is taller than 165 cm is around 6.68%.
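The same calculation can be reproduced without a printed Z-score table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library, whose `cdf` method plays the role of the table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

mean_height = 150.0  # cm, from the example
sd_height = 10.0     # cm, from the example
threshold = 165.0    # cm

# Z-score: how many standard deviations the threshold lies above the mean.
z = (threshold - mean_height) / sd_height

# Area to the left of z under the standard normal curve (the Z-table value).
p_below = NormalDist().cdf(z)
# The total area is 1, so the right tail is the complement.
p_above = 1.0 - p_below

print(f"z = {z:.2f}")                       # z = 1.50
print(f"P(height < 165) = {p_below:.4f}")   # P(height < 165) = 0.9332
print(f"P(height > 165) = {p_above:.4f}")   # P(height > 165) = 0.0668
```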
Z-score for One Sample
In the above example, we considered the complete population.
We can calculate the z-score for one sample as well. The formula remains the same, with the sample statistics in place of the population parameters:
z = (xs - µs) / σs
where:
- xs = sample score,
- µs = sample mean,
- σs = sample standard deviation
The process for solving the z-score remains the same for samples.
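As a rough illustration, the sketch below computes a z-score for every value in a sample. The heights are made-up numbers, not data from this post, and the helper name `sample_z_scores` is hypothetical. It also assumes the sample standard deviation uses the n - 1 denominator (`statistics.stdev`); if you prefer the n denominator, swap in `statistics.pstdev`.

```python
from statistics import mean, stdev


def sample_z_scores(values):
    """Return the z-score of each value relative to the sample mean and sample SD."""
    sample_mean = mean(values)
    sample_sd = stdev(values)  # assumption: n - 1 denominator
    return [(x - sample_mean) / sample_sd for x in values]


# Hypothetical sample of heights in cm, for illustration only.
heights = [142, 148, 150, 153, 157, 160, 165]

for height, z in zip(heights, sample_z_scores(heights)):
    print(f"{height} cm -> z = {z:+.2f}")
```

Values below the sample mean get negative z-scores and values above it get positive ones, exactly as in the population case.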
Now that we have seen the formula for calculating the Z-score for one sample, let’s go ahead and understand how we can do this when we have multiple samples.
Z-score for Multiple Samples
The formula below gives the z-score when we have multiple samples:
z = (x - µ) / (σ / √n)
where:
- x is the sample mean,
- µ is the population mean,
- σ is the standard deviation,
- n is the sample size (the number of observations in each sample).
Let’s take an example: the mean weight of students in a class is 150 lbs with a standard deviation of 3.0 lbs. What is the probability of finding a random sample of 60 students with a mean weight of 170 lbs, assuming the weight is normally distributed?
- x =170,
- µ = 150,
- σ =3,
- n =60
As we are dealing with the sampling distribution of means, we have to include the standard error, σ / √n, in the formula when calculating the z-score: z = (170 - 150) / (3 / √60) ≈ 20 / 0.3873 ≈ 51.64.
Another thing to keep in mind here is the empirical rule that we discussed earlier.
According to the empirical rule, 99.73% of the values fall within 3 standard deviations of the mean in a normal distribution. Our z-score is about 51.64, which means the sample mean is about 51.64 standard errors above the population mean, so the probability that a random sample of 60 students has a mean weight of 170 lbs or more is far less than 1%; it is effectively zero.
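The arithmetic for this example can be sketched in a few lines of standard-library Python. Using the unrounded standard error, the z-score comes out at about 51.64 (rounding the standard error to 0.39 first gives a slightly smaller figure). The variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean = 170.0      # lbs
population_mean = 150.0  # lbs
population_sd = 3.0      # lbs
n = 60                   # sample size

# Standard error of the mean: SD of the sampling distribution of sample means.
standard_error = population_sd / math.sqrt(n)

# Z-score of the sample mean under that sampling distribution.
z = (sample_mean - population_mean) / standard_error

# Upper-tail probability; at this z it is too small for floating point to
# distinguish from zero.
p_at_least = 1.0 - NormalDist().cdf(z)

print(f"standard error = {standard_error:.4f}")  # standard error = 0.3873
print(f"z = {z:.2f}")                            # z = 51.64
print(f"P(sample mean >= 170) = {p_at_least}")
```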
Now that we have understood the concept of the Z-score, let’s go ahead and see how we can use it to detect outliers.
We will also look at another method of detecting outliers, which uses quantiles.