Sampling Distribution: If the Parent Distribution is Normal

x <- rnorm(500, mean = 35, sd = 9) # draw 500 random values from a normal distribution with mean 35 and sd 9

Histogram of the data

hist(x, prob=TRUE, col="grey")
curve(dnorm(x, mean=35, sd=9), 0, 70, add=TRUE, lwd=2, col="red")

What mean and standard deviation do we actually get?

mean(x)
## [1] 35.31345
sd(x)
## [1] 8.498389

Now we are going to draw 500 random samples of size 5 from the same normal distribution and find the average of each sample. This gives 500 sample means. Our interest is in the distribution of these 500 averages.

mu= 35
sigma=9
n=5

# Sample means
xbar = rep(0, 500) # vector of 500 zeros that will hold the sample means

for (i in 1:500) { xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) }
hist(xbar, prob=TRUE , col="grey")

The distribution of these 500 sample averages approximates the sampling distribution of the sample mean.

Note: If the parent distribution is normal, then the sampling distribution of the mean is also normal. Mathematically, if \(X \sim N(mean= \mu, sd= \sigma)\) then \(\bar{X} \sim N(mean= \mu, sd= \frac{\sigma}{\sqrt{n}})\), where \(n\) is the sample size. In the above example, \(n=5\).
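As a rough numerical check of the note above, the sketch below repeats the simulation and compares the standard deviation of the simulated means with \(\sigma/\sqrt{n}\). The seed and the name `xbar_check` are assumptions added for illustration, and `replicate()` is used here in place of the for loop.

```r
set.seed(1)  # assumed seed, only so the check is reproducible

mu <- 35
sigma <- 9
n <- 5

# 500 sample means, each from a fresh sample of size n
xbar_check <- replicate(500, mean(rnorm(n, mean = mu, sd = sigma)))

# theoretical standard deviation of the sample mean: sigma / sqrt(n)
theoretical_se <- sigma / sqrt(n)

c(mean_of_means = mean(xbar_check),
  sd_of_means = sd(xbar_check),
  theoretical_se = theoretical_se)
```

The mean of the simulated averages should land near 35 and their standard deviation near 9 / sqrt(5), which is about 4.02. The match is only approximate because just 500 samples are drawn.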

Sampling Distribution: If the Parent Distribution is Non-Normal

Let \(Y\) follow a beta distribution with parameters alpha = 12 and beta = 1.

y<- rbeta(1000, 12,1)
hist(y, prob=TRUE, col= "grey")

Note: the data above are highly skewed, with a long left tail.
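One simple way to see the left skew in numbers rather than in the histogram is to compare the mean with the median: a long left tail pulls the mean below the median. The sketch below does this for a fresh beta sample. The seed and the name `y_check` are assumptions for illustration.

```r
set.seed(2)  # assumed seed, only for reproducibility

y_check <- rbeta(1000, 12, 1)

# sample mean and median; for left-skewed data the mean is the smaller one
c(mean = mean(y_check), median = median(y_check))

# theoretical values for Beta(12, 1): mean = alpha / (alpha + beta)
c(theoretical_mean = 12 / (12 + 1),
  theoretical_median = qbeta(0.5, 12, 1))
```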

Now we are going to draw 10,000 random samples of size 5 from this beta distribution and find the average of each sample. This gives 10,000 sample means, and our interest this time is in the distribution of these 10,000 averages.

alpha=12
beta=1
n=5

ybar = rep(0, 10000) # vector that will hold the 10000 sample means

for (i in 1:10000) {
  ybar[i] = mean(rbeta(n, alpha, beta)) # mean of one sample of size n
}

hist(ybar, prob=TRUE, col= "grey",  main= "sampling distribution when n= 5")

Wow! The sampling distribution is already much closer to a bell shape than the parent distribution, although with n = 5 some left skew remains.

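Note: By the central limit theorem, for any parent distribution with finite variance the sampling distribution of the mean becomes approximately normal as the sample size grows. A sample size of at least 30 is a common rule of thumb, not a guarantee.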
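Stated a little more formally, in a standard textbook form that is not part of the original text: if \(X_1, \dots, X_n\) are independent draws from a distribution with mean \(\mu\) and finite standard deviation \(\sigma\), then

\[
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty .
\]

So for large \(n\), \(\bar{X}_n\) is approximately \(N(mean= \mu, sd= \frac{\sigma}{\sqrt{n}})\). This is the same form as in the normal case, but only as an approximation whose quality depends on how skewed the parent distribution is.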

What happens if we increase the sample size from 5 to 10, 15, 20, and 30?

y<- rbeta(1000, 12,1)
# for n = 5
n = 5
se = sd(y) / sqrt(n) # named se so the built-in sd() is not shadowed
se
## [1] 0.03243855
# for n = 10
n = 10
se = sd(y) / sqrt(n)
se
## [1] 0.02293752
# for n = 15
n = 15
se = sd(y) / sqrt(n)
se
## [1] 0.01872841
# for n = 20
n = 20
se = sd(y) / sqrt(n)
se
## [1] 0.01621928
# for n = 30
n = 30
se = sd(y) / sqrt(n)
se
## [1] 0.01324298

As the sample size increases, the standard deviation of ybar, also called the standard error, decreases. Let’s look at the sampling distribution of 10,000 sample means of size 30 (from the beta distribution).
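The standard errors above are estimated from a single sample `y`. As a cross-check, the sketch below compares the theoretical standard error, computed from the known variance of a beta distribution, with the standard deviation of simulated sample means for each sample size. The seed and the names `sizes`, `sigma_y`, `theoretical_se` and `empirical_se` are assumptions for illustration.

```r
set.seed(3)  # assumed seed, only for reproducibility

alpha <- 12
beta <- 1
sizes <- c(5, 10, 15, 20, 30)

# standard deviation of Beta(alpha, beta):
# sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)))
sigma_y <- sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)))

# theoretical standard error for each sample size
theoretical_se <- sigma_y / sqrt(sizes)

# empirical standard error: sd of 10000 simulated sample means per size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(10000, mean(rbeta(n, alpha, beta))))
})

data.frame(n = sizes, theoretical_se, empirical_se)
```

The two columns should agree closely and both should shrink in proportion to 1 / sqrt(n).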

alpha=12
beta=1
n=30

ybar = rep(0, 10000) # vector that will hold the 10000 sample means

for (i in 1:10000) {
  ybar[i] = mean(rbeta(n, alpha, beta)) # mean of one sample of size n
}

hist(ybar, prob=TRUE, col= "grey", main= "sampling distribution when n= 30")

This looks better! As the sample size increases, the standard error decreases and, separately, the skewness of the sampling distribution shrinks. Thus, the histogram of the sampling distribution with sample size 30 is both narrower and more symmetrical than the one with sample size 5.
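The visual comparison can be backed up with a number. The sketch below computes a moment-based sample skewness for simulated means of size 5 and size 30. The `skewness` helper is a hypothetical function written here in base R (it is not a built-in), and the seed and the names `ybar_5` and `ybar_30` are assumptions for illustration.

```r
set.seed(4)  # assumed seed, only for reproducibility

# hypothetical helper: moment-based sample skewness (0 for a symmetric shape)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# 10000 sample means for each sample size, from Beta(12, 1)
ybar_5  <- replicate(10000, mean(rbeta(5, 12, 1)))
ybar_30 <- replicate(10000, mean(rbeta(30, 12, 1)))

c(skew_n5 = skewness(ybar_5), skew_n30 = skewness(ybar_30))
```

Both values should be negative, reflecting the left tail of the parent distribution, with the value for n = 30 noticeably closer to zero.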