x <- rnorm(500, mean = 35, sd = 9) # draw 500 random values from a normal distribution with mean 35 and sd 9
Histogram of the data
hist(x, prob=TRUE, col="grey")
curve(dnorm(x,mean=35,sd=9),0,70,add=TRUE,lwd=2,col="red")
What exactly are we getting as the mean and sd?
mean(x)
## [1] 35.31345
sd(x)
## [1] 8.498389
Now we are going to draw 500 random samples of size 5 from the same normal distribution that generated x and find the average of each sample. We will have 500 sample means. Our interest is to analyze the distribution of these 500 averages.
mu= 35
sigma=9
n=5
#Sample mean
xbar = rep(0,500) # repeats zero 500 times
for (i in 1:500) { xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) }
hist(xbar, prob=TRUE , col="grey")
The distribution of these 500 sample averages is called the sampling distribution of the sample mean.
Note: If the parent distribution is normal, then the sampling distribution of the sample mean is also normal, for any sample size. Mathematically, if \(X \sim N(mean= \mu, sd= \sigma)\) then \(\bar{X} \sim N(mean= \mu, sd= \frac{\sigma}{\sqrt{n}})\). In the above example, \(n=5\) (the sample size), so the sample means have standard deviation \(9/\sqrt{5} \approx 4.02\).
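As a rough check of the note above, the following sketch compares the standard deviation of simulated sample means with \(\sigma/\sqrt{n}\). The seed, the number of replications, and the names `n_reps` and `xbar_sim` are illustrative assumptions, not part of the original analysis.

```r
set.seed(1)          # arbitrary seed, assumed here only for reproducibility
mu <- 35
sigma <- 9
n <- 5
n_reps <- 500        # number of simulated samples (illustrative)

# replicate() draws n_reps samples of size n and stores each sample mean
xbar_sim <- replicate(n_reps, mean(rnorm(n, mean = mu, sd = sigma)))

theoretical_se <- sigma / sqrt(n)   # 9 / sqrt(5), about 4.02
empirical_se <- sd(xbar_sim)        # should land close to the value above

c(theoretical = theoretical_se, empirical = empirical_se)
```

With only 500 replications the two numbers will not match exactly, but they should agree to within a few tenths.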
Let \(Y\) follow a beta distribution with parameters alpha = 12 and beta = 1.
y<- rbeta(1000, 12,1)
hist(y, prob=TRUE, col= "grey")
Note: the above data are highly skewed, with a long left tail.
Now we are going to draw 10000 random samples of size 5 from the same beta distribution and find the average of each sample. We will have 10000 sample means. Our interest this time is to analyze the distribution of these 10000 averages.
alpha=12
beta=1
n=5
ybar=rep(0,10000)
for(i in 1:10000) {
ybar[i]= mean(rbeta(n, alpha, beta))
}
hist(ybar, prob=TRUE, col= "grey", main= "sampling distribution when n= 5")
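Wow! The sampling distribution is already roughly bell-shaped, although with samples of only 5 observations some left skew from the parent distribution remains.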
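Note: By the central limit theorem, regardless of the parent distribution (provided it has finite variance), the sampling distribution of the mean is approximately normal when the sample size is large enough. A common rule of thumb is a sample size of at least 30.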
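To see how the approximation improves with sample size, the sketch below measures the skewness of simulated sample means from the Beta(12, 1) parent at n = 5 and n = 30. The helpers `sample_skewness` and `simulate_means`, the seed, and the replication count are assumptions introduced for illustration; base R has no built-in skewness function, so a simple moment-based version is written out.

```r
set.seed(2)   # arbitrary seed (assumption)

# Moment-based sample skewness: third central moment over variance^(3/2)
sample_skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Simulate n_reps sample means, each from a Beta(12, 1) sample of size n
simulate_means <- function(n, n_reps = 10000) {
  replicate(n_reps, mean(rbeta(n, 12, 1)))
}

skew_n5 <- sample_skewness(simulate_means(5))
skew_n30 <- sample_skewness(simulate_means(30))

c(n5 = skew_n5, n30 = skew_n30)
```

Both values should be negative (left skew), with the n = 30 value noticeably closer to zero.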
What happens if we increase the sample size from 5 to 10, 15, 20, and 30?
y<- rbeta(1000, 12,1)
#for
n= 5
sd = sd(y)/ sqrt(5)
sd
## [1] 0.03243855
#for
n= 10
sd = sd(y)/ sqrt(10)
sd
## [1] 0.02293752
#for
n= 15
sd = sd(y)/ sqrt(15)
sd
## [1] 0.01872841
#for
n= 20
sd = sd(y)/ sqrt(20)
sd
## [1] 0.01621928
#for
n= 30
sd = sd(y)/ sqrt(30)
sd
## [1] 0.01324298
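The values above plug the sample standard deviation of y into sd/sqrt(n). As a cross-check, the sketch below simulates the sample means directly for each sample size and compares their standard deviation with the exact value from the Beta(12, 1) variance formula. The seed, the replication count, and the names `beta_par`, `pop_sd`, and `empirical_se` are illustrative assumptions.

```r
set.seed(3)                       # arbitrary seed (assumption)
alpha <- 12
beta_par <- 1                     # named beta_par to avoid masking base::beta
sizes <- c(5, 10, 15, 20, 30)
n_reps <- 10000                   # simulated samples per sample size

# Exact standard deviation of a Beta(alpha, beta_par) variable
pop_sd <- sqrt(alpha * beta_par /
               ((alpha + beta_par)^2 * (alpha + beta_par + 1)))

# For each sample size, simulate n_reps sample means and take their sd
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(n_reps, mean(rbeta(n, alpha, beta_par))))
})

data.frame(n = sizes,
           theoretical_se = pop_sd / sqrt(sizes),
           empirical_se = empirical_se)
```

The two columns should track each other closely and both shrink as n grows.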
As the sample size increases, the standard deviation of ybar, also called the standard error, decreases. Let’s see the sampling distribution of 10000 sample means of size 30 (from the beta distribution).
alpha=12
beta=1
n=30
ybar=rep(0,10000)
for(i in 1:10000) {
ybar[i]= mean(rbeta(n, alpha, beta))
}
hist(ybar, prob=TRUE, col= "grey", main= "sampling distribution when n= 30")
This looks better! If we increase the sample size, the standard error decreases, so the sample means cluster more tightly around the population mean. Separately, the skewness of the sampling distribution shrinks in proportion to \(1/\sqrt{n}\). Thus, the histogram of the sampling distribution with sample size 30 looks more symmetrical than that of size 5.