Statistics notation:
First aim: To propose a model for a random variable.
Generalization to multi-dimensional case:
Aim: To find a functional link between \(Y\) and the explanatory variables.
To find this functional link, the method to apply depends on the nature of the random variables.
| Y | Model |
|---|---|
| Numeric | Linear model |
| Qualitative (labels) | Classification |
Linear model
A linear model is given by:
\(Y_i = \beta_0 + \beta_1X^{1}_i + ... + \beta_pX^{p}_i + \varepsilon_i\)
where \(\beta_0, \beta_1, ..., \beta_p\) are unknown parameters, \(X^{1}_i, ..., X^{p}_i\) are the explanatory variables and \(\varepsilon_i\) is a random error term.
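As an illustration, here is a minimal R sketch (with simulated data, not an example from the notes) of fitting such a model with `lm()`:
## Simulate a linear model with p = 2 explanatory variables and fit it.
set.seed(1)
n  <- 200
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2*X1 - 0.5*X2 + rnorm(n, sd = 0.3)   # beta_0 = 1, beta_1 = 2, beta_2 = -0.5
fit <- lm(Y ~ X1 + X2)
coef(fit)   # estimates of beta_0, beta_1, beta_2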
Previous Knowledge
Given \(x_1, ..., x_n\) numeric observations, to try to find a correct parametric model, we can use 2 graphs:
| Plot type | Variable type | Density |
|---|---|---|
| Bar plot | Discrete | count / n |
| Histogram | Continuous | count for a bin / (n × length of the bin) |
Barplot for discrete variables
max.temp <- c(22, 27, 26, 24, 23, 26, 28)

barplot(max.temp,
        main = "Maximum Temperatures in a Week",
        xlab = "Day",
        ylab = "Degree Celsius",
        names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
Histogram for continuous variables
A <- rnorm(500, 0, 1)

## The default execution of this function doesn't generate a density:
hist(A)

## You need to set freq = FALSE:
hist(A, freq = FALSE)

## You can set the number of bins that you want to use:
hist(A, freq = FALSE, breaks = 100)

## But in order to create more breaks, you need to increase the
## number of observations:
A <- rnorm(50000, 0, 1)
hist(A, freq = FALSE, breaks = 100)
To propose a parametric model:
Examples:
Put them into R!!
Example 1:
Example 2:
Question: with such a representation, we can guess a parametric family of models, denoted by \(\{P_{\theta}, \theta \in \Theta \}\). How do we guess a correct value of \(\theta\) from the observations?
Answer: Estimation.
Let \(x_i\) be an observation of a r.v. \(X_i\); we assume that \(X_1, ..., X_n\) are i.i.d. with common distribution \(P_{\theta}\).
Definition: An estimator of \(\theta\) is just a function of \(X_1, ..., X_n\) that does not depend on other unknown parameters.
Remark: An estimator is a random variable!
Definition: An estimation is the value of an estimator computed thanks to the observations.
Example
Consider \(X_1, ..., X_n\) i.i.d. and exponentially distributed; an estimator of \(\lambda\) is \(\hat{\lambda}_n = \dfrac{n}{\sum X_i}\) and the corresponding estimation is \(\dfrac{n}{\sum x_i}\).
Distribution | Parameter | Estimator | Estimation |
---|---|---|---|
Exponential \(\xi(\lambda)\) | \(\lambda\) | \(\dfrac{n}{\sum X_i}\) | \(\dfrac{n}{\sum x_i}\) |
Definition: Let us consider \(\hat{\theta}_n\), an estimator of \(\theta\).
The bias of \(\hat{\theta}_n\) is defined by:
\[b(\hat{\theta}_n) := \mathbb{E}(\hat{\theta}_n) - \theta\]
The estimator is said to be unbiased if
\[\forall n \in \mathbb{N}^{+}, \quad b(\hat{\theta}_n) = 0\]
and asymptotically unbiased if
\[b(\hat{\theta}_n) \xrightarrow[n \rightarrow +\infty]{} 0\]
How to construct estimators?
Method of moments:
- fewer computations
- based on the Law of Large Numbers
Maximum likelihood
Let \(\theta\) be a parameter to estimate, associated with
\(X_1, ..., X_n\) i.i.d. r.v.
Let us consider \(k \in \mathbb{N}^*\).
If there exists a value of \(k\) such that the \(k\)-th (possibly centered) moment of \(X_1\) is a one-to-one function of \(\theta\),
then the estimator \(\hat{\theta}_n\) of \(\theta\) is obtained by equating this theoretical moment with the corresponding empirical moment (for \(k = 1\), \(\mathbb{E}_{\hat{\theta}_n}[X_1] = \overline{X}_n\)),
where
\[\overline{X}_n = \dfrac{1}{n}\sum X_i\]
is the empirical mean.
Remark: Exponential Distribution
Let \(X \sim \xi(\lambda)\); then \(\mathbb{E}[X] = \dfrac{1}{\lambda}\) and \(V[X] = \dfrac{1}{\lambda^2}\).
Transfer formula:
\[\mathbb{E}[l(X)] = \int_{\mathbb{R}} l(x)\cdot f_\lambda(x) dx\]
Solution:
Let \(X \sim \xi(\lambda)\), so:
\[\mathbb{E}[X] = \dfrac{1}{\lambda} \quad \text{and} \quad V[X] = \dfrac{1}{\lambda^2}\] By applying the method of moments (\(k=1\)) we get:
\(\dfrac{1}{\hat{\lambda}_{n,1}} = \dfrac{1}{n}\sum X_i\)
Thus:
\[\hat{\lambda}_{n,1} = \dfrac{n}{\sum X_i}\]
In the same way, but using the variance (\(k=2\)), we get:
\(\dfrac{1}{\hat{\lambda}_{n,2}^2} = \dfrac{1}{n}\sum (X_i - \overline{X}_n)^2\)
Thus:
\[\hat{\lambda}_{n,2} = \dfrac{\sqrt{n}}{\sqrt{\sum (X_i - \overline{X}_n)^2}}\]
A <- rexp(500, 4)

1/mean(A)
## [1] 4.032975

m <- c()
for (i in 1:50) {
  A <- rexp(500, 4)
  m[i] <- 1/mean(A)
}
mean(m)
## [1] 4.036872
boxplot(m)

## With more observations we get less variation (500 -> 50000)
## Law of large numbers
m <- c()
for (i in 1:50) {
  A <- rexp(50000, 4)
  m[i] <- 1/mean(A)
}
mean(m)
## [1] 3.995492
boxplot(m)
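A similar quick check (a sketch, not in the original code; it reuses the same setting with true \(\lambda = 4\)) for the second estimator \(\hat{\lambda}_{n,2}\) based on the variance:
## Check the k = 2 (variance-based) estimator of lambda.
A <- rexp(50000, 4)
lambda_hat_2 <- sqrt(length(A)) / sqrt(sum((A - mean(A))^2))
lambda_hat_2   # should also be close to 4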
Solution:
Let \(X \sim \mathcal{U}([0, \sigma])\).
\[\begin{align*} \mathbb{E}(X) = \int_{\mathbb{R}} x\cdot f_X dx = \int_{\mathbb{R}} x\cdot \dfrac{1}{\sigma}\cdot \mathbf{1}_{[0, \sigma]} dx = \dfrac{1}{\sigma}\int_{0}^{\sigma} x dx = \dfrac{1}{\sigma} \dfrac{x^2}{2} \Big|_0^\sigma = \dfrac{\sigma}{2} \end{align*}\]
By the method of moments, we get that an estimator \(\hat{\sigma}_n\) is solution of:
\(\dfrac{\hat{\sigma}_n}{2} = \dfrac{1}{n}\sum X_i = \overline{X}_n\)
Thus:
\[\hat{\sigma}_n = 2\overline{X}_n\] Since \(\hat{\sigma}_n\) is a function of \(X_1, ..., X_n\), it is an estimator.
Let’s compute the bias for this estimator, let \(n \in \mathbb{N}^*\).
\(b(\hat{\sigma}_n ) = \mathbb{E}(\hat{\sigma}_n) - \sigma\)
First, compute the expected value of \(\hat{\sigma}_n\):
\[\begin{align*} \mathbb{E}(\hat{\sigma}_n) &= \mathbb{E}(2\overline{X}_n)\\ &= \mathbb{E}\bigg[\dfrac{2}{n}\sum X_i\bigg]\\ &= \dfrac{2}{n}\sum\mathbb{E}[ X_i] \quad\text{(by linearity)}\\ &= \dfrac{2}{n} \cdot n \cdot\mathbb{E}[X_1] \quad\text{(because they are identically distributed)}\\ &= \sigma \end{align*}\]
Therefore \(b(\hat{\sigma}_n ) = 0\) and we conclude that \(\hat{\sigma}_n = 2\overline{X}_n\) is an unbiased estimator for \(\sigma\).
Now consider the case \(X \sim \mathcal{U}([-\sigma, \sigma])\). Since \(\mathbb{E}[X_1] = 0\) is not a function of \(\sigma\), we can't use the first moment, so we use the second moment:
\(V[X_1] = \dfrac{\sigma^2}{3}\)
then we have:
\[\hat{\sigma}_n = \sqrt{\dfrac{3}{n}\sum(X_i - \overline{X}_n)^2}\]
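A quick numerical sketch (not in the original notes; \(\sigma = 4\) is an arbitrary choice) to check both uniform estimators:
## U([0, sigma]): the k = 1 estimator 2 * mean should be close to sigma.
sigma <- 4
X <- runif(100000, 0, sigma)
2*mean(X)

## U([-sigma, sigma]): the k = 2 estimator based on the variance.
Y <- runif(100000, -sigma, sigma)
sqrt(3*mean((Y - mean(Y))^2))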
Definition: Let \(X_1, ..., X_n\) be independent random variables whose distributions all depend on the same parameter \(\theta\), and let \(x_1, ..., x_n\) be observations of those r.v. The likelihood is defined by:
\[ \mathcal{L}(x_1, ..., x_n, \theta) = \begin{cases} \displaystyle \prod_{i =1}^{n} P_{\theta}(X_i = x_i) \quad \text{(discrete framework)} \\ \displaystyle \prod_{i =1}^{n} f_{X_i, \theta}(x_i) \quad \text{(continuous framework)} \end{cases} \]
where \(f_{X_i, \theta}\) is the density function of \(X_i\).
Definition: \(\hat{\theta}_n\), the maximum likelihood estimator of \(\theta\), is a solution of:
\[ \mathcal{L}(x_1, ..., x_n, \hat{\theta}_n) = \max_{\theta} \mathcal{L}(x_1, ..., x_n, \theta) \]
Solution:
Let \(x_1, ..., x_n \in \mathbb{R}\).
\[\begin{align*} \mathcal{L}(x_1, ..., x_n, \theta) &= \displaystyle \prod_{i =1}^{n} \lambda e^{-\lambda x_i}\cdot\mathbf{1}_{x_i \geq 0} \\ &= \lambda^n e^{-\lambda \sum x_i}\cdot \mathbf{1}_{\min{(x_i) \geq 0}} \end{align*}\]
we need to maximize with respect to \(\lambda\); since \(\mathbf{1}_{\min(x_i) \geq 0}\) does not depend on \(\lambda\) we can ignore it and consider:
\[h(\lambda) = \lambda^n e^{-\lambda \sum x_i}\]
taking \(\log\) of both sides, we get a nicer expression to work with:
\[g(\lambda) = \log(h(\lambda)) = \log(\lambda^n e^{-\lambda \sum x_i}) = n\log(\lambda) - \lambda \sum x_i\] Since
\(g'(\lambda) = \dfrac{n}{\lambda} - \sum x_i\), the point \(\lambda = \dfrac{n}{ \sum x_i} = \dfrac{1}{\overline{x}_n}\) is a critical point,
and \(g''(\lambda) = \dfrac{-n}{\lambda^2} <0\), so this critical point corresponds to a maximum.
So \(\lambda = \dfrac{1}{\overline{x}_n}\) is the solution of the maximization problem and therefore the maximum likelihood estimator is:
\[\hat{\lambda}_n = \dfrac{n}{ \sum X_i}\]
Solution:
Compute the maximum likelihood estimator:
\[\begin{align*} \mathcal{L}(x_1, ..., x_n, \theta) &= \displaystyle \prod_{i =1}^{n} f_{\theta}(x_i) \\ &= \prod_{i =1}^{n} \dfrac{1}{\theta}\cdot \mathbf{1}_{[0, \theta]}(x_i) \\ &= \dfrac{1}{\theta^n} \cdot \mathbf{1}_{\min (x_i) \geq 0} \cdot \mathbf{1}_{\max(x_i) \leq \theta} \end{align*}\]
Since \(\mathbf{1}_{\min (x_i) \geq 0}\) does not depend on \(\theta\), we can ignore it and consider:
\[g(\theta) = \dfrac{1}{\theta^n} \cdot \mathbf{1}_{\max(x_i) \leq \theta} = \dfrac{1}{\theta^n} \cdot \mathbf{1}_{[\max(x_i), +\infty[} (\theta)\]
We can't simply solve \(g'(\theta) = 0\), since this function is not differentiable at all points.
But since the function \(\dfrac{1}{t^n}\) is decreasing, \(g\) is maximized at the smallest admissible value of \(\theta\), so the maximum likelihood estimator is given by:
\[ \hat{\theta}_n = \max(X_i) \]
Compute the bias of this estimator and transform it into an unbiased one.
In order to compute the bias of this estimator, we have to compute: \(\mathbb{E}(\hat{\theta}_n) = \mathbb{E}(\max(X_i))\).
To do this we need to determine the density function of this new random variable \(\max(X_i)\).
How to compute a density?
Let \(t \in \mathbb{R}\).
\[\begin{align*} F_{\hat{\theta}_n}(t) &= P(\hat{\theta}_n \leq t) \\ &= P(\max(X_i) \leq t) \\ &= P(X_1 \leq t, ..., X_n \leq t) \\ &= \displaystyle \prod_{i =1}^{n} P(X_i \leq t) \quad \text{(by independence of the r.v.)} \\ &= (P(X_1 \leq t))^n \quad \text{(because they are identically distributed)} \end{align*}\]
Since
\[ P(X_1 \leq t) = \int_{-\infty}^{t} \dfrac{1}{\theta} \mathbf{1}_{[0, \theta]}(x)dx = \begin{cases} 0 & t < 0 \\ \\ \dfrac{t}{\theta} & t \in [0,\theta] \\\\ 1 & t > \theta \end{cases} \]
We conclude:
\[ F_{\hat{\theta}_n}(t) = \begin{cases} 0 & t < 0 \\\\ \dfrac{t^n}{\theta^n} & t \in [0,\theta] \\\\ 1 & t > \theta \end{cases} \]
Using the relation between the distribution and density function we have:
\[ f_{\hat{\theta}_n}(t) = F'_{\hat{\theta}_n}(t) = n\cdot\dfrac{t^{n-1}}{\theta^n}\cdot\mathbf{1}_{[0, \theta]}(t) \]
\[ \mathbb{E}[\hat{\theta}_n] = \int_{\mathbb{R}} x \cdot f_{\hat{\theta}_n}(x) dx = \int_{0}^{\theta} n\dfrac{x^{n}}{\theta^n} dx = \dfrac{n}{n+1}\cdot\dfrac{\theta^{n+1}}{\theta^n} = \dfrac{n}{n+1} \theta\]
Now we have:
\[ b(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta = -\dfrac{\theta}{n+1} \neq 0\]
Therefore \(\hat{\theta}_n\) is not an unbiased estimator for \(\theta\). But since \(\displaystyle \lim_{n\to\infty} b(\hat{\theta}_n) = 0\) it is an asymptotically unbiased estimator for \(\theta\).
How to obtain an unbiased estimator?
We have \(\mathbb{E}[\hat{\theta}_n] - \theta = -\dfrac{\theta}{n+1}\), then:
\[\begin{align*} \mathbb{E}[\hat{\theta}_n] - \theta + \dfrac{\theta}{n+1} &= 0 \\ \iff & \\ \mathbb{E}\bigg[\hat{\theta}_n + \dfrac{\theta}{n+1}\bigg] - \theta &= 0 \\ \end{align*}\]
Is it ok to consider \(\tilde{\theta}_n = \hat{\theta}_n + \dfrac{\theta}{n+1}\) as an unbiased estimator of \(\theta\)?
The answer is no: \(\tilde{\theta}_n\) depends on an unknown parameter (\(\theta\) in this case), so by definition it is not an estimator!
We have:
\[\begin{align*} \mathbb{E}[\hat{\theta}_n] - \theta &= -\dfrac{\theta}{n+1} \\ \iff \mathbb{E}[\hat{\theta}_n] &= \theta - \dfrac{\theta}{n+1} \\ \iff \mathbb{E}[\hat{\theta}_n] &= \dfrac{n}{n+1}\theta \\ \iff \mathbb{E}\bigg[\dfrac{n+1}{n}\hat{\theta}_n\bigg] &= \theta \\ \iff \mathbb{E}\bigg[\dfrac{n+1}{n}\hat{\theta}_n\bigg] - \theta &= 0 \end{align*}\]
Thus, \(\dfrac{n+1}{n}\hat{\theta}_n\) is an unbiased estimator for \(\theta\).
## Comparing the estimators for the uniform distribution.
n <- 100

U <- runif(n, 0, 4)
theta <- max(U)
for (i in 1:50) {
  U <- runif(n, 0, 4)
  theta <- c(theta, max(U))
}
mean(theta)
## [1] 3.957206
boxplot(theta)

## Adjust to make the estimator unbiased
thetab <- (n+1)/n*theta
mean(thetab)
## [1] 3.996778
boxplot(thetab)

## With more replications of the experiment (50 -> 5000)
n <- 100
U <- runif(n, 0, 4)
theta <- max(U)
for (i in 1:5000) {
  U <- runif(n, 0, 4)
  theta <- c(theta, max(U))
}
mean(theta)
## [1] 3.960439
boxplot(theta)
Let \(X_1, ..., X_n\) be i.i.d. random variables.
A classical estimator for:
\(\mu\) is \[\hat{\mu}_n = \overline{X}_n\]
\(\sigma^2\) is
\[\hat{\sigma}_n^2 = \dfrac{1}{n} \sum(X_i - \overline{X}_n)^2\]
Exercise
Show that \(\hat{\mu}_n\) is an unbiased estimator of \(\mu\), and that \(\hat{\sigma}_n^2\) is a biased estimator of \(\sigma^2\) (compute its bias).
Solution:
\[\begin{align*} \mathbb{E}[\hat{\mu}_n] &= \mathbb{E}\bigg[\dfrac{1}{n} \sum X_i\bigg] \\ &= \dfrac{1}{n}\sum \mathbb{E}[X_i] \quad \text{(by linearity)} \\ &= \dfrac{1}{n}n\cdot \mathbb{E}[X_1] \quad \text{(as they are identically distributed)} \\ &= \mathbb{E}[X_1] \end{align*}\]
Therefore: \(\hat{\mu}_n = \overline{X}_n\) is an unbiased estimator for \(\mu\).
First note that we can rewrite \(\hat{\sigma}_n^2\) as follows:
\[\begin{align*} \hat{\sigma}_n^2 &= \dfrac{1}{n} \sum_{i =1}^n(X_i - \overline{X}_n)^2 \\ &= \dfrac{1}{n} \sum_{i =1}^n \big(X_i^2 - 2X_i\overline{X}_n + \overline{X}_n^2\big) \\ &= \dfrac{1}{n} \sum_{i =1}^n X_i^2 - 2\overline{X}_n \bigg(\dfrac{1}{n} \sum_{i =1}^n X_i\bigg)+ \overline{X}_n^2 \quad \text{(keeping track of which factors depend on } i\text{)} \\ &= \dfrac{1}{n} \sum_{i =1}^n X_i^2 - 2\overline{X}_n^2 + \overline{X}_n^2 \\ &= \dfrac{1}{n} \sum_{i =1}^n X_i^2 - \overline{X}_n^2 \\ \end{align*}\]
Now use this expression to compute the expectation:
\[\begin{align*} \mathbb{E}[\hat{\sigma}_n^2] &= \mathbb{E}\bigg[\dfrac{1}{n} \sum_{i =1}^n X_i^2 - \overline{X}_n^2\bigg] \\ &= \dfrac{1}{n}\sum_{i =1}^n\mathbb{E}[X_i^2] - \mathbb{E}\bigg[\overline{X}_n^2\bigg] \quad \text{(by linearity)} \\ &= \dfrac{1}{n}\sum_{i =1}^n (V[X_i] +\mathbb{E}[X_i]^2) - ( V[\overline{X}_n] +\mathbb{E}[\overline{X}_n]^2) \quad \text{(by using the variance formula)} \\ &= \sigma^2 + \mu^2 - V[\overline{X}_n] - \mu^2 \quad \text{(by substituting the expressions of the variance and expectation)} \\ &= \sigma^2 - V[\overline{X}_n] \end{align*}\]
Now let's compute \(V[\overline{X}_n]\):
\[\begin{align*} V[\overline{X}_n] &= V\bigg[\dfrac{1}{n} \sum_{i =1}^n X_i \bigg] \\ &= \dfrac{1}{n^2} V\bigg[\sum_{i =1}^n X_i \bigg] \quad \text{(since } V[aZ] = a^2V[Z]\text{)} \\ &= \dfrac{1}{n^2} \sum_{i =1}^n V[X_i] \quad \text{(because they are independent)} \\ &= \dfrac{\sigma^2}{n} \end{align*}\]
Putting all together we get:
\[ b(\hat{\sigma}_n^2) = \mathbb{E}[\hat{\sigma}_n^2] - \sigma^2 = - V[\overline{X}_n] = - \dfrac{\sigma^2}{n} \]
so \(\hat{\sigma}_n^2\) is a biased estimator.
But we can modify it to obtain an unbiased version by noticing that:
\[ \mathbb{E}\bigg[\dfrac{n}{n-1}\hat{\sigma}_n^2\bigg]=\sigma^2\] Therefore:
\[\hat{\sigma}_{n-1}^2 = \dfrac{n}{n-1}\hat{\sigma}_n^2 = \dfrac{1}{n-1} \sum_{i=1}^n(X_i - \overline{X}_n)^2\]
is an unbiased estimator for \(\sigma^2\).
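A small simulation sketch (not in the original notes; it uses standard Gaussian samples, so the true \(\sigma^2\) is 1) illustrating that the \(1/n\) version underestimates \(\sigma^2\) while the \(1/(n-1)\) version does not:
## Compare the two variance estimators over many samples of size n = 10.
n <- 10
biased   <- replicate(20000, { x <- rnorm(n); mean((x - mean(x))^2) })
unbiased <- replicate(20000, { x <- rnorm(n); var(x) })   # var() uses 1/(n-1)
mean(biased)     # close to (n-1)/n = 0.9
mean(unbiased)   # close to 1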
Let \(X_1, ..., X_n\) be i.i.d. with \(\mu = \mathbb{E}[X_1]\) and \(\sigma^2 = V[X_1]\).
\[ \hat{\mu}_n = \overline{X}_n \quad \text{and its variance is} \quad \dfrac{\sigma^2}{n} \]
\[ \hat{\sigma}_{n-1}^2 = \dfrac{1}{n-1} \sum_{i=1}^n(X_i - \overline{X}_n)^2 \]
Definition: Let \(\theta\) be an unknown parameter and \(\hat{\theta}_n\) an estimator of \(\theta\); the mean quadratic error (MQE) is given by:
\[ MQE(\hat{\theta}_n) = \mathbb{E}[(\hat{\theta}_n - \theta)^2] \]
Property:
\[MQE(\hat{\theta}_n) = V[\hat{\theta}_n] + (b(\hat{\theta}_n))^2\]
Proof:
Let's rewrite the \(MQE\) definition.
\[\begin{align*} MQE(\hat{\theta}_n) &= \mathbb{E}[(\hat{\theta}_n - \theta)^2] \\ &= \mathbb{E}[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n] + \mathbb{E}[\hat{\theta}_n] - \theta)^2] \quad \text{(by adding and subtracting the same quantity)}\\ &= \underbrace{\mathbb{E}[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])^2]}_{V[\hat{\theta}_n]} + \underbrace{\mathbb{E}[(\mathbb{E}[\hat{\theta}_n] - \theta)^2]}_{(b(\hat{\theta}_n))^2} +2\underbrace{\mathbb{E}[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])(\mathbb{E}[\hat{\theta}_n] - \theta)]}_{A_n} \end{align*}\]
So to prove our result we just need to prove that \(A_n = 0\).
\[\begin{align*} A_n &= \mathbb{E}[\underbrace{(\mathbb{E}[\hat{\theta}_n] - \theta)}_{b(\hat{\theta}_n)} \cdot(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])] \\ &= \underbrace{b(\hat{\theta}_n)}_{\text{deterministic term}} \cdot \mathbb{E}[\underbrace{(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])}_{\text{a centered r.v.}}] \quad{\text{(the expectation of a centered r.v. is zero)}} \\ &= 0 \end{align*}\]
Definition: Let \(\theta\) be an unknown parameter and \(\hat{\theta}_{n, 1}\), \(\hat{\theta}_{n, 2}\) two estimators of \(\theta\).
\(\hat{\theta}_{n, 1}\) is better than \(\hat{\theta}_{n, 2}\) if
\[ \forall n \geq 1, \quad MQE(\hat{\theta}_{n, 1}) \leq MQE(\hat{\theta}_{n, 2}) \]
Let \(X_1, ..., X_n\) be i.i.d. \(\mathcal{U}([0, \theta])\); compare the following estimators to see if one is better than the others: \(\hat{\theta}_{n,1} = 2\overline{X}_n\), \(\hat{\theta}_{n,2} = \max(X_i)\) and \(\hat{\theta}_{n,3} = \dfrac{n+1}{n}\max(X_i)\).
Remember:
\(f_{\hat{\theta}_{n, 2}}(t) = n\dfrac{t^{n-1}}{\theta^n} \mathbf{1}_{[0, \theta]}(t)\)
Solution:
Lets compute the \(MQE\) of each estimator.
\[\begin{align*} MQE(\hat{\theta}_{n,1}) &= V[\hat{\theta}_{n, 1}] \quad \text{(it is unbiased)} \\ &= V[2\overline{X}_n] \\ &= 4V[\overline{X}_n] \\ &= \dfrac{4}{n}V[X_1] \quad \text{(by independence)} \\ &= \dfrac{4}{n}\cdot\dfrac{\theta^2}{12} \quad \text{(the variance of a } \mathcal{U}([0, \theta]) \text{ distribution)} \end{align*}\]
Thus:
\[MQE(\hat{\theta}_{n,1}) = \dfrac{\theta^2}{3n}\]
We have to compute \(V[\hat{\theta}_{n,2}]\), since
\[ V[\hat{\theta}_{n,2}] = \mathbb{E}[(\hat{\theta}_{n,2})^2] - \mathbb{E}[\hat{\theta}_{n,2}]^2 \] and
\[ \mathbb{E}[(\hat{\theta}_{n,2})^2] = \int_{\mathbb{R}} t^2 \cdot n\dfrac{t^{n-1}}{\theta^n} \mathbf{1}_{[0, \theta]}(t) dt = \dfrac{n}{n+2}\theta^2 \]
Therefore:
\[\begin{align*} MQE(\hat{\theta}_{n,2}) &= V[\hat{\theta}_{n,2}] + b(\hat{\theta}_{n,2})^2 \\ &= \dfrac{n}{n+2}\theta^2 - \dfrac{n^2}{(n+1)^2}\theta^2 +\dfrac{1}{(n+1)^2}\theta^2 \\ &= \dfrac{2}{(n+1)(n+2)}\theta^2 \end{align*}\]
\[\begin{align*} MQE(\hat{\theta}_{n,3}) &= V[\hat{\theta}_{n, 3}] \\ &= V\bigg[\dfrac{n+1}{n}\hat{\theta}_{n, 2}\bigg] \\ &= \dfrac{(n+1)^2}{n^2}V[\hat{\theta}_{n, 2}] \\ &= \dfrac{(n+1)^2}{n^2}\cdot \dfrac{n}{(n+1)^2(n+2)}\theta^2 \\ &= \dfrac{1}{n(n+2)}\theta^2 \end{align*}\]
It is easy to check that for \(n\geq 1\):
\[ 3n \leq \dfrac{(n+1)(n+2)}{2} \leq n(n+2) \]
Thus
\[ MQE(\hat{\theta}_{n,3}) \leq MQE(\hat{\theta}_{n,2}) \leq MQE(\hat{\theta}_{n,1}) \]
and we conclude that \(\hat{\theta}_{n,3}\) is the best of the three estimators of \(\theta\).
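A simulation sketch (not part of the notes; \(\theta = 4\) and \(n = 20\) are arbitrary choices) to check this ordering empirically:
## Empirical MQE of the three estimators for U([0, theta]).
theta <- 4
n <- 20
est <- replicate(20000, {
  U <- runif(n, 0, theta)
  c(2*mean(U), max(U), (n+1)/n*max(U))
})
rowMeans((est - theta)^2)   # ordered: largest, middle, smallest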
Let \(\theta\) be an unknown parameter and \(\hat{\theta}_n\) an unbiased estimator of \(\theta\).
Cramer-Rao bound:
\[ V[\hat{\theta}_n] \geq \dfrac{1}{I_{n}(\theta)} \]
where
\[ I_n(\theta) = - \mathbb{E}\bigg[\dfrac{\partial^2 \log\mathcal{L}(x_1, ..., x_n; \theta)}{\partial \theta^2}\bigg] \]
is the Fisher Information.
This is valid under certain regularity conditions on the model.
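As an illustration (a standard computation, not carried out above), for the exponential model the log-likelihood computed earlier is \(n\log(\lambda) - \lambda\sum x_i\), so
\[ \dfrac{\partial^2 \log\mathcal{L}}{\partial \lambda^2} = -\dfrac{n}{\lambda^2}, \qquad I_n(\lambda) = \dfrac{n}{\lambda^2}, \]
and the Cramer-Rao bound says that any unbiased estimator of \(\lambda\) has variance at least \(\lambda^2/n\).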
One way to try to improve an estimator \(\hat{\theta}_n\) is to consider
\[ \lambda\cdot\hat{\theta}_n \]
with \(\lambda \in \mathbb{R}^*\) and find the \(\lambda\) that minimizes:
\[MQE(\lambda \hat{\theta}_n)\]
Another idea is to combine two estimators:
\[ \hat{\theta}_{n, a, b} = a\hat{\theta}_{n, 1} + b\hat{\theta}_{n, 2} \]
and minimize
\[ M_{ab} := MQE(\hat{\theta}_{n, a, b}) \]
If \(\hat{\theta}_{n, 1}\) and \(\hat{\theta}_{n, 2}\) are unbiased estimators and you want to keep \(\hat{\theta}_{n, a, b}\) unbiased, take \(b = 1-a\).
Let \(X_1, ..., X_n\) be i.i.d. \(\mathcal{P}(\lambda)\).
Compute an estimator for \(\lambda\) by the method of moments (using the mean, then the variance) and by maximum likelihood.
Solution
Since \(\mathbb{E}[X_1] = \lambda\), one estimator (method of moments, \(k=1\)) is:
\[ \hat{\lambda}_{n, 1} = \dfrac{1}{n}\sum_{i=1}^n X_i \]
and since \(V[X_1] = \lambda\), another estimator for \(\lambda\) is given by:
\[ \hat{\lambda}_{n, 2} = \dfrac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2 \]
\[\begin{align*} \mathcal{L}(x_1, ..., x_n; \lambda) &= \prod_{i=1}^n P(X_1 =x_i) \quad (\text{identically distributed}) \\ &= \prod_{i=1}^n \bigg( \dfrac{\lambda^{x_i}e^{-\lambda}}{x_i!}\bigg) \\ &= \lambda^{\sum x_i}e^{-n\lambda} \underbrace{\dfrac{1}{\prod_{i=1}^nx_i!}}_{\text{does not depend on }\lambda} \\ \end{align*}\]
since we need to maximize over \(\lambda\) and \(\log\) is an increasing function, we consider:
\[ g(\lambda) = \sum{x_i}\cdot\log(\lambda) - n\lambda \]
now, to find the critical points, we solve
\[ g'(\lambda) = 0 \iff \lambda = \dfrac{1}{n}\sum x_i \]
by analysing the sign of \(g'\) we can conclude that this point is a maximum and therefore:
\[ \hat{\lambda}_{n, 3} = \dfrac{1}{n}\sum_{i=1}^n X_i \]
is an estimator for \(\lambda\) (the same as the one obtained by the method of moments).
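A quick simulation sketch (not in the notes; true \(\lambda = 3\) is an arbitrary choice) comparing the two method-of-moments estimators:
## Poisson(lambda): both the empirical mean and the empirical variance estimate lambda.
lambda <- 3
X <- rpois(50000, lambda)
mean(X)                   # lambda_hat_1 (also the maximum likelihood estimator)
mean((X - mean(X))^2)     # lambda_hat_2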
Load the data.
<- as.matrix(read.table("data/data1.txt"))
data1
## Some comments about the data and data types.
In these two graphs we see that we can't determine a precise value for the parameters just by looking at them.
## Set freq = FALSE in order to obtain a correct graph.
## Be careful when drawing conclusions, since the function makes the
## bins automatically.
hist(data1, freq = FALSE)

## Create personalized breaks
data1_min <- min(data1)
data1_max <- max(data1)
l <- seq(data1_min, data1_max, length = 10)

hist(data1, freq = FALSE, breaks = l)
We guess a Gaussian distribution.
How to estimate the parameters of the distribution?
Compute the empirical mean and statistical variance.
mhu <- mean(data1)

# When you compute the variance, be careful and read the documentation;
# in this case we have:
# "The denominator n - 1 is used which gives an unbiased estimator of the (co)variance
# for i.i.d. observations"
sigma2 <- var(data1)

## You can also do it manually:
sigma2 <- 1/(nrow(data1)-1)*sum((data1 - mean(data1))^2)
Plot the theoretical density that we guessed.
For this graph we use the information from the histogram associated with the data.
## This generates a list with all the hist information.
H <- hist(data1, freq = FALSE, plot = FALSE) ## ignore warning message
H
## $breaks
## [1] 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20 2.25 2.30 2.35
##
## $counts
## [1] 6 20 34 94 135 226 169 148 106 44 13 4 1
##
## $density
## [1] 0.12 0.40 0.68 1.88 2.70 4.52 3.38 2.96 2.12 0.88 0.26 0.08 0.02
##
## $mids
## [1] 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125 2.175 2.225 2.275
## [13] 2.325
##
## $xname
## [1] "data1"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
Comparison graph
## Histogram info to use.
limits <- H$breaks
lmin <- limits[1]
lmax <- limits[length(limits)]

## Create our plot
## Create a vector that represents the [lmin, lmax] interval.
x <- seq(lmin, lmax, by = 0.01)

## Compute the density of the vector x associated to the parameters mhu and sigma2.
y <- dnorm(x, mhu, sqrt(sigma2))

## Set y_max as the max of the empirical density and the theoretical density.
y_max <- max(y, H$density)

## Make a plot for the comparison
hist(data1, freq = FALSE, xlim = c(lmin, lmax), ylim = c(0, y_max*1.01))
par(new = TRUE)
plot(x, y, type = 'l', col = 'red',
     xlim = c(lmin, lmax),
     ylim = c(0, y_max*1.01),
     xlab = "",
     ylab = "")
Test the goodness of fit.
ks.test(x, 'pnorm', mhu, sqrt(sigma2))
##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.25147, p-value = 0.0003644
## alternative hypothesis: two-sided
## Outputs:
## D: a notion of distance between the data and the guessed distribution.
## The p-value: used to reject or not the proposed model.
## Note: the test above was applied to the grid x rather than to data1 itself,
## which presumably explains the very small p-value.
Model values vs Real values
## Model values
#mhu
#sigma2
## Real values
real_mhu <- 2.0
real_sigma <- 0.1
cat('model_mhu:', mhu, 'vs', 'real_mhu:', real_mhu, '\n')
## model_mhu: 2.001483 vs real_mhu: 2
cat('model_sigma:', sqrt(sigma2), 'vs', 'real_sigma:', real_sigma)
## model_sigma: 0.09946156 vs real_sigma: 0.1
Load the data.
## Loading the data
<- as.matrix(read.table("data/data2.txt"))
A
= max(A) A_max
In this graph we can observe that there are potentially fewer observations in the last class; but if we look at the maximum of our dataset (7.099357), this last class exists because the maximum is slightly bigger than 7, so R automatically creates a new class just for this element.
## Make visualizations
hist(A, freq = FALSE)
To avoid this, we can use a strategy to create better classes.
If we have a dataset with \(n\) observations.
The number of classes is given by:
\[ K \approx 1 +3.22\times \log_{10}(n) \quad (\text{take the floor to get an integer value.}) \] The limits of the classes are given by:
\[ a_0 = \min(x_i) - \varepsilon \\ a_K = \max(x_i) + \varepsilon \]
The \(\varepsilon\) correction ensures that all the observations are contained in our partition.
The value of \(\varepsilon\) depends on the order of magnitude of the data (numerator of the equation) and on the precision of the software we are using (denominator of the equation).
\[ \varepsilon = \dfrac{\max(x_i) - \min(x_i)}{10^9} \] The other break points can be constructed as follows:
\[ a_i = a_0 + i\times amp \quad \text{for} \quad i\in \{1, ..., K \} \]
where
\[ amp = \dfrac{a_K -a_0}{K} \quad (\text{length of the classes}) \] Let's use this on our dataset.
## Compute correct class parameters:
n <- length(A)
K <- floor(1 + 3.22*log(n, 10))
A_min <- min(A)
A_max <- max(A)

epsilon <- (A_max - A_min)/10^9

## Apply the epsilon correction so that every observation falls inside the breaks.
amp <- ((A_max + epsilon) - (A_min - epsilon))/K
limits <- seq(A_min - epsilon, A_max + epsilon, by = amp)

hist(A, freq = FALSE, breaks = limits)
We guess a uniform distribution \(\mathcal{U}[a, b]\).
In this case we have to estimate 2 parameters. Previously we worked with the uniform distribution \(\mathcal{U}[0, \theta]\) and proved that \(2\overline{X}_n\) and \(\dfrac{n+1}{n}\max(X_i)\) are unbiased estimators and that \(\max(X_i)\) is a biased estimator for \(\theta\).
In order to find unbiased estimators for the parameters \(a, b\) of our distribution \(\mathcal{U}[a, b]\) we would have to redo all the computations.
In practice we can consider the naive estimators:
\[ a = \min(X_i) \\ b = \max(X_i) \]
or we can try to find estimators using the Method of Moments:
Let \(X \sim \mathcal{U}[a, b]\), so:
\[\mathbb{E}[X] = \dfrac{a+b}{2} \quad \text{and} \quad V[X] = \dfrac{(b-a)^2}{12}\] By applying the method of moments we get:
\[\dfrac{\hat{a}_n + \hat{b}_n}{2}= \overline{X}_n \quad \text{and} \quad \dfrac{(\hat{b}_n - \hat{a}_n)^2}{12} = \hat{\sigma}_n^2\] solving this system for \(\hat{a}_n\) and \(\hat{b}_n\) we obtain:
\[ \hat{a}_n = \dfrac{1}{2}\bigg(2\overline{X}_n - \sqrt{12\hat{\sigma}_n^2}\bigg) \\ \hat{b}_n = \dfrac{1}{2}\bigg(2\overline{X}_n + \sqrt{12\hat{\sigma}_n^2}\bigg) \]
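As a quick numerical sketch (not in the original notes; it assumes A is the data2 dataset loaded above), the method-of-moments estimators can be computed as:
## Method-of-moments estimators for U([a, b]).
xbar <- mean(A)
s2n  <- mean((A - mean(A))^2)    # the 1/n variance estimator
a_mm <- xbar - sqrt(12*s2n)/2
b_mm <- xbar + sqrt(12*s2n)/2
c(a_mm, b_mm)                    # compare with min(A) and max(A)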
For this case we are going to use the naive estimators, and again compare the model values vs the real values.
## Model values
a <- min(A)
b <- max(A)

## Real values
real_a <- 2.5
real_b <- 7.1
cat('model_a:', a, 'vs', 'real_a:', real_a, '\n')
## model_a: 2.503019 vs real_a: 2.5
cat('model_b:', b, 'vs', 'real_b:', real_b)
## model_b: 7.099357 vs real_b: 7.1
In progress
In progress
## Loading the data
<- as.matrix(read.table("data/data4.txt"))
A
barplot(table(A)/length(A))
## A binomial distribution can be approach by a Gaussian distribution.
## Methods of moments
## Don't forget to correct the estimation of np to be an integer.
## and then correct the ph value
= mean(A)
m = var(A)*(499/500)
s2
= 1-s2/m
ph = m/ph
nh
nh
## V1
## V1 10.0325
= 10
nh
= m/nh
ph
ph
## [1] 0.7988
In progress
## Loading the data
A <- as.matrix(read.table("data/data5.txt"))

barplot(table(A)/length(A))

## A binomial distribution can be approached by a Gaussian distribution.

## Method of moments
## Don't forget to correct the estimation of n so that it is an integer,
## and then correct the ph value.
m <- mean(A)
s2 <- var(A)*(499/500)

ph <- 1 - s2/m
nh <- m/ph

nh
## V1
## V1 7.38796

nh <- 7
ph <- m/nh
ph
## [1] 0.1342857

## When p is small it is very hard to get a correct estimation of n
## if you don't have enough observations.
## end day 2
Definition: Let \(\theta\) be an unknown parameter and \(\alpha \in [0, 1]\) fixed.
A confidence interval (there is no uniqueness) for \(\theta\), with confidence level \(100\times(1-\alpha) \%\), is a random interval \([A_n, B_n]\) such that:
\(A_n\) and \(B_n\) are functions of \((X_1, ..., X_n)\) i.i.d r.v whose distribution depends on \(\theta\).
\(P(\theta \in [A_n, B_n]) = 1- \alpha\)
Interpretation: it means that in \(100\times(1-\alpha) \%\) of the cases the true value of \(\theta\) will lie in \([A_n, B_n]\).
m1 <- c()
for (i in 1:50) {
  A <- rnorm(500, 2, 1)
  m1 <- c(m1, mean(A))
}
boxplot(A)

## Increase the number of observations
m2 <- c()
for (i in 1:50) {
  B <- rnorm(2000, 2, 1)
  m2 <- c(m2, mean(B))
}
boxplot(B)

## Change the variance
m3 <- c()
for (i in 1:50) {
  C <- rnorm(2000, 2, 10)
  m3 <- c(m3, mean(C))
}
boxplot(C)

## Create a data frame with all the observations
a <- c(m1, m2, m3)
r <- rep(c("A", "B", "C"), each = 50)

S <- data.frame(x = a, l = r)
In the following plot we can see that the quality of the estimation of the mean depends on the number of observations and on the variance of the sample.
boxplot(S$x ~ S$l)
Framework: \(X_1, ..., X_n\) are i.i.d. Gaussian r.v. \(\mathcal{N}(\mu, \sigma^2)\). We want to:
- construct a confidence interval for \(\mu\) (two sub cases: \(\sigma\) known and \(\sigma\) unknown);
- construct a confidence interval for \(\sigma^2\).
We start with the confidence interval for \(\mu\), in the case where \(\sigma\) is known.
Each time we want to construct a confidence interval, we first need a point estimator, and we need some information about this point estimator; the main one is its distribution, or an approximation of it.
We want to estimate the expectation \(\mu\): a classical point estimator \(\hat{\mu}_n\) is:
\[ \hat{\mu}_n = \overline{X_n} \]
Since the \(X_i\) are independent (this is needed) Gaussian r.v:
\[ \overline{X_n} \sim \mathcal{N}(\mu, \sigma^2/n) \]
we standardize:
\[ \sqrt{n}\cdot\dfrac{(\overline{X_n} -\mu)}{\sigma} \sim \mathcal{N}(0, 1) \]
Idea: We want to find \(A_n, B_n\), functions of \(X_1, ..., X_n\), such that:
\[P(\mu \in [A_n, B_n]) = 1 - \alpha\]
\(\overline{X}_n\) is an estimator for \(\mu\) (an unbiased and convergent one); a possible choice is:
\[ \begin{cases} A_n = \overline{X}_n - S_n \\ B_n = \overline{X}_n + T_n \end{cases} \]
So, we want to determine \(S_n\) and \(T_n\) such that:
\[\begin{align*} P(\mu \in [\overline{X}_n - S_n, \overline{X}_n+ T_n]) &= 1- \alpha\\ \iff P(\overline{X}_n - S_ n \leq \mu \leq \overline{X}_n+ T_n) &= 1-\alpha \\ \iff P(- T_ n \leq \overline{X}_n - \mu \leq S_n) &= 1-\alpha \\ \iff P\bigg(\frac{\sqrt{n}}{\sigma} \cdot (-T_ n) \leq \sqrt{n}\cdot\dfrac{(\overline{X_n} -\mu)}{\sigma} \leq \frac{\sqrt{n}}{\sigma} \cdot S_n\bigg) &= 1-\alpha \end{align*}\]
since \(\sigma\) is known, we can treat
\[ \begin{cases} \frac{\sqrt{n}}{\sigma} \cdot (-T_ n)\\ \frac{\sqrt{n}}{\sigma} \cdot (S_ n) \end{cases} \]
as being constants.
So we have to find two constants with just one equation; this means there are infinitely many solutions (which is why the confidence interval is not unique).
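A common way out (the standard symmetric choice, not derived in the notes above) is to take \(\frac{\sqrt{n}}{\sigma}T_n = \frac{\sqrt{n}}{\sigma}S_n = q_{1-\alpha/2}\), the \(1-\alpha/2\) quantile of \(\mathcal{N}(0, 1)\), which gives the interval
\[ \bigg[\overline{X}_n - q_{1-\alpha/2}\dfrac{\sigma}{\sqrt{n}},\ \overline{X}_n + q_{1-\alpha/2}\dfrac{\sigma}{\sqrt{n}}\bigg]. \]
A quick R sketch of its coverage (simulated \(\mathcal{N}(2, 1)\) samples and \(\alpha = 0.05\) are arbitrary choices):
## Check that the symmetric interval covers mu about 95% of the time.
mu <- 2; sigma <- 1; n <- 500; alpha <- 0.05
q <- qnorm(1 - alpha/2)
covered <- replicate(10000, {
  X <- rnorm(n, mu, sigma)
  abs(mean(X) - mu) <= q*sigma/sqrt(n)
})
mean(covered)   # should be close to 1 - alpha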
x <- seq(0.1, 10, by = 0.01)

y1 <- dchisq(x, 1)
y2 <- dchisq(x, 2)
y3 <- dchisq(x, 3)
y4 <- dchisq(x, 4)
y6 <- dchisq(x, 6)
y9 <- dchisq(x, 9)

y_max <- max(y1, y2, y3, y4, y6, y9)
plot(x, y1, type = 'l', col = 'yellow', xlim = c(0, 10), ylim = c(0, 0.5))
par(new = TRUE)
plot(x, y2, type = 'l', col = 'green', xlim = c(0, 10), ylim = c(0, 0.5))
par(new = TRUE)
plot(x, y3, type = 'l', col = 'black', xlim = c(0, 10), ylim = c(0, 0.5))
par(new = TRUE)
plot(x, y4, type = 'l', col = 'blue', xlim = c(0, 10), ylim = c(0, 0.5))
par(new = TRUE)
plot(x, y6, type = 'l', col = 'pink', xlim = c(0, 10), ylim = c(0, 0.5))
par(new = TRUE)
plot(x, y9, type = 'l', col = 'red', xlim = c(0, 10), ylim = c(0, 0.5))
x <- seq(-10, 10, by = 0.01)

y1 <- dt(x, 1)
y2 <- dt(x, 2)
y3 <- dt(x, 3)
y4 <- dt(x, 4)
y6 <- dt(x, 6)
y20 <- dt(x, 20)

y_norm <- dnorm(x)

y_max <- max(y1, y2, y3, y4, y6, y20, y_norm)
plot(x, y1, type = 'l', col = 'yellow', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y2, type = 'l', col = 'green', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y3, type = 'l', col = 'black', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y4, type = 'l', col = 'blue', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y6, type = 'l', col = 'pink', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y20, type = 'l', col = 'red', xlim = c(-10, 10), ylim = c(0, y_max))
par(new = TRUE)
plot(x, y_norm, type = 'l', col = 'black', xlim = c(-10, 10), ylim = c(0, y_max))
## A t-Student distribution with a large enough number of degrees of freedom approaches a Gaussian distribution
## We compute N observations of \overline{X}_n.
## Number of r.v.
n <- 5

## Number of observations
N <- 50000

## Parameter of the exponential distribution.
lambda <- 0.7

## Create a matrix
A <- matrix(rexp(n*N, lambda), ncol = n)

## Compute the mean for each row
M <- apply(A, 1, mean)

## Visualization for M
hist(M, freq = F, breaks = 20)

###
Mo <- sort(M)
y <- 1:N/N

max_Mo <- max(Mo)

x <- seq(0, max_Mo, by = 0.01)
mu <- 1/lambda
v <- 1/(lambda^2)/n

yt <- pnorm(x, mu, sqrt(v))
plot(Mo, y, type = 'l', xlim = c(0, max_Mo), ylim = c(0,1))
par(new = TRUE)
plot(x, yt, type = 'l', xlim = c(0, max_Mo), ylim = c(0,1), col = 'red')