# Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

## Motivation/Gaps Filled by Causal Forest

1. An impediment to exploring heterogeneous treatment effects is the fear that researchers will iteratively search for subgroups with large treatment effects and then report only the results for subgroups with extreme effects.
2. With high-dimensional covariates, kernel methods and nearest-neighbor methods break down, whereas random forests can still perform well in these problems.

## Treatment Effects

## Tree

### Comparison with nearest neighbors

The advantage of tree-based methods is that their leaves can be narrower along the directions where the signal changes quickly and wider along the other directions. Nearest-neighbor methods only care about distance, which gives the same weight to every dimension.

### Causal Tree and Causal Forest

## Honest Trees and Forests

1. Double-sample tree (a minimal sketch follows this list)
    1. Divide the subsample into two halves, I and J. Use all of the data from J, plus only the X or W observations from I (no Y), to choose the splits.
    2. Use only the data from I to estimate the leaf-wise responses.
    3. For a regression tree, splits are placed by minimizing the MSE of the predictions, subject to the constraint that each leaf contains k or more I-sample observations.
    4. For a causal tree, splits are chosen by maximizing the variance of $\hat{\tau}(X_i)$ for $i \in J$, and each leaf must contain k or more I-sample observations of each treatment class.
2. Propensity tree
    1. Use only the treatment assignment indicator W and the features X to place the splits. Each leaf must have k or more observations of each treatment class.
    2. Estimate $\tau(x)$ using the Y observations in the leaf containing x (difference in means between treated and control units in the leaf).
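To make the double-sample procedure concrete, here is a minimal sketch on simulated data. It is not the paper's implementation: scikit-learn's `DecisionTreeRegressor` stands in for the splitting step (so the splits use the ordinary regression criterion rather than the causal criterion above), and the data-generating process, the leaf-size parameter, and the `leaf_tau` helper are illustrative assumptions.

```python
# Minimal sketch of honest ("double-sample") estimation.
# Assumptions: binary randomized treatment W, and a standard regression tree
# (scikit-learn) as a stand-in for the causal splitting rule.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Simulated data: tau(x) = 1 if x0 > 0, else 0 (illustrative DGP).
n, d = 4000, 10
X = rng.normal(size=(n, d))
W = rng.integers(0, 2, size=n)
tau = (X[:, 0] > 0).astype(float)
Y = X[:, 1] + W * tau + rng.normal(size=n)

# Split the subsample into I (estimation) and J (structure).
idx = rng.permutation(n)
I, J = idx[: n // 2], idx[n // 2:]

# Place splits using only the J sample.  Here we regress Y on X within J;
# the paper's causal tree instead maximizes the variance of the estimated
# treatment effect across candidate splits.
structure = DecisionTreeRegressor(min_samples_leaf=50, random_state=0)
structure.fit(X[J], Y[J])

# Honest leaf-wise estimates: within each leaf, use only I-sample outcomes,
# and estimate tau as the treated-minus-control difference in means.
leaf_I = structure.apply(X[I])

def leaf_tau(leaf_ids, y, w, leaf):
    mask = leaf_ids == leaf
    treated, control = y[mask & (w == 1)], y[mask & (w == 0)]
    if len(treated) == 0 or len(control) == 0:
        return np.nan  # the paper's splitting rule would forbid such leaves
    return treated.mean() - control.mean()

# Predict tau at test points by dropping them down the J-grown tree and
# reading off the honest I-sample estimate in their leaf.
X_test = np.zeros((2, d))
X_test[0, 0], X_test[1, 0] = 1.0, -1.0   # true tau = 1 and 0
for x, leaf in zip(X_test, structure.apply(X_test)):
    print(x[0], leaf_tau(leaf_I, Y[I], W[I], leaf))
```

The point the sketch preserves is honesty: the J sample determines the partition, while only I-sample outcomes enter the leaf-wise treatment-effect estimates.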
## Asymptotic Theorem for Random Forests

In a random forest, the quantity we would like to estimate is the conditional mean
$$\mu(x)=E[Y\mid X=x]$$

### Bias of a Regression Tree

Define the diameter $diam(L(x))$ of a leaf $L(x)$ as the length of the longest segment contained inside $L(x)$ (I think this is measured in Euclidean distance), and $diam_j(L(x))$ as the length of the longest such segment that is parallel to the $j$-th axis. It is shown that $diam(L(x))\xrightarrow{p}0$, so the shrinking diameter of a leaf can be translated into a bound on the bias of a single regression tree.

### Jackknife

The basic idea of the jackknife is to omit one observation and recompute the estimate using the remaining observations. If a statistic satisfies certain conditions, the ordinary jackknife and the infinitesimal jackknife give consistent estimates of its asymptotic variance and asymptotic bias.

#### Notation

$X_1,\dots,X_n$ are random variables with distribution $F$, and $\theta=T(F)$, where $T$ is a real-valued functional defined on some appropriate set of probability distributions that includes $F$ and a sufficiently rich set of distributions near $F$. We estimate $\theta$ by $\hat{\theta}=T(\hat{F})$, where $\hat{F}$ is the empirical distribution, which assigns probability $1/n$ to each $X_i$. Moreover, define $\hat{\theta}_{(i)}$ as the estimate of $\theta$ obtained by omitting $X_i$, and the pseudo-values $p_{(i)}=n\hat{\theta}-(n-1)\hat{\theta}_{(i)}$.

#### Ordinary jackknife (OJK)

The OJK estimate of $\theta$ is
$$p_{(\cdot)}=\frac{1}{n}\sum\limits_{i=1}^n p_{(i)}=n\hat{\theta}-(n-1)\hat{\theta}_{(\cdot)}$$
and the OJK variance estimate of $\hat{\theta}$ is
$$\hat{V}=\frac{1}{n(n-1)}\sum\limits_{i=1}^n\bigl(p_{(i)}-p_{(\cdot)}\bigr)^2=\frac{n-1}{n}\sum\limits_{i=1}^n\bigl(\hat{\theta}_{(i)}-\hat{\theta}_{(\cdot)}\bigr)^2$$
The estimated bias is
$$\hat{B}=\hat{\theta}-p_{(\cdot)}=(n-1)\bigl(\hat{\theta}_{(\cdot)}-\hat{\theta}\bigr)$$
(A small numerical sketch of these formulas appears at the end of these notes.)

#### Infinitesimal jackknife (IJK)

Instead of omitting an observation entirely, the infinitesimal jackknife gives it slightly less weight than the others and considers the limiting case as the deficiency in the weight approaches zero. Assign weights $w_1,\dots,w_n$ to $X_1,\dots,X_n$, so that $T$ can be written as a function of $2n$ variables:
$$T(X_1,\dots,X_n;w_1,\dots,w_n)$$
If all weights are $\frac{1}{n}$, we have
$$\hat{\theta}=T(\hat{F})=T(X_1,\dots,X_n;1/n,\dots,1/n)$$
If we reduce the weight $w_i$ by $\varepsilon$, we have
$$\hat{\theta}_{(i)}(\varepsilon)=T(X_1,\dots,X_n;1/n,\dots,1/n-\varepsilon,\dots,1/n)$$
Letting $\varepsilon\to 0$, the IJK variance estimate $\hat{V}(0)$ of $\hat{\theta}$ satisfies
$$n\hat{V}(0)=\frac{1}{n}\sum_i \hat{D}_i^2$$
the IJK estimated bias is
$$n\hat{B}(0)=\frac{1}{2n}\sum_i \hat{D}_{ii}$$
and the IJK estimate of $\theta$ is
$$\hat{\theta}-\hat{B}(0)$$
where $\hat{D}_i=\frac{\partial T}{\partial w_i}\Big|_{x_j=X_j,\,w_j=\frac{1}{n},\,j=1,\dots,n}$ and $\hat{D}_{ii}=\frac{\partial^2 T}{\partial w_i^2}\Big|_{x_j=X_j,\,w_j=\frac{1}{n},\,j=1,\dots,n}$.

---

Scratch expressions from the bias-bound calculation, where $c_j(x)$ counts the splits along the $j$-th coordinate on the path to $L(x)$, $s$ is the subsample size, $k$ the minimum leaf size, $\alpha$ the regularity parameter, and $\pi/d$ the minimum probability that a split is placed along any given feature:
$$E[c_j(x)]=\frac{\pi}{d}\,\frac{\log(s/(2k-1))}{\log(\alpha^{-1})}$$
$$-\frac{\eta^2}{2}\,\frac{\pi^2}{d^2}\,\frac{\log(s/(2k-1))}{\log(\alpha^{-1})}=-\frac{\eta^2}{2}\,\frac{\log(s/(2k-1))}{\pi^{-2}\,d^2\,\log(\alpha^{-1})}$$
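As a concrete check of the OJK formulas in these notes, here is a minimal numerical sketch for a simple nonlinear statistic, $T=\exp(\bar{X})$; the statistic, distribution, and sample size are illustrative choices, not from the paper.

```python
# Minimal numerical sketch of the ordinary jackknife (OJK) formulas above,
# applied to an illustrative nonlinear statistic T = exp(sample mean).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.5, scale=1.0, size=200)

def T(x):
    return np.exp(np.mean(x))

n = len(X)
theta_hat = T(X)

# Leave-one-out estimates theta_(i) and pseudo-values p_(i).
theta_loo = np.array([T(np.delete(X, i)) for i in range(n)])
p = n * theta_hat - (n - 1) * theta_loo

theta_loo_mean = theta_loo.mean()
p_mean = p.mean()                                                 # OJK estimate of theta
V_hat = (n - 1) / n * np.sum((theta_loo - theta_loo_mean) ** 2)   # OJK variance estimate
B_hat = (n - 1) * (theta_loo_mean - theta_hat)                    # OJK bias estimate

print("theta_hat:", theta_hat)
print("OJK estimate p_(.):", p_mean)
print("OJK variance:", V_hat, "std. error:", np.sqrt(V_hat))
print("OJK bias estimate:", B_hat)
```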