3 Conditioning: the plot thickens
We have seen that probability distributions (or, equivalently, cumulative distribution functions, or even quantile functions) are a fundamental tool for modeling, describing and quantifying uncertainty about one-dimensional numerical phenomena. Unfortunately, the univariate setting is too restrictive: a wealth of applications in Statistics and Machine Learning (and many other scientific fields) deal with a scenario where we have uncertainty “up to covariates,” meaning that the value of a given variable of interest (the response variable, possibly a vector) is partially determined by the values of other variables (the covariates or regressors), the latter being known to the researcher (and perhaps even under her control).
Regression, in its myriad guises, is one of the methods most widely employed to model the dependence structure between a response and covariates. Before moving on to regression proper, let us first introduce a precise notion of conditioning. In the statement below, \(\mathbb I\) denotes the indicator function.
Theorem 3.1 (Regular conditional distribution) Let \(Y\) be a real-valued random variable and let \(X\) be a \({\mathrm{D}_X}\)-dimensional random vector. Then there exists a function \((B,x)\mapsto \pi(B,x)\) defined for Borel sets \(B\subseteq\mathbb{R}\) and \(x\in \mathbb{R}^{\mathrm{D}_X},\) satisfying the following:
1. for each fixed \(x\in\mathbb{R}^{\mathrm{D}_X},\) the function \(B\mapsto \pi(B,x)\) is a Borel probability measure on \(\mathbb{R}\);
2. for each fixed Borel subset \(B\subseteq\mathbb{R},\) the function \(x\mapsto \pi(B,x)\) is measurable and integrable;
3. for all Borel subsets \(B\subseteq\mathbb{R}\) and \(A\subseteq\mathbb{R}^{\mathrm{D}_X},\) it holds that \[\begin{equation} \mathbf{P}[Y\in B, X\in A] = \mathbf{E}\big[\pi(B,X)\cdot\mathbb{I}[X\in A]\big]. \tag{3.1} \end{equation}\]
Moreover, the function \(\pi\) above is essentially unique in the sense that, if \(\pi_0\) is another function satisfying items 1 to 3, then there exists a Borel subset \(A^*\subseteq\mathbb{R}^{\mathrm{D}_X}\) with \(\mathbf{P}[X\in A^*]=1\) such that \(\pi(B,x) = \pi_0(B,x)\) for all Borel subsets \(B\subseteq\mathbb{R}\) and all \(x\in A^*.\)
Remark. Theorem 3.1 still holds when \(Y\) is a random vector. We chose to state it in the simpler setting of a scalar \(Y\) since this is the case that will be most prevalent throughout the text. Also, the “essential uniqueness” of \(\pi\) tells us in particular that the definition of \(\pi(\cdot,x)\) for \(x\) outside the support of \(X\) is arbitrary.
The function \(\pi\) appearing in Theorem 3.1 is called the regular conditional distribution of \(Y\) given \(X\). It is customary to write \(\pi(B,x) =: \mathbf{P}[Y\in B\,|\,X=x]\) and to call this right-hand side the conditional probability of the event \([Y\in B]\) given \(X=x\). It is also convenient to use the notation \(\pi(B,X) =: \mathbf{P}[Y\in B\,|\, X]\) (this is a random variable), which is called the conditional probability of the event \([Y\in B]\) given \(X\). The difference is subtle but relevant. Notice also that there is an abuse of notation in writing \(\mathbf{P}[Y\in B\,|\,X=x]\): if \(\mathbf{P}[X=x]>0\) for some \(x,\) the elementary notion of conditional probability would lead us to the equality \[\begin{equation} \mathbf{P}[Y\in B\,|\,X=x] = \frac{\mathbf{P}[Y\in B, X=x]}{\mathbf{P}[X=x]}, \end{equation}\] but in general (think of an absolutely continuous \(X,\) for which \(\mathbf{P}[X=x]=0\) for every \(x\)) \(\pi(B,x)\) is not expressible as such a ratio. When \(X\) is discrete, however, this is precisely the case, as the following example illustrates.
Example 3.1 Let \(Y\) be a random variable and \(X\) a \(\mathbf{P}\)-discrete random vector of dimension \({\mathrm{D}_X}.\) Then the function \(\pi\) defined, for Borel subsets \(B\subseteq\mathbb{R}\) and \(x\in\mathbb{R}^{\mathrm{D}_X}\) with \(\mathbf{P}[X=x]>0,\) by the relation \[\begin{equation} \pi(B,x)\mathbf{P}[X=x] := \mathbf{P}[Y\in B, X=x] \end{equation}\] (and arbitrarily, say as a fixed probability measure, for \(x\) with \(\mathbf{P}[X=x]=0\)) is a regular conditional distribution of \(Y\) given \(X\) (it is an exercise to check that this assertion is true!). \(\blacksquare\)
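To make Example 3.1 concrete, here is a minimal numerical sketch in Python (assuming numpy; the joint pmf below is an invented toy example): it tabulates \(\pi(\{y\},x) = \mathbf{P}[Y=y, X=x]/\mathbf{P}[X=x]\) for a small discrete pair and checks identity (3.1) directly.

```python
import numpy as np

# A toy joint pmf for (Y, X), both taking values in {0, 1, 2}; the numbers
# below are an arbitrary choice, made up purely for illustration.
joint = np.array([[0.10, 0.05, 0.15],   # rows index y, columns index x
                  [0.20, 0.10, 0.05],
                  [0.05, 0.20, 0.10]])
assert np.isclose(joint.sum(), 1.0)

p_X = joint.sum(axis=0)          # marginal pmf of X (column sums)
pi = joint / p_X                 # pi({y}, x) = P[Y = y, X = x] / P[X = x]

# Check identity (3.1) with B = {0, 1} and A = {1, 2}:
B, A = [0, 1], [1, 2]
lhs = joint[np.ix_(B, A)].sum()                   # P[Y in B, X in A]
rhs = sum(pi[B, x].sum() * p_X[x] for x in A)     # E[ pi(B, X) * 1[X in A] ]
print(lhs, rhs)                  # both equal 0.35
```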
Example 3.2 Let \(Y\) be a scalar random variable, and let \(X\) be a \({\mathrm{D}_X}\)-dimensional random vector. Assume that \((Y,X)\) is \(\mathbf{P}\)-absolutely continuous, meaning that \((Y,X)\) has a joint density function \(f_{Y,X}.\)⁷ Then the function \(\pi\) defined, for Borel subsets \(B\subseteq\mathbb{R}\) and \(x\in\mathbb{R}^{\mathrm{D}_X},\) by \[\begin{equation} \pi(B,x) := \int_{B} f_{Y|X}(y|x)\,\mathrm{d}y, \end{equation}\] is a regular conditional distribution of \(Y\) given \(X.\) In the above, \(f_X\) is the marginal density function of \(X,\) i.e., \(f_X(x) = \int_\mathbb{R}f_{Y,X}(y,x)\,\mathrm{d}y,\) and \(f_{Y|X}\) is the conditional density function of \(Y\) given \(X\), defined by the relation \[\begin{equation} f_{Y|X}(y|x)f_X(x) := f_{Y,X}(y,x) \end{equation}\] for \(x\in \mathbb{R}^{\mathrm{D}_X}\) with \(f_X(x)>0\) and \(y\in\mathbb{R}\) (for \(x\) with \(f_X(x)=0,\) the measure \(\pi(\cdot,x)\) can again be chosen arbitrarily). \(\blacksquare\)
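As a sanity check on Example 3.2, the following sketch (assuming scipy is available; the bivariate normal model and its parameters are arbitrary illustrative choices) computes \(\pi(B,x)\) by integrating the conditional density of a standard bivariate Gaussian pair and compares it with the well-known closed form \(Y\,|\,X=x \sim \mathrm{N}(\rho x,\, 1-\rho^2)\).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

rho, x = 0.6, 0.8   # correlation and conditioning value (arbitrary choices)

# (Y, X) standard bivariate normal with correlation rho.
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

# f_{Y|X}(y | x) = f_{Y,X}(y, x) / f_X(x), with f_X the N(0, 1) marginal.
f_cond = lambda y: mvn.pdf([y, x]) / stats.norm.pdf(x)

# pi(B, x) for B = (-inf, 0], obtained by integrating the conditional density...
pi_B_x, _ = quad(f_cond, -np.inf, 0.0)

# ...versus the closed form: Y | X = x is N(rho * x, 1 - rho^2).
closed_form = stats.norm.cdf(0.0, loc=rho * x, scale=np.sqrt(1 - rho**2))
print(pi_B_x, closed_form)   # both approximately 0.2743
```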
It is important to notice that the equality in (3.1) can be rewritten as \[\begin{equation} \mathbf{P}[Y\in B, X\in A] = \int_A \mathbf{P}[Y\in B\,|\,X=x]\,F_X(\mathrm{d}x), \tag{3.2} \end{equation}\] where “\(\int \cdot \,F_X(\mathrm{d}x)\)” denotes the Lebesgue-Stieltjes integral (\(F_X\) being the cumulative distribution function of \(X\)). Of course, whenever \(X\) is \(\mathbf{P}\)-absolutely continuous with density function \(f_X,\) this integral reduces to \[\begin{equation} \mathbf{P}[Y\in B, X\in A] = \int_A \mathbf{P}[Y\in B\,|\,X=x]f_X(x)\,\mathrm{d}x. \end{equation}\] Similarly, if \(X\) is discrete with probability mass function \(p_X,\) we have \[\begin{equation} \mathbf{P}[Y\in B, X\in A] = \sum_{\{x\in A\,:\,p_X(x)>0\}}\mathbf{P}[Y\in B\,|\,X=x]\, p_X(x), \end{equation}\] which, in view of Example 3.1, is just the classical Law of Total Probability.
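The absolutely continuous version of (3.2) can also be checked by simulation. Below is a minimal Monte Carlo sketch (the model \(X\sim\mathrm{Uniform}(0,1)\) with \(Y\,|\,X=x \sim \mathrm{Bernoulli}(x)\) is chosen purely for illustration), for which \(\mathbf{P}[Y=1, X\in A] = \int_A x\,\mathrm{d}x\) exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Illustrative model: X ~ Uniform(0, 1) and Y | X = x ~ Bernoulli(x),
# so P[Y = 1 | X = x] = x and f_X(x) = 1 on (0, 1).
X = rng.uniform(size=n)
Y = rng.uniform(size=n) < X            # Bernoulli(X) draws

# Take A = (0.2, 0.5); then (3.2) gives
# P[Y = 1, X in A] = int_A x dx = (0.5^2 - 0.2^2) / 2 = 0.105.
a, b = 0.2, 0.5
mc_estimate = np.mean(Y & (X > a) & (X < b))
exact = (b**2 - a**2) / 2
print(mc_estimate, exact)              # approximately 0.105 vs 0.105
```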
⁷ The essentially unique non-negative function satisfying the identity \(F_{Y,X}(y,x) = \int_{-\infty}^y\int_{-\infty}^x f_{Y,X}(u,v)\,\mathrm{d}v\,\mathrm{d}u\) for all \(y\in\mathbb{R}\) and \(x\in \mathbb{R}^{\mathrm{D}_X}.\) Here, \(\int_{-\infty}^x \mathrm{d}v\) means \(\int_{-\infty}^{x_1}\cdots \int_{-\infty}^{x_{\mathrm{D}_X}} \mathrm{d}v_{\mathrm{D}_X}\cdots \mathrm{d}v_1.\)