A note on the Hard Kumaraswamy distribution
This is an excerpt from my master's thesis, titled "Semi-supervised morphological reinflection using rectified random variables".
In this story we describe the stretch-and-rectify principle applied to the Kumaraswamy distribution [1]. This technique was proposed by Louizos et al. (2017) [2], who rectified samples from a Gumbel-sigmoid distribution.
The Kumaraswamy distribution
The Kumaraswamy distribution (Kumaraswamy, 1980) is a doubly-bounded continuous probability distribution defined on the interval (0, 1). Its shape is controlled by two parameters a ∈ ℝ>0 and b ∈ ℝ>0. If a = 1 or b = 1 (or both), the Kumaraswamy distribution is equivalent to the Beta distribution. For equivalent parameter settings, the Kumaraswamy distribution closely mimics the Beta distribution (but with higher entropy). Its density function is given below:

f(k; a, b) = a b k^(a−1) (1 − k^a)^(b−1), for k ∈ (0, 1)
where a and b are the shape parameters mentioned above. Its cumulative distribution function (cdf) can be derived by integration, as shown below:

F(k; a, b) = 1 − (1 − k^a)^b
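As a quick sanity check, the density and cdf above can be written as plain Python functions. This is a minimal sketch; the function names are mine:

```python
def kuma_pdf(k, a, b):
    """Kumaraswamy density f(k; a, b) = a*b*k^(a-1)*(1 - k^a)^(b-1) on (0, 1)."""
    return a * b * k ** (a - 1) * (1 - k ** a) ** (b - 1)

def kuma_cdf(k, a, b):
    """Kumaraswamy cdf F(k; a, b) = 1 - (1 - k^a)^b."""
    return 1 - (1 - k ** a) ** b

# With a = b = 1 the distribution reduces to the uniform on (0, 1):
# the density is constant at 1 and the cdf is the identity (up to float error).
print(kuma_pdf(0.3, 1, 1))
print(kuma_cdf(0.3, 1, 1))
```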
Sampling from Kumaraswamy distribution
We note that the cumulative distribution function takes values in [0, 1]. Using the cdf shown above, we can derive its inverse as follows:

F⁻¹(z; a, b) = (1 − (1 − z)^(1/b))^(1/a)
where z ∈ [0, 1] denotes the value of the cumulative distribution function. Therefore, to obtain a Kumaraswamy sample, we first draw a sample from a uniform distribution on [0, 1] and transform it using the inverse cdf. With this formulation, we can reparameterize expectations as described in Nalisnick and Smyth (2016) [3]. The sampling procedure is:

u ∼ U(0, 1),  k = (1 − (1 − u)^(1/b))^(1/a)
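The inverse-cdf (reparameterized) sampler can be sketched in a few lines of Python; the function names here are my own:

```python
import random

def kuma_icdf(u, a, b):
    """Inverse cdf: F^-1(u; a, b) = (1 - (1 - u)^(1/b))^(1/a)."""
    return (1 - (1 - u) ** (1 / b)) ** (1 / a)

def kuma_sample(a, b, rng=random):
    """Draw u ~ U(0, 1) and push it through the inverse cdf."""
    u = rng.random()
    return kuma_icdf(u, a, b)

k = kuma_sample(0.5, 0.5)
assert 0.0 <= k <= 1.0
```

Because the sample is a deterministic, differentiable transform of uniform noise, gradients can flow through it with respect to a and b (the reparameterization trick).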
Rectified Kumaraswamy distribution
Let k denote a base random variable sampled from Kuma(a, b); its domain is the open interval (0, 1). We stretch k to the open interval (l, r), where l < 0 and r > 1, via s = l + (r − l)k, and denote the stretched variable by s. Its cumulative distribution function is shown below:

Fₛ(s) = F((s − l)/(r − l); a, b) = 1 − (1 − ((s − l)/(r − l))^a)^b
Finally, s is rectified to the domain [0, 1] by passing it through a hard-sigmoid function, i.e., h = min(1, max(0, s)). We denote the rectified variable by h. Following Bastings et al. (2019) [1], we refer to the stretched-and-rectified distribution as the Hard Kumaraswamy distribution. Since s is continuous on (l, r), sampling any exact value, including s = 0, has probability 0. However, sampling h = 0 is equivalent to sampling any s ∈ (l, 0]. Similarly, sampling h = 1 is equivalent to sampling any s ∈ [1, r), i.e.

P(h = 0) = Fₛ(0)  and  P(h = 1) = 1 − Fₛ(1)
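Putting stretch and rectify together, sampling from the Hard Kumaraswamy can be sketched as follows (a minimal illustration; the function name and the default bounds l = −0.1, r = 1.1 are my own choices):

```python
import random

def hard_kuma_sample(a, b, l=-0.1, r=1.1, rng=random):
    """Sample k ~ Kuma(a, b), stretch it to (l, r), then rectify with a hard sigmoid."""
    u = rng.random()
    k = (1 - (1 - u) ** (1 / b)) ** (1 / a)  # inverse-cdf Kumaraswamy sample
    s = l + (r - l) * k                      # stretch to (l, r)
    return min(1.0, max(0.0, s))             # rectify: point masses at 0 and 1

# A noticeable fraction of samples lands exactly on 0 or exactly on 1.
samples = [hard_kuma_sample(0.5, 0.5) for _ in range(10_000)]
print(sum(x == 0.0 for x in samples), "exact zeros")
print(sum(x == 1.0 for x in samples), "exact ones")
```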
The figure above illustrates the stretch-and-rectify process. The shaded regions show the probability of sampling h = 0 (left) and h = 1 (right). The rectified variable h has a distribution consisting of point masses at 0 and 1, and a stretched distribution truncated to (0, 1):

fₕ(h) = π₀ δ(h) + π₁ δ(h − 1) + π𝒸 fₜ(h)
where fₕ(h) is the probability density function of h, δ(·) denotes the Dirac delta function and fₜ is the truncated density, and

π₀ = Fₛ(0),  π₁ = 1 − Fₛ(1),  π𝒸 = 1 − π₀ − π₁
where π₀ and π₁ denote the probabilities of sampling the discrete outcomes {0} and {1} respectively, and π𝒸 denotes the probability of sampling a continuous outcome. The truncated density fₜ(t) = fₛ(t)/π𝒸 for t ∈ (0, 1) is needed because fₛ(s) is normalized over (l, r), not over (0, 1). We can see that fₕ(h) has the following properties:
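Under this decomposition, the mixture weights follow directly from the stretched cdf. A sketch in Python (my own names; l = −0.1, r = 1.1 are illustrative stretch bounds):

```python
def kuma_cdf(k, a, b):
    """Kumaraswamy cdf F(k; a, b) = 1 - (1 - k^a)^b."""
    return 1 - (1 - k ** a) ** b

def hard_kuma_weights(a, b, l=-0.1, r=1.1):
    """Return (pi_0, pi_1, pi_c): P(h=0), P(h=1) and P(h in (0,1))."""
    pi_0 = kuma_cdf((0 - l) / (r - l), a, b)       # mass stretched below 0
    pi_1 = 1 - kuma_cdf((1 - l) / (r - l), a, b)   # mass stretched above 1
    return pi_0, pi_1, 1 - pi_0 - pi_1

pi_0, pi_1, pi_c = hard_kuma_weights(0.5, 0.5)
assert abs(pi_0 + pi_1 + pi_c - 1.0) < 1e-12
```

Note how π₀ and π₁ shrink or grow as a, b, l and r change, which is exactly the flexibility property listed below.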
- Support-consistency: It has support [0, 1] and includes the discrete outcomes {0} and {1}.
- Flexibility: The parameters of the distribution can be set so as to control the probability of the outcomes {0} and {1}.
- Differentiability: The distribution is differentiable almost everywhere with respect to its parameters, allowing us to take advantage of off-the-shelf (stochastic) gradient ascent techniques.
References
- Bastings, J., Aziz, W., and Titov, I. (2019). Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160.
- Louizos, C., Welling, M., and Kingma, D. P. (2017). Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.
- Nalisnick, E. and Smyth, P. (2016). Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197.