Generative Model of Inflected Words

Akash Raj
In this project, we aim to model inflected word forms. We study two generative models: the Character VAE, a single-sequence variational autoencoder with only continuous latent variables for sequential data, and the MSVAE (Multi-space Variational Auto-encoder), a generative model with both continuous and discrete latent variables. The code for the experiments can be found in this repo: akruti.

Introduction

Languages use affixes to convey information such as stress, intonation and grammatical meaning. These entities are a subset of a more general class of meaningful sub-parts of words (morphemes). Morphology refers to the rules and processes through which morphemes are combined. Word inflection is the process through which a word is modified to express various grammatical categories (e.g., tense, stress, gender, mood). It is crucial for an NLP system to learn and understand such processes: it has been shown that explicitly modelling morphology aids tasks such as machine translation, parsing and word embeddings (Dyer et al. [1]; Cotterell et al. [2]). Computational approaches based on word frequency fail to model unseen word forms, while heuristic-based approaches require hand-crafted, language-specific rules.

Can we model the distribution of inflected words?

We wish to investigate generative models that describe the distribution of inflected word forms. A vanilla variational autoencoder can be used to encode a sequence of characters; while it is effective at representing continuous latent variables, it cannot encode discrete morphological features. We therefore study a generative model that encodes the word sequence in a continuous latent space and its morphological features in (approximately) discrete latent variables.

Generative process of the Character Variational Auto-encoder.

Character VAE

An efficient way to handle continuous latent variables in neural models is to employ variational autoencoders. Let x denote a sequence of characters (which forms an inflected word) and z denote the latent variable. The VAE learns a generative model of the probability of the observed data x given the latent variable z, and simultaneously uses a recognition model to estimate the latent variable for a particular data point. The graphical model is shown above.

Similar to Bowman et al. (2015) [3], who developed a VAE for sequences of tokens, we build a variational autoencoder for sequences of characters. Let q(·) denote the approximate posterior: the recognition model q(z|x) parameterizes an approximate posterior over the latent space. We use a standard Gaussian prior over the latent variables, i.e., p(z) = N(z|0, I), and a Gaussian variational family. The model is trained by maximizing the variational lower bound on the marginal log likelihood of the data:
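(The bound below is the standard VAE evidence lower bound, written in LaTeX consistently with the notation used here.)

```latex
% Evidence lower bound (ELBO) maximized by the Character VAE
\mathcal{L}(\theta, \phi; x) =
    \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
    - \mathrm{KL}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```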

where θ and φ represent the trainable network and variational parameters respectively. Next, we discuss the architecture of the Character VAE model.

Character VAE architecture. <SOS> is the start-of-sequence character.

For the encoder, we use bidirectional gated recurrent units (GRUs). u = [h→; h←] is the hidden representation of the input sequence x, where h→ and h← are the final hidden states of the encoder in the forward and backward directions respectively. μ(u) and σ(u) are multi-layered perceptrons representing the mean and standard deviation of the approximate posterior, q(z|x) = N(z|μ(u), σ(u)). At each time step, the decoder uses (a) the latent representation of x, (b) the character predicted in the previous time step and (c) its current hidden state to predict the most likely next character in the sequence.
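A minimal PyTorch sketch of this encoder is given below, assuming single-layer GRUs; the hyperparameter names (vocab_size, embed_dim, hidden_dim, latent_dim) are illustrative rather than taken from the akruti repo, and single linear layers stand in for the μ(u) and σ(u) MLPs.

```python
import torch
import torch.nn as nn

class CharVAEEncoder(nn.Module):
    """Bidirectional GRU encoder producing the parameters of q(z|x)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.mu = nn.Linear(2 * hidden_dim, latent_dim)         # stands in for the MLP mu(u)
        self.log_sigma = nn.Linear(2 * hidden_dim, latent_dim)  # stands in for the MLP sigma(u)

    def forward(self, x):
        # x: [batch, seq_len] tensor of character ids
        _, h = self.gru(self.embed(x))            # h: [2, batch, hidden_dim]
        u = torch.cat([h[0], h[1]], dim=-1)       # u = [h_forward; h_backward]
        mu, log_sigma = self.mu(u), self.log_sigma(u)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + log_sigma.exp() * torch.randn_like(mu)
        return z, mu, log_sigma
```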

t-SNE plot of the character VAE latent space. Each data point is an inflected word. Words with the same lemma are colored similarly.

Results

The figure shows the t-SNE plot of the latent space of the character VAE. Each point on the plot represents an inflected word, and words having the same lemma are colored similarly. We can observe that some words sharing the same lemma lie close to each other. However, there are no pronounced clusters based on the lemma, since the latent space is not conditioned in any manner.

Rationale for using Relaxed Variables

For an inflected word, the neural network should learn whether or not a morphological feature is present. For example, the word “played” should have the feature tense=past turned on. Suppose each morphological feature is modelled using a Bernoulli distribution, and assume there are M ‘independent’ morphological features yₘ. (Note: even though some features, e.g., tense=present and tense=past, are inherently dependent, we initially assume independence and later induce dependency using MADE: Masked Autoencoder for Density Estimation.) Assume each feature has a Bernoulli prior with parameter α. The distribution over y is:
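(Written in LaTeX from the assumptions above: M independent features, each with a Bernoulli prior of parameter α.)

```latex
% Prior over the M (assumed independent) binary morphological features
p(y) = \prod_{m=1}^{M} \mathrm{Bern}(y_m \mid \alpha)
     = \prod_{m=1}^{M} \alpha^{y_m} (1 - \alpha)^{1 - y_m}
```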

Here, x is the observed sequence with morphology tag y, and λ denotes the variational parameters of the approximate mean-field Bernoulli posterior q(y|x), which has parameters bᵢ. gθ(·) is a multi-layered perceptron parameterized by θ, with x as the input. Using the mean-field assumption, the learning objective L can be written as follows:
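(A reconstruction consistent with the definitions above: likelihood pθ(x|y), mean-field Bernoulli posterior with parameters bᵢ, Bernoulli prior with parameter α.)

```latex
% Mean-field lower bound for the Bernoulli-feature model
\mathcal{L}(\theta, \lambda; x) =
    \mathbb{E}_{q_\lambda(y \mid x)}\left[ \log p_\theta(x \mid y) \right]
    - \sum_{i=1}^{M} \mathrm{KL}\left( \mathrm{Bern}(y_i \mid b_i) \,\|\, \mathrm{Bern}(y_i \mid \alpha) \right),
\qquad
q_\lambda(y \mid x) = \prod_{i=1}^{M} \mathrm{Bern}(y_i \mid b_i)
```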

In the above equation, the second term, the KL divergence, can be computed analytically, and the gradient of the first term with respect to θ can be estimated with a Monte Carlo sample. The gradient with respect to λ, however, is intractable and requires the use of REINFORCE. With REINFORCE, the choice of baseline can greatly affect the speed of convergence, since the estimator suffers from high variance. An alternative is to apply a continuous relaxation to the categorical variables and use the straight-through estimator. Unfortunately, this method also has drawbacks: the straight-through estimator yields biased gradients, since it ignores the Heaviside function in the likelihood during gradient evaluation, and the Concrete distribution cannot model the likelihood of the discrete outcomes. We instead use the Hard Kuma distribution, described in this story [4], to model the distribution of morphological features. This distribution satisfies the desiderata: it is a differentiable alternative to discrete variables that enables unbiased gradient estimates.
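For concreteness, the sketch below shows how a reparameterized Hard Kuma sample can be drawn: a Kumaraswamy sample is stretched to an interval (l, r) covering [0, 1] and then rectified, so the outcomes {0} and {1} receive non-zero probability mass. The stretch limits l = -0.1 and r = 1.1 are illustrative defaults, not values taken from the original code.

```python
import torch

def sample_hard_kuma(a, b, l=-0.1, r=1.1):
    """Reparameterized sample from a stretched-and-rectified Kumaraswamy
    ("Hard Kuma") variable. a, b are positive shape-parameter tensors;
    l < 0 and r > 1 are assumed stretch limits."""
    u = torch.rand_like(a).clamp(1e-6, 1 - 1e-6)       # uniform noise
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)    # Kumaraswamy inverse CDF
    t = l + (r - l) * k                                 # stretch to (l, r)
    return torch.clamp(t, min=0.0, max=1.0)             # rectify onto [0, 1]
```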

Multi-space Variational Auto-Encoder

The Multi-space Variational Auto-encoder (MSVAE) is a generative model that uses both discrete and continuous latent variables. MSVAE can be seen as a combination of a generative model with auxiliary variables and a sequence VAE.

Graphical model for (supervised) MSVAE model. The grayed variables indicate that the respective labels are observed.

Notation: a sequence of characters (for example, playing) is denoted by x. y is an M-dimensional binary vector; yᵢ indicates whether the i-th feature is present or absent, i.e., yᵢ ∈ {0, 1}. φ represents the variational parameters for the continuous latent space z, which has a standard normal prior, i.e., p(z) = N(z|0, I). θ represents the trainable parameters of the neural network. λ represents the variational parameters for the approximate posterior over the morphological features. N denotes the number of word sequences. The figure above shows the graphical model for the supervised (y is observed) MSVAE. We can define the generative model pθ(x, y, z) as follows:
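(A reconstruction consistent with the notation above and the description that follows.)

```latex
% MSVAE generative model
p_\theta(x, y, z) = p(z)\, p(y)\, p_\theta(x \mid y, z),
\quad
p(y) = \prod_{m=1}^{M} \mathrm{HK}(y_m \mid a_0, b_0),
\quad
p_\theta(x \mid y, z) = \prod_{t=1}^{|x|} f_\theta(x_t \mid x_{<t}, y, z)
```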

where HK(a₀, b₀) is the Hard Kuma prior and fθ(·) is a recurrent neural network parameterized by θ. We can train the MSVAE by maximizing the following objective:
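(A plausible reconstruction: an ELBO with one KL term per latent space, analogous to the Character VAE bound but with a Hard Kuma posterior over y.)

```latex
% MSVAE training objective
\mathcal{L}(\theta, \phi, \lambda; x) =
    \mathbb{E}_{q_\phi(z \mid x)\, q_\lambda(y \mid x)}\left[ \log p_\theta(x \mid y, z) \right]
    - \mathrm{KL}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
    - \mathrm{KL}\left( q_\lambda(y \mid x) \,\|\, p(y) \right)
```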

Results

We use an architecture similar to the one described in the Character VAE section, trained on the Turkish task 3 data of the SIGMORPHON 2016 dataset [5]. At decoding time, to predict the most probable character at each step, we use three types of information: (a) the current decoder state, (b) the morphology tag embeddings, and (c) the latent variable. We do not marginalize over the latent variable and instead use the mean vector as the latent representation.
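A sketch of one decoding step that combines these three sources of information is shown below; the dimensions and module names are hypothetical, and combining by concatenation is one reasonable choice rather than necessarily the repo's exact implementation.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step using (a) the decoder state, (b) the morphology
    tag embeddings and (c) the continuous latent variable z."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, tag_dim=32, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim + tag_dim + latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim + tag_dim + latent_dim, vocab_size)

    def forward(self, prev_char, tag_emb, z, h):
        # prev_char: [batch] ids of the character predicted at the previous step
        inp = torch.cat([self.embed(prev_char), tag_emb, z], dim=-1)
        h = self.cell(inp, h)                                    # (a) new decoder state
        logits = self.out(torch.cat([h, tag_emb, z], dim=-1))    # combine (a), (b), (c)
        return logits, h
```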

t-SNE plots of the latent space. Each data point is an inflected word from the Turkish task 3 dataset. (Left) Words with the same lemma are coloured the same. (Right) Words with the same part of speech (pos) tag are coloured the same.

In the left plot, words having the same lemma are colored similarly. We can see that words with the same lemma are roughly clustered together, although the clusters are not conspicuous. On the right, we visualize the same (continuous) latent space colored according to the part-of-speech tag: pos=V and pos=N represent verb and noun respectively. The latent space is not clustered according to the part-of-speech tag; this is expected, as we want the continuous space to encode only information about the lemma.

On distributions over the Morphology tag

Mean precision and recall on the Turkish (task 3) test data.

At inference time, the morphological features cannot be given a completely discrete treatment. Two candidate distributions for modelling them are the Hard Kuma distribution (a hard, rectified distribution) and the Concrete distribution (a continuous relaxation of discrete random variables). We compute precision and recall for each prediction using the transformed values, and use MADE to induce dependencies between the morphological feature distributions.
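A short sketch of how mean precision and recall could be computed from the relaxed feature values is shown below; thresholding at 0.5 and averaging per word are assumptions made for illustration.

```python
import torch

def mean_precision_recall(pred_feats, gold_feats, threshold=0.5):
    """Mean per-word precision/recall of predicted morphological features.
    pred_feats: [N, M] relaxed values in [0, 1]; gold_feats: [N, M] binary."""
    pred = (pred_feats >= threshold).float()
    tp = (pred * gold_feats).sum(dim=1)                 # true positives per word
    precision = tp / pred.sum(dim=1).clamp(min=1e-8)
    recall = tp / gold_feats.sum(dim=1).clamp(min=1e-8)
    return precision.mean().item(), recall.mean().item()
```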

Hard Kuma performs better than the Concrete distribution:

1. Hard Kuma can model the actual data, i.e., the discrete outcomes {0} and {1}, whereas the Concrete distribution cannot.
2. Training is possibly more stable due to unbiased gradient estimates.

An example word (koylar) from the Turkish dataset showing the morphological feature predictions. x-axis shows the different morphological features. The first row (actual) shows the target features.

Conclusion

We investigated two generative models to describe the distribution of inflected word forms. The character VAE serves as a building block for efficiently representing continuous latent variables; however, its latent space does not capture clearly meaningful structure, since we do not condition it in any way. The MSVAE, on the other hand, models inflected word forms with two latent representations: a continuous space that captures the character sequence and an approximately discrete latent space that represents the morphological features. Following Zhou and Neubig (2017) [6], we give the morphological features a relaxed treatment; however, instead of the Concrete distribution with the biased straight-through estimator, we use the Hard Kuma distribution. The latent space of the MSVAE can be roughly clustered according to the lemma of the inflected word, but we cannot completely disentangle the morphology tag information from the lemma.

References

  1. Dyer, C., Muresan, S., and Resnik, P. (2008). Generalizing word lattice translation. Technical report, Maryland Univ College Park Inst for Advanced Computer Studies.
  2. Cotterell, R., Schütze, H., and Eisner, J. (2016). Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1651–1660.
  3. Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  4. https://medium.com/nerd-for-tech/a-note-on-hard-kumaraswamy-distribution-b74278dc6877
  5. Cotterell, R., Kirov, C., Sylak-Glassman, J., Yarowsky, D., Eisner, J., and Hulden, M. (2016). The SIGMORPHON 2016 shared task: morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.
  6. Zhou, C. and Neubig, G. (2017). Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. arXiv preprint arXiv:1704.01691.
  7. Code: https://github.com/akashrajkn/akruti
