Neural Machine Translation – NMT

Neural Machine Translation – NMT is an approach to machine translation that uses a large neural network. It departs from phrase-based statistical translation approaches that use separately engineered subcomponents. Google and Microsoft translation services now use NMT. Google uses Google Neural Machine Translation – GNMT in preference to its previous statistical methods. Microsoft uses a similar technology for its speech translations (including Microsoft Translator live and Skype Translator). An Open Source Neural Machine Translation System – OpenNMT, has been released by the Harvard NLP group.

NMT models use deep learning and representation learning. They require only a fraction of the memory needed by traditional Statistical Machine Translation – SMT models. Furthermore, unlike conventional translation systems, all parts of the neural translation model are trained jointly (end-to-end) to maximize the translation performance.

The neural network typically uses a bidirectional Recurrent Neural Network – RNN, known as an encoder, to encode a source sentence for a second RNN, known as a decoder, which predicts words in the target language.

How Neural Machine Translation – NMT works

Neural Machine Translation is a relatively new paradigm, first explored toward the end of 2014. Before this, machine translation operated on a statistical model whereby machine learning depends on a database of previous translations, called translation memories.

While NMT still trains on translation memories as Statistical Machine Translation does, it uses deep learning—and possibly a higher volume of training data—to build an artificial neural network.

Marciano uses a game of chess to illustrate how Statistical Machine Translation – SMT works. In a chess program, there is a limited universe in which a limited number of moves can be made. The program simply calculates all possible moves to find the best one. Similarly, the machine learning that takes place in an SMT system works by comparing n-grams—contiguous groupings of up to six words in a sentence—from a source sentence to those that occur in the target language to find correlations.
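As a toy illustration (not taken from any actual SMT toolkit), extracting the n-grams of a sentence might look like this in Python:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word groupings of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# An SMT system compares n-grams (typically up to length 6) from the
# source sentence against n-grams seen in the target-language data.
print(ngrams("the cat sat on the mat".split(), 3))
```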

On the other hand, Neural Machine Translation could be described as “raising” a neural system, as Marciano explains. It’s like playing the piano: When you make a mistake, you back up, try again, and repeat until you have it down. Neural MT systems try to find their way through neural networks in the same way.

In this sense, Neural MT is much more effective than the limited, and often inaccurate, n-gram-based model. For one thing, NMT systems run on powerful GPUs (graphics processing units) rather than the CPUs (central processing units) that SMT systems use. And although Neural MT takes longer to translate a sentence due to the wealth of data involved—just as SMT systems took much longer than older rule-based systems—Statistical MT presents big problems with languages where rules operate outside of the six-word unit.

Of course, NMT does still run into a few issues: for example, when translating highly technical content. But source material containing unknown technical abbreviations would not be translated well by any machine translation system, Neural MT included. For language directions that don’t have much training data—for instance, German to Korean—deep learning opens up the possibility of using indirect, or “pivoted,” training data from the source material of another language.

What is the major difference between Neural Machine Translation – NMT and Statistical Machine Translation – SMT?

When you present training material to the deep learning algorithms, you don’t necessarily tell them what to look for. You let the system find patterns itself, such as contextual clues around the source sentence. The specifics of the process, however, remain mysterious in many ways.

Neural Machine Translation – NMT and Big Data

Neural networks were first used in image and speech recognition programs, by training systems with supervised data—such as an image of a dog with metadata attached. In reading its metadata, the system would know to identify the content of the image as a dog.

Then, the system would try to find the best way through the neural network to make that link, backing up and finding better pathways if it finds the wrong answer, and eventually developing a neural pathway that results in the correct answer. This is the pathway that would be emphasized going forward.

In speech recognition, for a given recorded sentence in a given language, there is generally only one correct transcription for deep learning to find—making the training pretty straightforward. Translation involves “noisier” training material and is a more complex task.

Yet deep learning and big data, as Marciano describes, allow us to cast off our limited abilities to perceive and analyze the world. Because big data yields so much information, we’re able to identify complicated patterns, and associations among these patterns, in ways that are beyond human ability to recognize.

But, as a recent article about Google’s deep learning algorithms shows, it’s difficult to build a mental picture of the NMT process. Much of the processing is done in “hidden layers” of complicated data, meaning it’s hard to see how the neural network makes its decisions.

This is why we can only present the training material, let the algorithms do their thing, and tweak the training material if the translations aren’t accurate. Lionbridge uses GeoFluent to clean up errors in the Neural MT output, too.

Using quality evaluation methods, such as BLEU, becomes a gray area. If a Neural MT system chooses a translation that’s different from the reference translation for an obscure reason, it may be penalized for its vocabulary choice—even if it’s perfectly correct.

The future of Artificial Neural Networks – ANN and communication

Although it’s tricky to debug a neural network and understand its decision-making, the improvement in fluency we’re seeing from Neural MT is encouraging enough for it to be a strong consideration. So, are any other machine translation vendors providing Neural MT now?

The short answer is no, not yet in production-ready form. There are three Neural MT systems that you can try right now on the internet: Google Translate (which can be integrated into any given computer-aided translation [CAT] tool), Microsoft Translator, and Systran Pure Neural Machine Translation. However, we are still a little bit ahead of the curve in terms of production-ready systems that have complete training tool sets. Look out for announcements about upcoming NMT systems this year from Microsoft, Google, Systran, Baidu, Facebook, Amazon, and others.

The Neural MT rollout will happen first on those language directions that show the biggest improvement over the SMT systems. At Lionbridge, we plan to evaluate available neural translation systems to see how these tools fit into our localization processes and meet our customers’ needs before rolling out ourselves.

But one thing is for certain: Neural MT is a game changer. Considering how young this model is, improvements in translation have been enormous compared to the progress of the last 10 years. The difference between traditional translation and machine translation will continue to narrow—and we’re intent on finding out just how far this can go.

Tips on Building Neural Machine Translation Systems

Source: GitHub

by Graham Neubig (Nara Institute of Science and Technology/Carnegie Mellon University)

This tutorial will explain some practical tips about how to train a neural machine translation system. It is partly based around examples using the lamtram toolkit. Note that this will not cover the theory behind NMT in detail, nor is it a survey meant to cover all the work on neural MT, but it will show you how to use lamtram, and also demonstrate some things that you have to do in order to make a system that actually works well (focusing on ones that are implemented in my toolkit).

This tutorial will assume that you have already installed lamtram (and the DyNet backend library that it depends on) on Linux or Mac. Then, use git to pull this tutorial and the corresponding data.

The data in the data/ directory is Japanese-English data that I have prepared by doing some language-specific preprocessing (tokenization, lowercasing, etc.). Enter the nmt-tips directory and make a link to the directory in which you installed lamtram:

Machine Translation

Machine translation is a method for translating from a source sequence F with words f_1, ..., f_J to a target sequence E with words e_1, ..., e_I. This usually means that we translate from a sentence in a source language (e.g. Japanese) to a sentence in a target language (e.g. English), but machine translation can be used for other applications as well.

In recent years, the most prominent method is Statistical Machine Translation (SMT; Brown et al. (1993)), which builds a probabilistic model of the target sequence given the source sequence P(E|F). This probabilistic model is trained using a large set of training data containing pairs of source and target sequences.

A good resource on machine translation in general, including a number of more traditional (non-Neural) methods is Koehn (2009)’s book “Statistical Machine Translation”.

Neural Machine Translation (NMT) and Encoder-decoder Models

Neural machine translation is a particular variety of SMT that learns the probabilistic model P(E|F) using neural networks. I will assume that readers already know basic concepts about neural networks: what a neural network is, particularly what a recurrent neural network is, and how they are trained. If you don’t, a good tutorial is Goldberg (2015)’s primer on neural networks for natural language processing.

Encoder-decoder models (Kalchbrenner & Blunsom 2013, Sutskever et al. 2014) are the simplest version of NMT. The idea is relatively simple: we read in the words of the source sentence one-by-one using a recurrent neural network, then predict the words in the target sentence.

First, we encode the source sentence. To do so, we convert each source word into a fixed-length word representation, parameterized by Φ_wr:

Then, we map this into a hidden state using a recurrent neural network, parameterized by Φ_frnn. We assume h_0 is a zero vector.

It is also common to generate h_j using bidirectional neural networks, where we run one forward RNN that reads from left-to-right, and another backward RNN that reads from right to left, then concatenate the representations for each word. This is the default setting in lamtram (specified by --encoder_types "for|rev").

Next, we decode to generate the target sentence, one word at a time. This is done by initializing the first hidden state of the decoder g_0 to be equal to the last hidden state of the encoder: g_0 = h_J. Next, we generate a word in the output by performing a softmax over the target vocabulary to predict the probability of each word in the output, parameterized by Φ_esm:

We then pick the word that has highest probability:

We then update the hidden state with this predicted value:

This process is continued until a special “end of sentence” symbol is chosen for e'_i.
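The decoding loop described above can be sketched in Python as follows. This is a toy stand-in, not lamtram code: the matrices W_step and W_out and the state update are simplified placeholders (a real decoder would also feed the embedding of the predicted word back into the RNN).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(h_J, W_step, W_out, eos_id, max_len=20):
    """Greedy decoding: g_0 = h_J, then repeatedly pick the argmax word
    from a softmax over the target vocabulary until end-of-sentence."""
    g = h_J                       # initialize decoder state with the encoder's last state
    output = []
    for _ in range(max_len):
        p = softmax(W_out @ g)    # probability of each target-vocabulary word
        word = int(np.argmax(p))  # pick the highest-probability word
        if word == eos_id:        # stop at the "end of sentence" symbol
            break
        output.append(word)
        g = np.tanh(W_step @ g)   # simplified state update
    return output
```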

Training NMT Models with Maximum Likelihood

Note that the various elements of the model explained in the previous section have parameters Φ. These need to be learned in order to generate high-quality translations. The standard way of training neural networks is by using maximum likelihood. This is done by maximizing the log likelihood of the training data:

or equivalently minimizing the negative log likelihood:

The standard way we do this minimization is through stochastic gradient descent (SGD), where we calculate the gradient of the negative log probability for a single example <F,E>

then update the parameters based on an update rule:

The most standard update rule simply subtracts the gradient of the negative log likelihood multiplied by a learning rate γ
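In code, this update rule is just a scaled subtraction; a minimal sketch (the function and parameter names here are illustrative, not lamtram's):

```python
def sgd_update(params, grads, learning_rate=0.1):
    """Standard SGD: theta <- theta - gamma * dL/dtheta, where L is the
    negative log likelihood of the example and gamma is the learning rate."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Each parameter moves a small step against its gradient.
print(sgd_update([1.0, -2.0], [0.5, 0.5]))
```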

Let’s try to do this with lamtram. First make a directory to hold the model:

then train the model with the following commands:

Here, model_type indicates that we want to train an encoder-decoder, and train_src and train_trg indicate the source and target training data files. trainer specifies that we will use the standard update rule, and learning_rate specifies γ. rate_decay will be explained later. epochs is the number of passes over the training data, and model_out is the place where the model is written out to.

If training is going well, we will be able to see the following log output:

ppl is reporting perplexity on the training set, which is equal to the exponential of the per-word negative log probability:

For this perplexity, lower is better, so if it’s decreasing we’re learning something.
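Concretely, perplexity can be computed from per-word log probabilities like this (a self-contained sketch, not lamtram's implementation):

```python
import math

def perplexity(word_log_probs):
    """Exponential of the average per-word negative log probability."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# A model that assigns probability 1/4 to every word has perplexity 4:
# on average it is as uncertain as a uniform choice among 4 words.
print(perplexity([math.log(0.25)] * 10))
```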

One thing you’ll notice is that training is really slow… There are 10,000 sentences in our small training corpus, but you’re probably tired of waiting already. The next section will explain how we speed things up, so let’s let it run for a while, and then when you get tired of waiting hit CTRL+C to stop training.

Speeding Up Training


One powerful tool to speed up training of neural networks is mini-batching. The idea behind mini-batching is that instead of calculating the gradient for a single example <F,E>

we calculate the gradients for multiple examples at one time

then perform the update of the model’s parameters using this aggregated gradient. This has several advantages:

  • Gradient updates take time, so if we have N sentences in our minibatch, we can perform N times fewer gradient updates.
  • More importantly, by sticking sentences together in a batch, we can share some of the calculations between them. For example, where non-minibatched neural networks might multiply the hidden vector h_i by a weight matrix W, when we are mini-batching we can concatenate the h_i from different sentences into a single matrix H and do a big matrix-matrix multiplication W * H, which is much more efficient.
  • Also, using mini-batches can make the updates to the parameters more stable, as information from multiple sentences is considered at one time.

On the other hand, large minibatches do have disadvantages:

  • If our mini-batch sizes are too big, sometimes we may run out of memory by trying to store too many calculated values in memory at once.
  • While the calculation for each sentence becomes faster, the total number of updates is smaller, so sometimes training can be slower than when not using mini-batches.
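The shared-computation point can be verified with a small NumPy sketch: stacking the hidden vectors into a matrix gives the same results as the per-sentence matrix-vector products, but in a single call.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))                  # shared weight matrix
hs = [rng.standard_normal(100) for _ in range(64)]   # hidden vectors from 64 sentences

# Non-minibatched: one matrix-vector product per sentence.
separate = [W @ h for h in hs]

# Minibatched: concatenate the vectors into a matrix H, one matrix-matrix product.
H = np.stack(hs, axis=1)   # shape (100, 64)
batched = W @ H

assert np.allclose(np.stack(separate, axis=1), batched)
print("batched and per-sentence results match")
```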

Anyway, let’s try this in lamtram by adding the --minibatch_size NUM_WORDS option, where NUM_WORDS is the number of words included in each mini-batch. If we set NUM_WORDS to be equal to 256, and re-run the previous command, we get the following log:

Looking at the w/s (words per second) on the right side of the log, we can see that we’re processing data 4 times faster than before, nice! Let’s still hit CTRL+C though, as we’ll speed up training even more in the next section.

Other Update Rules

In addition to the standard SGD_UPDATE rule listed above, there are a myriad of additional ways to update the parameters, including “SGD With Momentum”, “Adagrad”, “Adadelta”, “RMSProp”, “Adam”, and many others. Explaining these in detail is beyond the scope of this tutorial, but it suffices to say that these will more quickly find a good place in parameter space than the standard method above. My current favorite optimization method is “Adam” (Kingma & Ba 2014), which can be run by setting --trainer adam. We’ll also have to change the initial learning rate to --learning_rate 0.001, as a learning rate of 0.1 is too big when using Adam.

Try re-running the following command:

You’ll probably find that the perplexity drops significantly faster than when using the standard SGD update (after the first epoch, I had a perplexity of 287 with standard SGD, and 233 with Adam).

GPUs (Advanced)

If you have access to a machine with a GPU, this can make training much faster, particularly when training NMT systems with large vocabularies or large hidden layer sizes using minibatches. Running lamtram on GPUs is simple: you just need to compile the DyNet library using the CUDA backend, then link lamtram to it appropriately. However, in our case here we are using a small network and small training set, so training on CPU is sufficient for now.


Attention

Basic Concept

One of the major advances in NMT has been the introduction of attention (Bahdanau et al. 2015). The basic idea behind attention is that when we want to generate a particular target word e_i, that we will want to focus on a particular source word f_j, or a couple words. In order to express this, attention calculates a “context vector” c_i that is used as input to the softmax in addition to the decoder state:

This context vector is defined as the sum of the input sequence vectors h_j, weighted by an attention vector a_i as follows:

There are a number of ways to calculate the attention vector a_i (described in detail below), but all follow a basic pattern of calculating an attention score α_{i,j} for every word that is a function of g_i and h_j:

and then use a softmax function to convert the score vector α_i into an attention vector a_i that sums to one.
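Putting the last two equations together, a dot-product version of this computation can be sketched as follows (toy code, not lamtram's internals):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(g_i, hs):
    """Score each source state h_j against the decoder state g_i,
    softmax the scores into weights a_i (which sum to one), and
    return the weighted sum of the h_j as the context vector c_i."""
    alpha = np.array([g_i @ h_j for h_j in hs])   # attention scores alpha_{i,j}
    a = softmax(alpha)                            # attention vector a_i
    c = sum(w * h for w, h in zip(a, hs))         # context vector c_i
    return c, a
```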

If you want to try to train an attentional model with lamtram, just change all mentions of encdec above to encatt (for encoder/attentional), and an attentional model will be trained for you. For example, we can run the following command:

If you compare the perplexities between these two methods you may see some difference in the perplexity results after 10 epochs. When I ran these, I got a perplexity of 19 for the encoder-decoder, and a perplexity of 11 for the attentional model, demonstrating that it’s a bit easier for the attentional model to model the training corpus correctly.

Types of Attention (Advanced)

There are several ways to calculate the attention scores α_{i,j}, such as those investigated by Luong et al. (2015). The following ones are implemented in lamtram, and can be changed using the --attention_type TYPE option as noted below.

  • Dot Product: Calculate the dot product α_{i,j} = g_i * transpose(wf_j) (--attention_type dot).
  • Bilinear: A bilinear model that puts a parameterized transform Φ_bilin between the two vectors α_{i,j} = g_i * Φ_bilin * transpose(wf_j) (--attention_type bilin).
  • Multi-layer Perceptron: Input the two vectors into a multi-layer perceptron with a hidden layer of size LAYERNODES, α_{i,j} = MLP([g_i, wf_j]; Φ_mlp) (--attention_type mlp:LAYERNODES)

In practice, I’ve found that dot product tends to work pretty well, and because of this it’s the default setting in lamtram. However, the multi-layer perceptron also performs well in some cases, so sometimes it’s worth trying.

In addition, Luong et al. (2015) introduced a method called “attention feeding,” which uses the context vector c_{i-1} of the previous state as input to the decoder neural network. This is enabled by default using the --attention_feed true option in lamtram, as it seems to help somewhat consistently.

Testing NMT Systems

Now that we have a couple translation models, and know how good they are doing on the training set (according to perplexity), we will want to test to see how well they will do on data that is not used in training. We do this by measuring accuracy on some test data, conveniently prepared in data/test.ja and data/test.en.

The first way we can measure accuracy is by calculating the perplexity on this held-out data. This will measure the accuracy of the NMT systems probability estimates P(E|F), and see how well they generalize to new data. We can do this for the encoder-decoder model using the following command:

and likewise for the attentional model by replacing encdec with encatt (twice) in the command above. Note here that we’re actually getting perplexities that are much worse for the test set than we did on the training set (I got train/test perplexities of 19/118 for the encdec model and 11/112 for the encatt model). This is for two reasons: lack of handling of words that don’t occur in the training set, and overfitting of the training set. I’ll discuss these later.

Next, let’s try to actually generate translations of the input using the following command (likewise for the attentional model by swapping encdec into encatt):

We can then measure the accuracy of this model using a measure called BLEU score (Papineni et al. 2002), which measures the similarity between the translation generated by the model and a reference translation created by a human (data/test.en):

This gave me a BLEU score of 1.76 for encdec and 2.17 for encatt, which shows that we’re getting something. But generally we need a BLEU score of at least 15 or so to have something remotely readable, so we’re going to have to try harder.

Thresholding Unknown Words

The first problem that we have to tackle is that currently the model has no way of handling unknown words that don’t exist in the training data. The most common way of fixing this problem is by replacing some of the words in the training data with a special <unk> symbol, which will also be used when we observe an unknown word in the testing data. For example, we can replace all words that appear only once in the training corpus by performing the following commands.
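The exact preprocessing commands aren't reproduced here, but the substance of the replacement is simple; in Python it might look like this (an illustrative sketch, not the tutorial's actual scripts):

```python
from collections import Counter

def replace_singletons(corpus):
    """Replace every word that appears only once in the corpus with <unk>."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return [[w if counts[w] > 1 else "<unk>" for w in sentence]
            for sentence in corpus]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
# "cat" and "dog" occur once each, so both become <unk>.
print(replace_singletons(corpus))
```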

Then we can re-train the attentional model using this new data:

This greatly helps our accuracy on the test set: when I measured the perplexity and BLEU score on the test set, this gave me 58 and 3.32 respectively, a bit better than before! It also speeds up training quite a bit because it reduces the size of the vocabulary.

Using a Development Set

The second problem, over-fitting, can be fixed somewhat by using a development set. The development set is a set of data separate from the training and test sets that we use to measure how well the model is generalizing during training. There are two simple ways to help use this set to prevent overfitting.

Early Stopping

The first way we can prevent overfitting is to regularly measure the accuracy on the development data and stop training when we get the model that has the best accuracy on this data set. This is called “early stopping” and is used in most neural network models. Running this in lamtram is easy: just specify the dev_src and dev_trg options as follows. (You may also want to increase the number of training epochs to 20 or so to really witness how much the model overfits in later stages of training.)

You’ll notice that now after every pass over the training data, we’re measuring the perplexity on the development set, and the model is written out only when the perplexity reaches its best value yet. In my case, the development perplexity reaches its best value on the 8th iteration and then starts getting worse. By stopping training on the 8th iteration, the perplexity improved slightly to 56, but this didn’t make a big difference in BLEU.

Rate Decay

Another trick that is often used is “rate decay”, which reduces the learning rate γ every time the perplexity gets worse on the development set. This causes the model to update the parameters a bit more conservatively, which has the effect of controlling overfitting. We can enable rate decay by setting the rate_decay parameter to 0.5 (which will halve the learning rate every time the development perplexity gets worse). This prolongs training a little bit, so let’s also set the number of epochs to 15, just to make sure that training has run its course.

In my case, the rate was decayed on every epoch after the 8th. This didn’t result in an improvement on this particular data set, but in many cases this rate decay can be quite helpful.

Using an External Lexicon

One way to further help out neural MT systems is to incorporate an external lexicon indicating mappings between words (and their probabilities).

Training the Lexicon

First, we need to create the lexicon. This can be done using a word alignment tool that finds the correspondences between words in the source and target sentences. Here we will use fast_align because it is simple to use and fast, but other word alignment tools such as GIZA++ or Nile might give you better results.

First, let’s download and build fast_align:

Then, we can run fast_align on the training data to build a lexicon, and use the script to convert it into a format that lamtram can use.

Unknown Word Replacement

The first way we can use this lexicon is to map unknown words in the source language into the target language. Without a lexicon, when an unknown word is predicted in the target, the NMT system will find the word in the source sentence with the highest alignment weight a_j and copy it into the target as-is. If we have a lexicon and the source word has a lexicon entry, then instead of copying the word f_j as-is, the NMT system will output the word with the highest probability P_{lex}(e|f_j) in the lexicon.

This can be done in lamtram by specifying the map_in function during decoding:

This helped a little bit, raising the BLEU score from 2.58 to 2.63 for my model.

Improving Translation Probabilities

Another way we can use lexicons is to use them to bootstrap translation probabilities (Arthur et al. 2016). This works by calculating a lexicon probability based on the attention weights a_j

This is then added as an additional information source when calculating the softmax probabilities over the output. The advantage of this method is that the lexicon is fast to train, and also contains information about what words can be translated into others in an efficient manner, making it easier for the MT system to learn correct translations, particularly of rare words.

This method can be applied by adding the attention_lex options as follows. “alpha” is a parameter to adjust the strength of the lexicon, where smaller indicates that more weight will be put on the lexicon probabilities:

In my running, this improves our perplexity from 57 to 37, and BLEU score from 2.48 to 8.83, nice!


Beam Search

In the initial explanation of NMT, I explained that translations are generated by selecting the next word in the target sentence that maximizes the probability pe_i. However, while this gives us a locally optimal decision about the next word e'_i, this is a greedy search method that won’t necessarily give us the sentence E' that maximizes the translation probability P(E|F).

To improve search (and hopefully translation accuracy), we can use “beam search,” which instead of considering the one best next word, considers the k best hypotheses at every time step i. If k is bigger, search will be more accurate but slower. k can be set with the --beam option during decoding, so let’s try this here with our best model so far:

where we replace the two instances of BEAM above with values such as 1, 2, 3, 5.

Looking at the results, we can see that by increasing the beam size, we can get a decent improvement in BLEU.
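To make the search procedure concrete, here is a self-contained toy beam search (the next_log_probs callback is a hypothetical stand-in for the NMT model's next-word distribution; this is not lamtram code):

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos=0):
    """Keep the beam_size best partial hypotheses at each time step.
    next_log_probs(seq) returns (word, log_prob) continuations of seq."""
    beams = [([], 0.0)]   # (sequence, cumulative log probability)
    finished = []
    for _ in range(max_len):
        candidates = [(seq + [w], score + lp)
                      for seq, score in beams
                      for w, lp in next_log_probs(seq)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses ending in the end-of-sentence symbol are complete.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# A fixed toy distribution over a 3-word vocabulary, where word 0 is </s>.
dist = lambda seq: [(0, math.log(0.5)), (1, math.log(0.3)), (2, math.log(0.2))]
print(beam_search(dist, beam_size=2, max_len=3))
```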

Adjusting for Sentence Length

However, there is also something concerning about the previous result. “ratio=” is the ratio of “output length”/“reference length”, and if this is less than 1, our sentences are too short. We can see that as we increase the beam size, our sentences are getting to be much shorter than the reference. The reason for this is that as sentences get longer, their probability tends to get lower, and when we increase the beam size we become more effective at finding these shorter sentences.

There are a number of ways to fix this problem, but the easiest is adding a “word penalty” wp which multiplies the probability of the sentence by the constant “e^{wp}” every time an additional word is added. This is equivalent to setting a prior probability on the length of the sentence that follows an exponential distribution. wp can be set using the --word_pen option of lamtram, so let’s try setting a few different values and measure the BLEU score for beam width of 10:

We can see that as we increase the word penalty, we get more reasonable-length hypotheses, which also improves the BLEU a little bit.
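The effect of the word penalty on hypothesis scores can be seen with a tiny numeric sketch (the log probabilities and lengths here are made up for illustration):

```python
def rescore(log_prob, length, wp):
    """Add wp per output word: equivalent to multiplying the sentence
    probability by e^{wp} once for every word in the hypothesis."""
    return log_prob + wp * length

short = (-2.0, 3)   # (log probability, length): shorter, higher probability
long_ = (-3.0, 6)   # longer, lower probability

for wp in (0.0, 0.5):
    better = "short" if rescore(*short, wp) > rescore(*long_, wp) else "long"
    print(f"wp={wp}: the {better} hypothesis wins")
# With no penalty the short hypothesis wins; a positive penalty flips this.
```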

Changing Network Structure

One thing that we have not considered so far is the size of the network that we’re training. Currently the default for lamtram is that all recurrent networks have 100 hidden nodes (or when using forward/backward encoders, the encoders will be 50 and decoder will be 100). In addition, we’re using only a single hidden layer, while many recent systems use deeper networks with 2-4 hidden layers. These can be changed using the --layers option of lamtram, which defaults to lstm:100:1, where the first option is using LSTM networks (which tend to work pretty well), the second option is the width, and third option is the depth. Let’s try to train a wider network by setting --layers lstm:200:1.

One thing to note is that the DyNet toolkit has a default limit of using 512MB of memory, but once we start using larger networks this might not be sufficient. So we’ll also increase the amount of memory to 1024MB by adding the --dynet_mem 1024 parameter.

Note that this makes training significantly slower, because we need to do twice as many calculations in many of our matrix multiplications. Testing this model, the model with 200 nodes reduces perplexity from 37 to 33, and improves BLEU from 10.00 to 10.21. When using larger training data we’ll get even bigger improvements by making the network bigger.


Ensembling

One final technique that is useful for improving final results is “ensembling,” or combining multiple models together. The way this works is that if we have two probability distributions pe_i^{(1)} and pe_i^{(2)} from multiple models, we can calculate the next probability by linearly interpolating them together:

or log-linearly interpolating them together:

Performing ensembling at test time in lamtram is simple: in --models_in, we simply add two different model options separated by a pipe, as follows. The default is linear interpolation, but you can also try log-linear interpolation by setting --ensemble_op logsum. Let’s try ensembling our 100-node and 200-node models to measure perplexity:

This reduced the perplexity from 36/33 to 30 for the ensembled model, and resulted in a BLEU score of 10.99. Of course, we can probably improve this by ensembling even more models together. It’s actually OK to just train several models of the same structure with different random seeds (if you set the --seed parameter of lamtram you can set a different seed, or by default a different one will be chosen randomly every time).

Final Output

Because we’re basically done, I’ll also list a few examples from the start of the test corpus, where the first line is the input, the second line is the correct translation, and the third line is the generated translation.

Not great, but actually pretty good considering that we only have 10,000 sentences of training data, and that Japanese-English is a pretty difficult language pair to translate!

More Advanced (but very useful!) Methods

The following are a few extra methods that can be pretty useful in some cases, but I won’t be testing here:


Regularization

As mentioned before, when dealing with small data we need to worry about overfitting, and some ways to fix this are early stopping and learning rate decay. In addition, we can also reduce the damage of overfitting by adding some variety of regularization.

One common way of regularizing neural networks is “dropout” (Srivastava et al. 2014), which consists of randomly disabling a set fraction of the units in the network during training. This dropout rate can be set with the --dropout RATE option. Usually we use a rate of 0.5, which has nice theoretical properties. I tried this on this data set, and it reduced perplexity from 33 to 30 for the 200 node model, but didn’t have a large effect on BLEU scores.

Another way to do this is using L2 regularization, which puts a penalty on the L2 norm of the parameter vectors in the model. This can be applied by adding --dynet_l2 RATE to the beginning of the option list. I’ve personally had little luck with getting this to work for neural networks, but it might be worth trying.

Using Subword Units

One problem with neural network models is that training time increases as the vocabulary gets larger, so it is often necessary to replace many of the words in the vocabulary with <unk> to keep training times reasonable. A number of methods have been proposed to handle large vocabularies. One simple way to do so without sacrificing (too much) accuracy on low-frequency words is to split rare words into subword units. The method of Sennrich et al. (2016) discovers good subword units using “byte pair encoding”, and is implemented in the subword-nmt package. You can use it as an additional pre-processing/post-processing step before training and decoding with lamtram.
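
The core of byte pair encoding is short enough to sketch (a simplified version of the learning procedure of Sennrich et al. (2016); use the subword-nmt package in practice): repeatedly find the most frequent adjacent symbol pair in the vocabulary and merge it into a new symbol, so frequent word pieces become single units while rare words stay decomposed.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs over a {word (tuple of symbols): freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# Toy corpus: rare words share pieces with frequent ones.
vocab = {("l", "o", "w", "</w>"): 5,
         ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6}
for _ in range(3):
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
```

After a few merges, common fragments like "we" and "lo" have fused into single symbols; the learned merge operations are then applied to split unseen words into known subwords at translation time.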

Training for Other Evaluation Measures

Finally, you may have noticed throughout this tutorial that we are training models to maximize the likelihood, but evaluating our models using BLEU score. There are a number of methods to resolve this mismatch between the training and testing criteria by directly optimizing NMT systems to improve translation accuracy. In lamtram, a method by Shen et al. (2016) can be used to optimize NMT systems for expected BLEU score (or in other words, minimize the risk). In particular, I’ve found that this does a good job of at least ensuring that the NMT system generates output that is of the appropriate length.
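
The quantity being minimized can be sketched as follows (a toy version of the expected-risk objective of Shen et al. (2016); the log-probabilities and BLEU scores below are made-up numbers): renormalize the model's scores for a set of sampled translations with a scaling factor, then take the expected loss 1 - BLEU under that distribution. Training lowers this by shifting probability mass toward high-BLEU samples.

```python
import math

def expected_risk(samples, alpha=0.005):
    """samples: list of (log_prob, bleu) pairs for one source sentence.
    Renormalize the model scores with a scaling factor alpha (cf. the
    --minrisk_scaling option), then compute the expected 1 - BLEU."""
    scores = [math.exp(alpha * lp) for lp, _ in samples]
    z = sum(scores)
    return sum((s / z) * (1.0 - bleu)
               for s, (_, bleu) in zip(scores, samples))

# Three sampled translations: (model log-probability, sentence BLEU)
samples = [(-10.0, 0.4), (-12.0, 0.6), (-30.0, 0.1)]
risk = expected_risk(samples, alpha=0.1)
```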

There are a number of settings that should be changed when using this method:

  • --learning_criterion minrisk: This will enable minimum-risk based training.
  • --model_in FILE: Because this method is slow to train, it's better to first initialize the model using standard maximum likelihood training, then fine-tune it with BLEU-based training. This option reads in the already-trained model.
  • --minrisk_num_samples NUM: This method works by generating samples from the model, then evaluating those samples. Increasing NUM improves the stability of training, but also reduces training efficiency. A value of 20-100 should be reasonable.
  • --minrisk_scaling, --minrisk_dedup: Parameters of the algorithm, including the scaling factor for the probabilities and whether or not to include the correct answer among the samples.
  • --trainer sgd --learning_rate 0.05: I've found that more advanced optimizers like Adam can actually reduce the stability of training, so vanilla SGD may be a safer choice. Slightly lowering the learning rate is also sometimes necessary.
  • --eval_every 1000: Training is a bit slower than standard NMT training, so it can help to evaluate more frequently than once per pass over the corpus.

The final training command simply combines all of these options with the standard training command used earlier.

Preparing Data

Data Size

Up until now, you have been working with the small data set of 10,000 sentences that I've provided. A data set of this size makes training relatively fast, but more data will make accuracy significantly higher. Fortunately, there is also a larger data set of about 140,000 sentences, called train-big.ja and train-big.en, which you can download as well.

Try re-running the experiments with this larger data set, and you will see that accuracy gets significantly higher. Real NMT systems commonly use several million sentences (or more!) to achieve usable accuracy. At these scales you'll often want to evaluate your system more frequently than once per pass over the corpus, so try the --eval_every NUM_SENTENCES option, where NUM_SENTENCES is the number of training sentences after which to evaluate on the dev set. It's also highly recommended that you use a GPU when scaling to larger data and networks.


Also note that up until now we've taken it for granted that our data is already split into words and lowercased. When you build an actual system this will not be the case, so you'll have to perform these steps yourself. Here, for tokenization we're using:

  • English: Moses (Koehn et al. 2007)
  • Japanese: KyTea (Neubig et al. 2011)

And for lowercasing we’re using:

Make sure that you perform tokenization, and potentially lowercasing, before feeding your data into lamtram or any other MT toolkit. You can see an example of how we converted the Tanaka Corpus into the data used for this tutorial by looking at the script in the scripts/ directory.
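
As a rough illustration of the pipeline shape (the regex tokenizer here is a naive stand-in, not Moses or KyTea, which you should use for real data):

```python
import re

def naive_tokenize(line):
    """Stand-in for a real tokenizer: split punctuation off into its own
    tokens so that e.g. "yet." becomes the two tokens "yet" and "."."""
    return re.findall(r"\w+|[^\w\s]", line)

def preprocess(line):
    """Tokenize, rejoin with spaces, and lowercase one line of text."""
    return " ".join(naive_tokenize(line)).lower()

print(preprocess("This isn't tokenized, yet."))
# this isn ' t tokenized , yet .
```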

Final Word

Now you know a few practical things about building an accurate neural MT system. Using the methods described here, we improved a system trained on only 10,000 sentences from 1.83 BLEU to 10.99 BLEU. Switching to larger data should result in much larger gains, and may even produce readable translations.

This is a very fast-moving field, so this guide may be obsolete within a few months of writing (or may be already!), but hopefully it has helped you learn the basics you need to get started, start reading papers, and come up with your own methods and applications.

References

  • Philip Arthur, Graham Neubig, Satoshi Nakamura. Incorporating Discrete Translation Lexicons in Neural Machine Translation. EMNLP, 2016
  • Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2015.
  • Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 1993.
  • Yoav Goldberg. A primer on neural network models for natural language processing. ArXiv, 2015.
  • Nal Kalchbrenner, Phil Blunsom. Recurrent Continuous Translation Models. EMNLP, 2013.
  • Diederik Kingma, Jimmy Ba. Adam: A method for stochastic optimization. ArXiv, 2014.
  • Philipp Koehn et al. Moses: Open source toolkit for statistical machine translation. ACL, 2007.
  • Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
  • Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015.
  • Graham Neubig, Yosuke Nakata, Shinsuke Mori. Pointwise prediction for robust, adaptable Japanese morphological analysis. ACL, 2011.
  • Rico Sennrich, Barry Haddow, Alexandra Birch. Neural machine translation of rare words with subword units. ACL, 2016.
  • Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu. Minimum risk training for neural machine translation. ACL, 2016.
  • Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.