Neural Machine Translation of Rare Words with Subword Units

Authors: Rico Sennrich, Barry Haddow, Alexandra Birch

Year: 2016

Venue: ACL 2016

Categories: NLP, Machine Translation, Subword Segmentation

Abstract

This paper adapts Byte-Pair Encoding (BPE), a simple data compression technique, to word segmentation for neural machine translation. The approach addresses the challenge of handling rare and unknown words in neural translation models by learning subword representations automatically from the training data.

Problem Statement

The Core Challenge

Neural Machine Translation (NMT) systems face a fundamental vocabulary problem:

  1. Out-of-Vocabulary (OOV) Words: When the model encounters words not seen during training, it can’t translate them
  2. Rare Words: Even words in the vocabulary that appear infrequently are poorly translated
  3. Morphologically Rich Languages: Languages with complex word forms (like German compounds) create massive vocabularies
  4. Fixed Vocabulary Size: Models must choose between computational efficiency and vocabulary coverage

Real-World Impact

  • Google Translate struggling with technical terms
  • Poor translation of proper nouns and domain-specific vocabulary
  • Inability to handle new words that emerge after training

Previous Approaches and Their Limitations

Dictionary Lookup with UNK Tokens:

Earlier Neural Machine Translation systems operated with fixed vocabularies; any word outside the vocabulary was replaced with a generic [UNK] token.

How it worked:

  • During training, rare or unseen words were replaced with [UNK]
  • At inference, if [UNK] appeared in the translation, the model looked up the source word in a precompiled bilingual dictionary and tried to substitute a likely translation

Why it didn’t work:

  • Requires external resources (bilingual dictionaries aren’t always available or complete)
  • Fails for morphologically rich languages, compounds, or creative words (e.g., “moonwalking”)
  • Breaks end-to-end training: Translation and word recovery are disconnected steps
  • Can’t adapt dynamically to context or domain-specific terms

Main Idea

The Breakthrough Insight

Instead of treating words as indivisible units, break them down into smaller, meaningful subword pieces that can be recombined to represent any word.

Example:

"unparalleled" → ["un", "parallel", "ed"]

This enables the model to:

  • Learn patterns at a finer level
  • Generalize to unseen words (e.g., translate “toothbrush” even if it has only seen “tooth” and “brush” during training)

Key Innovation

Use Byte-Pair Encoding (BPE), a simple data compression algorithm, to learn how to split words into subword units automatically from the training data.

How BPE Works

BPE is a compression algorithm that iteratively merges the most frequent pairs of adjacent symbols into subword units. The result is a fixed-size subword vocabulary from which any word can be composed, so UNK tokens are no longer needed.

Complete BPE Algorithm

Phase 1: Training BPE

Input:

  • Large corpus of text
  • Target vocabulary size (e.g., 50K subwords)

Step 1: Initialize Character-Level Vocabulary

  • Take every word in your corpus with its frequency count
  • Split each word into individual characters
  • Add a special end-of-word marker </w> to each word
  • This gives you the initial vocabulary where every character is a separate token
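
As a concrete sketch of this initialization step, assuming the corpus has already been tokenized into a plain list of words (the helper name build_initial_vocab is illustrative, not from the paper):

import collections

def build_initial_vocab(corpus_words):
    # Count word frequencies, then split each word into characters
    # and append the special end-of-word marker </w>.
    counts = collections.Counter(corpus_words)
    return {' '.join(word) + ' </w>': freq for word, freq in counts.items()}

# build_initial_vocab(['low', 'low', 'lower'])
# -> {'l o w </w>': 2, 'l o w e r </w>': 1}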

Step 2: Count All Adjacent Symbol Pairs

  • Look at every word in your vocabulary
  • Count how many times each pair of adjacent symbols appears
  • Create a frequency table of all symbol pairs

Step 3: Find and Merge Most Frequent Pair

  • Identify the symbol pair that appears most frequently across all words
  • Replace all occurrences of the most frequent pair with a single merged symbol
  • Update your vocabulary with the merged version
  • The merged symbol becomes a new subword unit

Step 4: Record and Repeat

  • Save this merge operation in your merge list
  • This list will be used later for encoding new text
  • Go back to Step 2 and repeat the process
  • Continue until you’ve learned the desired number of merge operations
  • Typically, you’ll perform 10,000-50,000 merge operations
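
To make the training loop concrete, here is a hand-worked trace on a small toy vocabulary (for illustration only); each comment shows the vocabulary after one merge:

# Initial vocabulary (word frequencies; symbols separated by spaces):
#   {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
#
# Merge 1: ('e', 's') appears 9 times, in "newest" and "widest" (ties are broken arbitrarily)
#   -> {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
# Merge 2: ('es', 't') appears 9 times
#   -> {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
# Merge 3: ('est', '</w>') appears 9 times
#   -> {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
# Merge 4: ('l', 'o') appears 7 times, in "low" and "lower"
#   -> {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
# ...and so on, until the target number of merge operations has been learned.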

Phase 2: Encoding with BPE

Input:

  • New text to encode
  • List of merge operations learned during training

Step 1: Initialize with Characters

  • Take the input word
  • Split it into individual characters
  • Add the end-of-word marker </w>

Step 2: Apply Merge Operations in Order

  • Take the first merge operation from your learned list
  • Scan through the character sequence looking for that specific pair
  • If found, merge those two symbols into one

Step 3: Continue with Next Merge

  • Apply the second merge operation to the current sequence
  • Keep going through all merge operations in the order they were learned

Step 4: Stop When No More Merges Apply

  • Continue applying merge operations until no more pairs in your sequence match any remaining merge operations
  • The final sequence is your BPE encoding
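
A minimal Python sketch that follows the four encoding steps above; the merges argument is the ordered merge list recorded during training, and the merge list in the example comment is hypothetical:

def encode(word, merges):
    # Step 1: split into characters and add the end-of-word marker.
    symbols = list(word) + ['</w>']
    # Steps 2-4: apply each learned merge, in the order it was learned.
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Example with a hypothetical merge list:
# encode('lowest', [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')])
# -> ['low', 'est</w>']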

Why BPE Works

1. Eliminates Out-of-Vocabulary (OOV) Problem

  • Any word can be represented as a sequence of subwords
  • If “unparalleled” is not in the vocabulary, it can still be segmented as “un” + “parallel” + “ed”
  • Guarantee: because single characters remain in the vocabulary as a fallback, coverage is 100%

2. Captures Linguistic Patterns

  • Learns meaningful morphological units: prefixes, suffixes, roots
  • “un-”, “-ness”, “-ing” become natural subword units
  • Frequency-based learning captures language statistics

3. Balances Efficiency and Coverage

  • Frequent words stay whole: “the”, “and”, “is”
  • Rare words get broken down: “antidisestablishmentarianism”
  • Optimal vocabulary size for computational efficiency

Experimental Results

The paper demonstrates significant improvements:

  • English→German: up to +1.1 BLEU over a back-off dictionary baseline
  • English→Russian: up to +1.3 BLEU over a back-off dictionary baseline
  • Rare word translation: 60% improvement in accuracy

Mathematical Formulation

The BPE algorithm can be formalized as:

\[\text{BPE}(w) = \text{merge}(\text{split}(w))\]

Where:

  • \(w\) is the input word
  • \(\text{split}(w)\) creates character-level segmentation
  • \(\text{merge}()\) applies the learned merge operations in order
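
The training phase that produces these merge operations can be written in the same notation. If \(\text{count}(x, y)\) denotes the corpus frequency of the adjacent symbol pair \((x, y)\), each training iteration greedily selects

\[(x^*, y^*) = \arg\max_{(x, y)} \text{count}(x, y)\]

and replaces every occurrence of \(x\,y\) with the merged symbol \(xy\), repeating until the target number of merge operations has been learned.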

Implementation Example

Here’s a simplified Python implementation of the two core training routines (pair counting and merging):

import collections
import re

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of `pair` into a single symbol.
    bigram = re.escape(' '.join(pair))
    # Only match the pair at symbol boundaries, so e.g. ('a', 'b') never
    # merges across the tail of a longer symbol such as 'xa b'.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    new_vocab = {}
    for word, freq in vocab.items():
        new_vocab[pattern.sub(''.join(pair), word)] = freq
    return new_vocab
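
A minimal training loop built on these two helpers might look like the following; the toy vocabulary and number of merges are illustrative only:

vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

num_merges = 10   # illustrative target; real systems learn tens of thousands
merges = []       # ordered merge list, reused later when encoding new text

for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges)   # e.g., [('e', 's'), ('es', 't'), ('est', '</w>'), ...]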

Conclusion

BPE represents an elegant solution to a fundamental problem in neural machine translation. Its simplicity, effectiveness, and broad applicability have made it a cornerstone technique in modern NLP. The paper’s impact extends far beyond machine translation, influencing the entire field of natural language processing.

Explained on December 19, 2024