
Template Search in Six Languages Using Machine Translation

Every Squarespace trial begins with an important choice: selecting a template. A template is the starting point for a website’s cohesive look and structure.

To streamline the template selection process, we released a search bar in the template store. Search terms are matched to templates that other customers selected for similar purposes. For example, if you search for “yoga studio,” you'll see templates that other customers chose for their yoga studio websites. This enables new users to quickly choose their starting template using the past decisions of successful users, which resolves a key pain point for user onboarding.

Squarespace is now available in six languages: English, Spanish, French, German, Italian, and Portuguese. After releasing the template search in English with support for over 100,000 unique search terms, we needed a quick way to provide the same experience in five additional languages. This post describes how we quickly internationalized the template search along with some challenges we faced along the way.

[Screenshots: template search in Spanish, German, and Portuguese]

The first English version of template search matched search terms to templates using content-based template representations built from customer site descriptions. Given a query, templates are ranked by how well the query matches those representations.

For example, let’s say we have the following user site descriptions and corresponding templates:

Template   Site Description
Bedford    “Let’s do yoga!”
Bedford    “Yoga is good”
Foundry    “I like to drink kombucha”

First we tokenize each site description and map each word to a numerical vector that encodes its semantic meaning (i.e. a word embedding). For example, using pre-trained fastText vectors, “yoga” maps to [0.32074, 0.089077, -0.53238, …, -0.31365] and “kombucha” maps to [-0.42905, 0.10066, -0.49866, …, 0.024994].
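For illustration, here is one way to load pre-trained fastText vectors with gensim. The vector file name is an assumption (any word2vec-format file works), not necessarily what runs in production:

from gensim.models import KeyedVectors

# Word -> 300-dimensional vector lookup, loaded from a pre-trained
# fastText .vec file (file name assumed).
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

print(vectors["yoga"][:3])      # first few dimensions of the "yoga" vector
print("kombucha" in vectors)    # True if the word is in the vocabulary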

Then we group all the tokens by template and assign weights based on word frequency. In the case above, “yoga” receives a higher weight than “good” for the Bedford Template since “yoga” appears in two site descriptions; “good” only appears in a single site description. We use these weights to compute a weighted average word embedding for each template:

Template   Vector Representation
Bedford    [3, -1, 2, …, -1]
Foundry    [1, 0, -1, …, 3]
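A minimal sketch of this weighting scheme, assuming the vectors lookup from the previous snippet (the production tokenizer and weighting are more involved):

import re
from collections import Counter

def template_embedding(descriptions, vectors):
    # Pool tokens from every site description that used this template.
    tokens = [t for d in descriptions
                for t in re.findall(r"\w+", d.lower()) if t in vectors]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequency-weighted average: "yoga" (2 occurrences) outweighs "good" (1).
    return sum((n / total) * vectors[w] for w, n in counts.items())

bedford = template_embedding(["Let's do yoga!", "Yoga is good"], vectors)
foundry = template_embedding(["I like to drink kombucha"], vectors)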

When a user enters a search query into the template store, we calculate the average word embedding of the query. We then rank templates using the cosine similarity between the word embedding of the new query and the template representations.

Using the example above, if a user searched for “exercise”, the Bedford Template would rank higher than the Foundry Template. The word embedding for “exercise” is closer to “yoga”, which appears in site descriptions for Bedford, than to “drink” and “kombucha”, which appear for Foundry.
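Continuing the sketch, a query can be ranked against the template representations with cosine similarity:

import re
import numpy as np

def rank_templates(query, template_vectors, vectors):
    words = [w for w in re.findall(r"\w+", query.lower()) if w in vectors]
    q = np.mean([vectors[w] for w in words], axis=0)

    # Cosine similarity between the query vector and each template vector.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {name: cosine(q, v) for name, v in template_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

rank_templates("exercise", {"Bedford": bedford, "Foundry": foundry}, vectors)
# -> ["Bedford", "Foundry"]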

To internationalize template search, we needed to support five additional languages with consistent experiences across all languages. For example, “food” in English should return the same results as “comida” in Spanish. We also wanted to minimize changes to our existing codebase, service performance, and deployment process.

We ultimately decided to keep the English template search unchanged and translate all non-English queries into English, rather than build a separate template search for each language. We had no site descriptions in languages other than English, and separating the template search implementation from the translations made it easier to keep the experience consistent across languages. As an added bonus, 99% of search queries were only one or two words long, which greatly simplified the translation problem: translating words one by one might be sufficient, with no need to handle longer, more complex phrases.

Translation Options

We evaluated several options for translating search queries into English. Human translation services were not ideal because the existing search supports a vocabulary of over 100,000 unique terms. We also chose not to use paid machine-translation services because we felt we could solve the problem in-house with open-source tools while achieving better service performance.

We first tried bilingual dictionaries from Open Multilingual Wordnet, but their limited vocabularies were not appropriate for our use case. We then tried Neural Machine Translation (NMT) models, similar to what paid machine-translation services use, and quickly achieved state-of-the-art translations on tasks from the 2016 Conference on Machine Translation using TensorFlow’s Neural Machine Translation tutorial [2, 3, 4]. NMT models use large Recurrent Neural Networks to decode sequences of words from one language to another. Despite the impressive sentence-based translations, training takes two to three days per language pair on a single GPU, and we only needed single-word translations.

For a faster solution, we looked into open-source statistical machine-translation models such as fast-align, which model translation through a combination of word alignments between sentences and translation probabilities between word pairs [1]. We found that the translation probability tables from fast-align performed on par with both paid machine-translation services and NMT models on our benchmarks for single-word queries. Fast-align was also roughly 24x faster to train than NMT models per language pair.

We ultimately decided to move forward with fast-align as our MVP for translation because of its ease of use and good translation benchmark performance.

Learning Translation Dictionaries With Fast-Align

To learn the translation dictionaries with fast-align, we downloaded 30–70 million lines of parallel data for each language from OPUS, the Open Parallel Corpus. After cleaning and tokenizing the data, we ran fast-align on all language pairs [1]. We saved the translation probabilities between word pairs and then used these dictionaries to translate queries to English in the existing template search.
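Roughly, this step looks as follows. The -p flag for dumping the learned probability table and the dump’s column order are assumptions based on the fast_align repository, so check its README before reusing this:

# Training (shell), run once per language pair, e.g. Spanish -> English:
#   fast_align -i corpus.es-en -d -o -v -p es_en.params > es_en.align
# We assume each line of the params dump is "<source> <target> <log prob>".

def load_dictionary(params_path):
    best = {}  # source word -> (log prob, English word)
    with open(params_path, encoding="utf-8") as f:
        for line in f:
            src, tgt, logprob = line.split()
            logprob = float(logprob)
            if src not in best or logprob > best[src][0]:
                best[src] = (logprob, tgt)
    # Keep only the highest-probability English translation per word.
    return {src: tgt for src, (_, tgt) in best.items()}

es_en = load_dictionary("es_en.params")
print(es_en.get("ejercicio"))  # ideally "exercise"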

Benchmarks

We used two benchmarks to evaluate the quality of our translations. The first benchmark compared our translations to other machine-translation systems, like Google Translate. The second benchmark evaluated our translations on an open-source task developed for the International Workshop on Semantic Evaluation.

The first benchmark helped us gauge our translation quality compared to Google Translate on the most popular template search queries. We took the top 1,000 queries in the existing template search and translated them to five languages using Google Translate as our ground truth. The task consisted of translating words in each language back to English as we would in production, to see if the translations matched the original English words.

Let’s take the word “exercise” as an example. First, we translated “exercise” to five languages using Google Translate; we got “ejercicio” in Spanish. Next, we translated “ejercicio” back to English with our system; the result was marked correct only if we got “exercise.”
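In code, the round trip looks roughly like this, with english_to_spanish standing in for the Google Translate ground truth and es_en for our fast-align dictionary (both names are illustrative):

def round_trip_accuracy(english_terms, english_to_spanish, es_en):
    correct = 0
    for term in english_terms:
        spanish = english_to_spanish[term]   # e.g. "exercise" -> "ejercicio"
        if es_en.get(spanish) == term:       # back-translation must match exactly
            correct += 1
    return correct / len(english_terms)

print(round_trip_accuracy(["exercise"], {"exercise": "ejercicio"},
                          {"ejercicio": "exercise"}))  # 1.0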

We report the accuracy of translations back into English in the table below for fast-align, Google Translate, and bilingual dictionaries open-sourced by Facebook [5]. We found that fast-align performs slightly worse than Google Translate but much better than the Facebook dictionaries.

Language   Fast-align   Google Translate   FB Dictionaries
FR         0.690        0.736              0.652
ES         0.729        0.766              0.616
IT         0.681        0.736              0.632
PT         0.723        0.760              0.190
DE         0.725        0.742              0.595

While this benchmark is useful, it is somewhat biased. First, we assumed that Google Translate’s first translation was correct, even though it sometimes is not (e.g. “band” in English was translated incorrectly to “B: et” in French). Additionally, Google Translate gets an unfair advantage for words with polysemy. For example, Google Translate translated “advice” to French as “conseil.” When we translated “conseil” back to English, Google Translate predicted “advice,” which is correct, but fast-align predicted “council,” which was marked as incorrect despite being a valid translation.

Benchmark on SemEval 2017 Task 2

The SemEval 2017 Task 2 benchmark helped us gauge our translation quality against human evaluators, resolving the biases outlined above [6].

Task 2 contains several cross-lingual word pairs that need to be scored from 0 to 4 based on their word similarity. This tests both the quality of translations between languages and the quality of word embeddings in those languages. For example, “self-driving car” in English should be marked with a high score (e.g. 4.0) if presented with “vehículo autónomo” in Spanish, since it’s a direct translation. A slightly lower score should be generated (e.g. 3.1) if comparing “self-driving car” with “autobús” (bus) in Spanish since they are not direct translations but are still both means of transportation. Each word pair is given ground-truth human similarity scores, and the final score for Task 2 is computed by comparing the human scores with the ones generated by our system.
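A simplified sketch of how such a pair can be scored with our setup: translate the non-English side word by word, embed both sides, and rescale the cosine similarity onto the 0–4 range. (The official metric correlates system scores with human judgments, so the exact rescaling does not affect the ranking.)

import numpy as np

def pair_similarity(en_phrase, es_phrase, es_en, vectors):
    def embed(words):
        return np.mean([vectors[w] for w in words if w in vectors], axis=0)

    en_vec = embed(en_phrase.lower().split())
    # Translate each Spanish word to English, falling back to the word itself.
    es_vec = embed([es_en.get(w, w) for w in es_phrase.lower().split()])
    cos = float(np.dot(en_vec, es_vec) /
                (np.linalg.norm(en_vec) * np.linalg.norm(es_vec)))
    return 2.0 * (cos + 1.0)  # map cosine's [-1, 1] onto the task's [0, 4]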

The results for Task 2 are reported in the table below. First, we used fastText word embeddings trained on Wikipedia data [7] while using our fast-align dictionaries to translate between languages. We performed identically to the SemEval 2017 baseline model in German, but 3–4% worse in Spanish and Italian. To remove further bias from the quality of the word embeddings, we switched to English vectors similar to those used by Luminoso_run2, the SemEval 2017 winner (i.e. English Numberbatch 17.06 vectors) [8]. Our scores beat the baseline by 2–10% and would have placed second at SemEval 2017. While we still performed worse than Luminoso_run2, our unsupervised translations are clearly on par with those learned through structured knowledge graphs such as BabelNet [9].

System                                 EN-ES   EN-DE   EN-IT
NASARI (baseline) [9]                  0.63    0.60    0.65
Luminoso_run2 [8]                      0.76    0.76    0.78
Fast-align with fastText               0.61    0.60    0.62
Fast-align with EN Numberbatch 17.06   0.68    0.66    0.66

Benchmarking Conclusion

We found that our fast-align dictionaries are on par with Google Translate for one- to two-word queries and would have placed second in the SemEval 2017 Task 2 competition. Fast-align provides fast training and inference without too much sacrifice in accuracy compared to the SemEval systems built on neural machine translation and structured knowledge graphs.

Revisiting the Example

To implement template search for the five additional languages, we simply use fast-align dictionaries to translate all non-English queries to English. In the earlier English example, we said that searching for “exercise” would return the Bedford template. If we now search for “ejercicio” in Spanish, the following steps are performed:

  1. We translate “ejercicio” to “exercise” in English using fast-align dictionaries.
  2. We perform the same English template search as we had before.
  3. We get back the Bedford template with the same ranking as before.

Outcomes

We deployed all six languages in our search service in late November 2017. We also hit our functional targets: the deployment was seamless; service performance remained the same, since we only added O(1) dictionary lookups; the English template search was unchanged; and adding translations required only one extra line of code. Translations can easily be added to other service pipelines with a minimal change:

# Search pipeline steps; the new 'translators' step runs before vectorization.
steps = [
    ('tokenizer', tokenize_transformer),
+   ('translators', translators),
    ('vectorizer', vectorizer),
    ('estimator', model),
]
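As a hypothetical sketch (the names and interface here are illustrative, not our production code), the translators step can be as simple as a per-token dictionary lookup that passes English queries through unchanged:

class DictionaryTranslator:
    def __init__(self, dictionaries):
        # e.g. {"es": {"ejercicio": "exercise"}, "fr": {...}, ...}
        self.dictionaries = dictionaries

    def transform(self, tokens, language="en"):
        table = self.dictionaries.get(language, {})
        # O(1) lookup per token; unknown words fall back to themselves.
        return [table.get(t, t) for t in tokens]

translators = DictionaryTranslator({"es": {"ejercicio": "exercise"}})
print(translators.transform(["ejercicio"], language="es"))  # ['exercise']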

While our approach was simple to implement, its translation accuracy for single-word queries is on par with existing machine-translation services and knowledge graph–based word representations.

Since our release, we have been improving template search results with supervised models that directly predict templates from historical searches. These new models will also leverage the translation methods outlined here.

Check out template search in Spanish, French, Portuguese, Italian, or German. We also added support for queries in all six languages to the English version of template search.

References

  1. C. Dyer, V. Chahuneau, and N. A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL-HLT, Association for Computational Linguistics, pages 644-648. PDF

  2. M. T. Luong, E. Brevdo, and R. Zhao. 2017. Neural Machine Translation (seq2seq) Tutorial. Accessed February 27, 2018. https://github.com/tensorflow/nmt.

  3. J. Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. International Conference on Language Resources and Evaluation, pages 2214-2218.

  4. ACL 2016 First Conference on Machine Translation (WMT16). Accessed February 28, 2018. http://www.statmt.org/wmt16/.

  5. A. Conneau, G. Lample, L. Denoyer, M. Ranzato, and H. Jégou. 2017. facebookresearch/MUSE: A library for Multilingual Unsupervised or Supervised word Embeddings. Accessed February 27, 2018. https://github.com/facebookresearch/MUSE.

  6. J. Camacho-Collados, M. T. Pilehvar, N. Collier, and R. Navigli. 2017. SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity. PDF

  7. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. PDF

  8. R. Speer and J. Lowry-Duda. 2017. ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge. In Proceedings of SemEval-2017, Association for Computational Linguistics. PDF

  9. J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. 2016. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 2016, pages 567-577.
