Lessons From Linguistics: i18n Best Practices for Front-End Developers


In our globalized and interconnected world, apps and websites are increasingly being targeted at an international audience. Developers are more likely than ever to be working with content that is intended to be served in several languages. For instance, the Shopify admin is available in 21 languages.

There are many powerful tools available to us to help us in our internationalization (i18n) workflows. Whatever your programming language or front-end framework of choice, there will most likely be an open source library ready to handle the more tricky aspects of translation, such as differing pluralization rules, or complex formatting of dates or currencies.

They aren't magic though! If you simply took an English-language website, moved all of the text strings into language files according to your chosen i18n library, and translated all of those strings as-is, you would almost certainly find errors when viewing the site in other languages, ranging from small grammatical errors to fully garbled sentences. It's natural for us to bake into our code our own internalized assumptions about language based on the ones we know, and the cracks may only show in translation.

Throughout this post, we’ll look at various ways that English grammar can be internalized in code, leading to errors in translation, and at the end I'll provide some development best practices to help you avoid these common pitfalls.

A Simple Example

Take a look at this image subtitle:

Preview modal showing a picture of a dog sleeping on a bed. Below the picture it says "Tony.jpg" followed by "Added: January 1"

The “Added” is static text, while the date will differ based on the image. On a unilingual English site, the component might look something like this:

If you turned this into an i18n-enabled component, it would be natural to convert it to this, where t() is our translate function:
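The original snippets are not reproduced here, so what follows is a minimal sketch of both versions. It assumes plain functions that return the subtitle text instead of real components, and the `translations` table and tiny `t()` lookup are hypothetical stand-ins for whatever i18n library you use.

```typescript
// Hypothetical translation table; a real app would load this from language files.
const translations: Record<string, Record<string, string>> = {
  en: { added: "Added" },
  nl: { added: "toegevoegd" },
};

// Minimal stand-in for an i18n library's translate function.
function t(key: string, locale: string): string {
  return translations[locale]?.[key] ?? key;
}

// Unilingual version: the label and the date are glued together in code.
function subtitleEnglishOnly(dateAdded: string): string {
  return "Added: " + dateAdded;
}

// Naive i18n version: only the word "Added" is translated, while the colon
// and the word order remain hardcoded.
function subtitleNaive(dateAdded: string, locale: string): string {
  return t("added", locale) + ": " + dateAdded;
}

console.log(subtitleNaive("January 1", "en")); // "Added: January 1"
console.log(subtitleNaive("1 januari", "nl")); // "toegevoegd: 1 januari"
```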

There are two problems here:

  1. The colon (“:”), used here to separate a label from its value, is not used the same way (or at all) in many languages.
  2. The order of the words is hardcoded, with “added” preceding the date. This would be incorrect in many languages, from Dutch (“1 januari toegevoegd”) to Korean (“1월 1일 추가”).

This is a simple example with a simple solution: the string needs to be translated as a whole, allowing each translation string to place the dateAdded in the correct place for that language.
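A minimal sketch of that fix, assuming a hypothetical `t()` helper that fills in `{name}` placeholders (real libraries each have their own placeholder syntax). The key name `addedOn` is made up for illustration; the Dutch and Korean strings follow the examples above.

```typescript
// Each language owns the whole sentence, including punctuation and word order.
const translations: Record<string, Record<string, string>> = {
  en: { addedOn: "Added: {dateAdded}" },
  nl: { addedOn: "{dateAdded} toegevoegd" },
  ko: { addedOn: "{dateAdded} 추가" },
};

// Hypothetical translate helper: looks up the string, then fills in {placeholders}.
function t(key: string, locale: string, values: Record<string, string> = {}): string {
  const template = translations[locale]?.[key] ?? key;
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) => values[name] ?? match);
}

console.log(t("addedOn", "en", { dateAdded: "January 1" })); // "Added: January 1"
console.log(t("addedOn", "nl", { dateAdded: "1 januari" })); // "1 januari toegevoegd"
console.log(t("addedOn", "ko", { dateAdded: "1월 1일" })); // "1월 1일 추가"
```

The code no longer knows where the date goes or whether a colon is needed; each translation string decides that for its own language.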

A Less Simple Example

Imagine your web app has a page where users can view all of the images they've uploaded. This page has a status bar at the top to surface helpful info to the user, such as:

Four StatusBar examples with the text: "Loading images..." "1 image found" "The image is ready" "Select image"

You've been working hard at expanding your app's functionality and this page now supports both images and videos! The page view can include one or the other, or both file types at the same time, at which point they are called “files”.

Being a clever and industrious developer, you notice that the text in the status bar will be identical for all three cases aside from the noun for the file type, e.g.

  • “Select image”
  • “Select video”
  • “Select file”

This means instead of creating <ImageStatusBar>, <VideoStatusBar>, and <FileStatusBar> as separate components, you can extract the core logic into a generic <StatusBar> that is shared between all three. You create your strings in the JSON translation files with a placeholder for the file type, and then the fileType can be swapped in.
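Roughly, the shared strings and the generic component logic could look like the sketch below. The key names are hypothetical, and a plain function stands in for the `<StatusBar>` component.

```typescript
type FileType = "image" | "video" | "file";

// Hypothetical English translation file with a {fileType} placeholder.
const en: Record<string, string> = {
  found: "1 {fileType} found",
  ready: "The {fileType} is ready",
  select: "Select {fileType}",
};

// Generic status text shared by all three file types.
function statusBarText(key: string, fileType: FileType): string {
  const template = en[key] ?? key;
  return template.replace("{fileType}", fileType);
}

console.log(statusBarText("select", "video")); // "Select video"
console.log(statusBarText("ready", "file")); // "The file is ready"
```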

This pattern where a translation string has a placeholder, which is later replaced at the code level, is called interpolation. It’s an essential function of i18n libraries that allows dynamic and static text to be used together—it was, in fact, the solution to our previous “Added” example. The component abstraction we've demonstrated here is a common practice for developers, as it should be: abstraction allows us to reuse the same component across multiple similar use cases, which reduces the total amount of code written and leaves the codebase easier to maintain.

Unfortunately, this particular win for efficiency causes a whole host of problems in translation, which may not be intuitive to people who exclusively or primarily speak English, because these problems only arise with grammatical concepts that do not exist at all in English.

Let's go on a tour of a few of these concepts that front-end developers should at least have some awareness of, bringing this <StatusBar> along the way as an example. 

Pluralization

Of course, pluralization exists in English. We have a (mostly) consistent system where a word has singular and plural forms, and the plural is usually denoted with an “s” at the end of the word. Interestingly, the plural form is used for a count of 0 as well, so it’s more accurate to say that English has singular and other forms of each word; this distinction will become clear in a moment.

Astute readers may have noticed that our <StatusBar>, even before translation, has a mistake. The strings again:

Four StatusBar examples with the text: "Loading images..." "1 image found" "The image is ready" "Select image"

We haven't accounted for singular and plural versions! The component needs both. Perhaps we can pass both singular and plural strings into the props to be interpolated?
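A sketch of that idea, with hypothetical prop names and a plain function in place of the component:

```typescript
interface StatusBarProps {
  count: number;
  singular: string; // e.g. "image"
  plural: string; // e.g. "images"
}

// Picks the noun form in application code, using the English rule.
function foundText({ count, singular, plural }: StatusBarProps): string {
  const noun = count === 1 ? singular : plural;
  return `${count} ${noun} found`;
}

console.log(foundText({ count: 1, singular: "image", plural: "images" })); // "1 image found"
console.log(foundText({ count: 0, singular: "image", plural: "images" })); // "0 images found"
```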

This piece of logic, where we determine singular/plural based on count === 1, should set off an alarm bell: this is a human language rule being baked into our application code. Is it possible for other languages to have different pluralization rules? Is there a language that uses the singular word form for a zero count, or even a different word entirely?

The answer is yes—Hindi and Latvian, respectively, are examples of these cases. And there are even more categories of pluralization! The Unicode Common Locale Data Repository (CLDR) identifies six plural forms:

  • zero
  • one (singular)
  • two (dual)
  • few (paucal)
  • many
  • other

You may have never even considered that a language would have dedicated words to denote two of something, or to distinguish “a few dogs” from “many dogs”! What's more, this only covers the plural word forms themselves. The rules for when and how to use each form add yet another layer of complexity.
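JavaScript runtimes expose these CLDR categories through the built-in `Intl.PluralRules` API, which makes it easy to poke at the differences. This is a small sketch, assuming your runtime ships full locale data (current browsers and Node.js builds generally do); the `pluralCategory` helper is just a convenience wrapper I made up.

```typescript
// Ask the runtime which CLDR plural category a number falls into for a locale.
function pluralCategory(locale: string, n: number): string {
  return new Intl.PluralRules(locale).select(n);
}

console.log(pluralCategory("en", 0)); // "other": English uses the plural form for zero
console.log(pluralCategory("en", 1)); // "one"
console.log(pluralCategory("hi", 0)); // "one": Hindi groups zero with the singular
console.log(pluralCategory("lv", 0)); // "zero": Latvian has a dedicated zero category
console.log(pluralCategory("ar", 2)); // "two": Arabic has a dual form
```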

Case Study: Polish

Polish has distinct word endings for “one”, “few” (2-4), “many” (>4), and “other”.

1 dog ⟶ 1 pies
2 dogs ⟶ 2 psy
5 dogs ⟶ 5 psów
1.5 dogs ⟶ 1,5 psa

However, it's not as simple as using “many” as soon as you're past four: the “few” ending is used for numbers that end in 2-4, no matter how big the number itself is.

20 dogs ⟶ 20 psów
22 dogs ⟶ 22 psy
94 dogs ⟶ 94 psy
95 dogs ⟶ 95 psów

...except for 12-14. Those use “many” after all.

12 dogs ⟶ 12 psów
13 dogs ⟶ 13 psów
14 dogs ⟶ 14 psów
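You don't have to encode these rules yourself to see them in action. Here is a sketch using the built-in `Intl.PluralRules` API with the Polish locale, assuming the runtime includes Polish locale data. The `dogForms` lookup is a hypothetical table built from the word forms in the examples above.

```typescript
const polish = new Intl.PluralRules("pl");

// Hypothetical lookup from plural category to the Polish word form,
// using the forms shown in the examples above.
const dogForms: Record<string, string> = {
  one: "pies",
  few: "psy",
  many: "psów",
  other: "psa",
};

// Formats the number for Polish and appends the matching word form.
function countDogs(n: number): string {
  const form = dogForms[polish.select(n)] ?? "psa";
  return `${n.toLocaleString("pl")} ${form}`;
}

for (const n of [1, 2, 5, 1.5, 12, 22, 94, 95]) {
  console.log(countDogs(n));
}
```

Note that nothing in this code mentions "ends in 2-4" or "except 12-14"; the locale data carries those rules.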

Returning to our <StatusBar>, it's clear that singular and plural are insufficient. Could we try to also pass in zero, few, many, and other props to ensure that we're accounting for plurality across languages? While this would cover all of the plural word forms, this generic interpolation still leaves a number of potential problems with the overall sentence beyond just the one word.

Gender

In many languages, nouns have a gender. Some languages have masculine and feminine, some add a neuter gender, some have animate/inanimate (either instead of or in combination with masculine/feminine), and a small number of languages have additional options like third (or more) or dual gender, animal/vegetable, and large objects and liquids.

In gendered languages, the articles and adjectives relating to a noun also reflect that noun's gender. This means that interpolated strings are going to be a problem!

In our <StatusBar>, we had this string:

"The {fileType} is ready"

This is pretty straightforward in English:

"The file is ready"
"The image is ready"
"The video is ready"

But in French, these would be:

"Le fichier est prêt"
"L'image est prête"
"La vidéo est prête"

All three are different! “Fichier” is masculine, while “image” and “vidéo” are feminine. The article changes with gender (“le” masculine, “la” feminine), and the word for “ready”, “prêt”, gains an “e” at the end when referring to a feminine noun. We've also stumbled on an additional rule, where “la image” gets contracted into “l'image” since “image” begins with a vowel.

This means there is no reliable way to interpolate the {fileType} into our <StatusBar> text in French. In fact, any interpolation we choose would result in something that is on average more often incorrect than correct!
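To make the failure concrete, here is a sketch comparing the interpolated approach with one whole sentence per file type. The function and variable names are hypothetical; the French sentences are the ones shown above.

```typescript
type FileType = "file" | "image" | "video";

// Naive approach: one French template plus interpolated nouns.
const nounsFr: Record<FileType, string> = {
  file: "fichier",
  image: "image",
  video: "vidéo",
};

function readyNaive(fileType: FileType): string {
  return "Le {fileType} est prêt".replace("{fileType}", nounsFr[fileType]);
}

// Safer approach: one complete sentence per file type, translated as a whole.
const readyFr: Record<FileType, string> = {
  file: "Le fichier est prêt",
  image: "L'image est prête",
  video: "La vidéo est prête",
};

function readyWhole(fileType: FileType): string {
  return readyFr[fileType];
}

console.log(readyNaive("image")); // "Le image est prêt" (wrong article, wrong adjective ending)
console.log(readyWhole("image")); // "L'image est prête"
```

The second version costs three strings instead of one, but each can be translated correctly on its own.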

Declensions

I want to end with an example that illustrates the vastness of human language, a concept that exists so far outside of the English wheelhouse that many people may never have considered that such a grammatical construction could exist: declensions!

A declension is a change made to a word, usually to its ending, based on the word's syntactic function in a sentence. There will be a base word whose ending changes based on, for example, whether it is the subject or object in the sentence, or whether it is inside or on top of something else, or any number of other slight differences in meaning.

They exist in a wide range of unrelated language families across continents, from Quechuan to Indo-European to Bantu. Interestingly, Old English had them, but lost them along the path to Modern English.

English does have some morphologies wherein a word changes based on its role in the sentence. For example, the singular first-person pronoun can be “I,” “me,” “my,” “mine,” or “myself” depending on what grammatical role the person is playing in the sentence. We also have semantic changes, like adding “s” to the end of nouns to indicate plurality or “ed” to the end of verbs to indicate past tense.

Declensions are like that, except for a wider range of words. Imagine if you added a different sound at the end of “dog” for “I chased the dog,” “I am chasing the dog,” “the dog chased me,” and so on. As you stack word roles, genders, tenses, and plurality, you get an absolute explosion of permutations!

Case Study: Polish Again

I won't get too deep in the weeds here since I'm not a Polish speaker myself, but let's walk through some simple examples.

“I like your dog. Your dog is cute.”
“Lubię twojego psa. Twój pies jest słodki.”

Here you can see both “your” and “dog” decline differently based on whether the dog is the subject or object of the sentence.

“I like your dogs. Your dogs are cute.”
“Lubię twoje psy. Twoje psy są słodkie.”

Interestingly, in the plural case the declensions for "your dogs" have converged into “twoje psy.” But note that the word for “cute”, “słodki,” gains an “e” due to the plurality of the noun it refers to, as in our previous French example.

Polish has a plural “you,” akin to the French “vous” or the Hindi “आप.” So if you said “your dogs are cute” but this time you were addressing more than one person, it would be written:

“Wasze pieski są słodkie.”

Yet another declension on “dog”!

I hope these examples make clear that we can't really have a pre-written set of strings with a {noun} placeholder that gets swapped out by the code. In many languages, the noun itself will change quite a lot based on how it's used in the sentence. In turn, some of the other words will change based on the noun, completely reshaping the sentence. As such, the sentence needs to be translated as a whole to guarantee correctness.

Moje psy są słodkie!

Three Lessons and Best Practices

We've looked at some fascinating and intricate grammatical constructs that you may have never encountered before. You may not even have thought it possible for a single word to mutate so much within a language! It would be overwhelming to have to keep track of all of these different rules; you shouldn't have to be a linguistics expert just to add a checkbox label.

Thankfully, we have excellent tools to take these worries off our mind! These allow us to do our work without needing to manually account for these manifold rulesets. You don't need to know how declensions work, per se; you simply need to remain broadly cognizant of the concept that, in some languages, words can change based on the context of the sentence they're in. Hewing to this principle will guide you towards best practices; I offer three broad lessons below.

Lesson 1: Interpolate with Caution

After experiencing the breakdown of our beloved <StatusBar> component, you may conclude that you should never interpolate. Sure, that would be safest for the translated text, but it's unlikely you can completely avoid interpolation. Consider this text:

The phrase “supported images” is a link to another page, and that link needs to be provided in the code, so we have no choice but to interpolate this string:
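The exact strings are not reproduced here, so the sketch below assumes the sentence is “Learn more about supported images”, based on the discussion that follows. The key names, the `{link}` placeholder, and the URL are all hypothetical, and real code would also need to escape the interpolated values.

```typescript
// Hypothetical English strings: two keys that are always translated together.
const en = {
  learnMore: "Learn more about {link}",
  learnMoreLinkText: "supported images",
};

// Builds the sentence as HTML; the URL comes from code, the words from the translation file.
function learnMoreHtml(url: string): string {
  const link = `<a href="${url}">${en.learnMoreLinkText}</a>`;
  // A replacer function avoids special handling of "$" in the replacement text.
  return en.learnMore.replace("{link}", () => link);
}

console.log(learnMoreHtml("/help/supported-images"));
// Learn more about <a href="/help/supported-images">supported images</a>
```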

Although the sentence gets broken up into two distinct strings, this setup is OK as long as the two are translated together, understood as the same sentence. The translator can translate the “supported images” text knowing its role in the larger sentence, and apply the correct plural morphologies and declensions.

We'd only get into trouble if we tried to reuse either of these strings outside of this specific sentence context. For example, if we tried to get clever again and make a generic <LearnAboutFooter> where we pass other nouns into the “Learn more about {noun}” sentence, this would be repeating the problem we encountered in our <StatusBar> example. Similarly, if we pulled out the “supported images” text and used it as a heading somewhere else, that would be problematic because this context may require different versions of those words.

Lesson 2: Do Not Manually Construct Sentences or Manipulate Text in Code

If you remember our “Added: January 1” example, the error there came from us assuming and hardcoding the word order, and the solution was to translate the sentence as a whole and interpolate the date.

Stated more generally, we should avoid doing any kind of construction or manipulation of text at the level of the code, which could mean word order, punctuation, line breaks, or forcing upper/lower case. This may not be so easy to avoid as in our “Added” example. Take a look at the following design:

The contrast between the different elements makes this an eye-catching design, so it's quite common to see promo text use flashy layouts like this. Unfortunately, this is near-guaranteed to break in translation. There are six distinctly styled elements:

  • Get 10 images
  • per month for only
  • $
  • 2
  • .99
  • /image

Each needs to be a separate element in the code. Handling spacing, line breaks, differing word orders, and multiple currencies in a string with 6 interpolated elements is realistically not tenable. Conversely, hardcoding the order of elements to preserve the styling would produce an incorrect sentence in multiple languages. The only reliable way to handle an elaborate text layout like this is to serve it as an image, with a different version created for each language—or, alternatively, simplify the design to have fewer line breaks and distinctly styled elements.

Lesson 3: Let i18n Libraries Handle the Hard Stuff

If you read the section on pluralization and concluded that it's far too complicated for your code to keep track of all of these rules across different languages, you would be correct! Managing these complex and often conflicting rulesets is exactly the type of thing that i18n libraries are built to solve.

There is a wide range of i18n libraries; the pseudocode examples in this article are modeled on the popular i18next JavaScript framework, which has been adapted to dozens of web frameworks including React, Vue, and Rails. In general, i18n libraries will understand these pluralization rules so that you can simply pass in a quantity and the correct plural form will be returned.

For example, if your English translation file was:

The equivalent Polish file would look like this, including the different plural forms that don't exist in English:

Then, the function call to use this text would simply be:
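The three pieces described above can be sketched together as follows. This is a simplified stand-in, not i18next's actual file format or call signature: the `resources` layout mirrors the `dogCount.many` naming used in this article, and the hypothetical `t()` leans on the built-in `Intl.PluralRules` API to choose the form.

```typescript
// Hypothetical translation files, keyed by CLDR plural category.
const resources: Record<string, Record<string, Record<string, string>>> = {
  en: {
    dogCount: { one: "{count} dog", other: "{count} dogs" },
  },
  pl: {
    dogCount: {
      one: "{count} pies",
      few: "{count} psy",
      many: "{count} psów",
      other: "{count} psa",
    },
  },
};

// Simplified stand-in for a library's t(): picks the plural form from the count.
function t(key: string, locale: string, count: number): string {
  const forms = resources[locale]?.[key] ?? {};
  const category = new Intl.PluralRules(locale).select(count);
  const template = forms[category] ?? forms["other"] ?? key;
  return template.replace("{count}", count.toLocaleString(locale));
}

// The call site only passes the count.
console.log(t("dogCount", "en", 5)); // "5 dogs"
console.log(t("dogCount", "pl", 5)); // "5 psów"
console.log(t("dogCount", "pl", 22)); // "22 psy"
```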

This is such a convenient workflow! You, as a developer, don't need to remember which key is used for which number, or that word forms other than singular/plural even exist. You just pass in the count itself, and the library knows which one to use in every language.

In fact, it's quite important that you don't reach in and use, say, dogCount.many directly. As we explored above, many will be used for different values in different languages, or it may not exist at all. Hardcoding the use of the many word form is bound to be incorrect some of the time. Leave such headaches to the library; that's what it's there for.

Know Grammar? No Problem

Whether you're serving two or 21 languages, multilingual content is inherently going to be a tricky problem space. Fortunately, we're equipped with some very nice tools that take care of the hard stuff. As long as we follow a handful of best practices and guiding principles, we can ensure that we're giving all of our users the best possible experience in their language of choice.

I hope this new knowledge will help you make multi-language-safe code in the future, and that you walk away from this article with a renewed sense of the breadth and beauty of human expression.

Lucas Huang is a Senior Front-End Developer at Shopify. He can be found on GitHub, GitLab, Mastodon, and at any number of dog parks and thrift stores in Montréal.

