Duolingo- Learning a Language while Translating the Web
Severin Hacker, Carnegie Mellon University
1 Abstract
We built a new system that translates text cheaply, accurately and quickly. It achieves this by
splitting the translation work into smaller tasks and crowdsourcing those tasks to people who
want to learn a foreign language. In our system the students learn a foreign language while
simultaneously translating text. From the student’s perspective, the system looks and behaves
like a normal language learning website. Duolingo is an example of human computation, where
computer systems enable massive collaborations between humans and computers for the benefit
of humanity.
2 Introduction
Our work starts with a simple question: how can we translate the web into every major
language?
Translating text is currently expensive, slow, or inaccurate. Translation by professional human
translators is accurate but costly and slow. Current cost estimates for professional translation
range from 5 cents a word to 35 cents a word, depending on the desired quality, the language pair
and the nature of the content (Romaine & Richardson, 2009). A professional translator can only
translate around 2,500 words per day. On the other hand, translation by machines is fast and
cheap but still inaccurate. To this day there exists no way of translating text cheaply, quickly
and accurately. Duolingo aims to fix this problem.
Translation by human translators is expensive because the translators are professionally
trained and because they must understand two languages well. It is generally assumed that a
sound knowledge of the two languages is required in order to translate accurately. Unfortunately,
there are not that many people in the world who speak two languages well. Therefore our next
question is: how can we translate the web with people who speak one language only?
A very simple experiment in 2008 showed us that this, as paradoxical as it may seem, can be
done. In this experiment, we gave the participants a short fragment of a news article written in
German and we asked them to translate it into English. In addition, we helped them by giving
them multiple dictionary translations for each word in the sentence (see Figure 1). While not
every participant could translate every sentence correctly, when we compiled the best
translations for each sentence into a document, that final document was a very accurate
translation of the original. This experiment showed us that we can translate complex sentences
with people who know only a single language.
But how can we motivate millions of people to translate? How can we motivate people to do
something for free that so far was a task professionals were paid to do? The answer is: we help
the people learn a language while they translate. This is the core idea of Duolingo: you learn a
foreign language while you simultaneously translate text. This duality is also the reason for
the name "duo-lingo".
Figure 1. English speakers who do not know any German can successfully translate this German sentence
(underlined text) when each word is shown with all its possible English translations. Subjects correctly
translated the last phrase as “the straw that broke the camel’s back” even though the German equivalent is
roughly “the drop that made the barrel overflow.”
2.1 Related Work
Duolingo is part of a new research area, called human computation. The goal of human
computation is to develop theories and build computer systems that enable massive
collaborations between humans and computers for the benefit of humanity.
One example of human computation is reCAPTCHA (von Ahn, Maurer, McMillen,
Abraham, & Blum, 2008), a service that digitizes books by piggybacking on a CAPTCHA. A
CAPTCHA is a task that humans can do but machines cannot; it is a security measure to prevent
automated programs from abusing web services. A reCAPTCHA is a special CAPTCHA which
plays a double role: (1) it guarantees the website owner that the solver of the reCAPTCHA is a
human being and (2) by aggregating the human solvers’ answers it helps digitize books. With
reCAPTCHA, hundreds of millions of people help transcribe books when they sign up for new
web services.
Crowdsourcing and human computation are more or less the same concept: in crowdsourcing,
too, a task is outsourced to a large group of people. The classic example of crowdsourcing is
Amazon Mechanical Turk (Amazon). Several researchers (Zaidan & Callison-Burch, 2011),
(Callison-Burch, 2009) have used Mechanical Turk for the tasks of machine translation and
machine translation evaluation. They report that for both tasks non-experts, when combined,
achieve expert-level results.
Since Duolingo is also an online, adaptive language tutoring system, it is related to research
in Intelligent Tutoring Systems (ITS) (Sleeman, 1982). Although most ITS
were written for the context of teaching mathematics and problem-solving, there are various
systems written for the purpose of teaching a first or second natural language. For example
(Shaalan, 2005) mentions a computer-assisted language learning system for learning Arabic.
(Gamper & Knapp, 2002) give an overview of current systems and a concise classification
scheme.
3 System Overview
While students in Duolingo learn a foreign language and simultaneously translate text, not all
aspects of a language can be taught with translations. For those other aspects, we developed
tailor-made lessons. The lessons are manually created learning units that teach new grammatical
concepts and new vocabulary through interactive exercises called challenges. These challenges
cover all aspects of a language: writing, reading, listening and speaking. As a result Duolingo is
organized into two main areas: the lessons and the translations. The lessons’ purpose is to teach
the students new vocabulary and new grammatical concepts. With the translations, the students
practice the newly acquired knowledge by translating text from the real world.
While we know all the correct answers for the lessons we do not know the correct answers
for the translations beforehand. Therefore, when we grade a student’s answer we can do so
perfectly for the lessons but only approximately for the translations. All the underlying data for
the lessons is manually created whereas the content for the translations comes from the Internet.
We summarized these differences between lessons and translations in Table 1.
                          Lessons               Translations
teaches                   vocabulary, grammar   real-world usage
correct answer known      yes                   no
translating web content   no                    yes
data source               manual                web
Table 1. Comparing Lessons and Translations
In Duolingo, the lessons and translations are grouped into skills. A skill is a list of words. To
learn a skill means to learn the words in that list. Each skill has multiple lessons and multiple
translations associated with it. In the lessons that are associated with a skill we teach a subset of the
words in that skill. For example, the first lesson may teach the first seven words in that skill, the
second lesson the next seven words, etc. The translations that are associated with a specific skill
contain words that are also in that skill.
Currently, Duolingo teaches three languages: English, Spanish and German. The languages
were chosen by multiple criteria: (1) US demand, (2) world-wide demand, (3) languages spoken
by team members, (4) similarity to Western European languages. Table 2 lists the most
frequently learned languages in the US in 2006. Since in Duolingo students learn by translating
text we actually have to look at the learning directions. For example, currently, you can learn
German or Spanish if English is your native language. You can also learn English if Spanish is
your native language. However, we currently do not support the directions “learn English if
German is the language you know” nor “learn German if Spanish is the language you know”, etc.
We call the language that you are learning the learning language or the “to language”, and the
language that you are native in, the “from language”. Our goal is to add the directions French
(when you know English), English (when you know Portuguese) and English (when you know
Chinese) in 2012.
Language    % of enrollment
Spanish     52.2
French      13.1
German      6
Italian     5
Japanese    4.2
Chinese     3.3
Table 2. Most frequently learned languages in the US in 2006.
When students do a translation from the web, they always translate from their learning
language to their native language. Because every translation is thus written by a native speaker of
the target language, we expect the translations to be fluent.
Figure 2. System overview: the Lessons and Translations are built on top of the Skills.
4 Skills
The Duolingo skill system is our answer to the question: what should I learn (next)? In
Duolingo we teach 3,000 words. This is based on research that shows that if you have an active
vocabulary of 3,000 words you are fluent enough to have a basic conversation and you are able
to read a newspaper. For example (Hirsh & Nation, 1992) mention that with a vocabulary size of
2,000 words the reader will be familiar with 90-92% of the words in three novels. We chose the
3,000 words by taking the most frequently used words in the written and spoken language.
We then take those 3,000 words and split them up into distinct lists of words. Each list of
words we call a skill. Figure 3 shows some of the words being taught in the skill Basics 1 for
Spanish. The idea of the skills is to group words together that should be taught together. There
are three kinds of skills: (1) introductory skills, (2) vocabulary skills and (3) grammar skills. For
example, Basics 1 is an introductory skill. The goal of the introductory skills is to introduce very
basic words like the personal pronouns I, you, she that are necessary to get started. The goal of
the vocabulary skills is to teach new vocabulary. For example, the skill Food in Spanish teaches
various words related to food, like: wine, drink, eat. Finally, the goal of the grammar skills is to
teach new grammatical concepts like plurals or the past tense. We manually created between 40
and 50 skills for the languages we want to teach.
Each skill has a fixed number of skill points that students gain when they learn the skill. The
maximum number of skill points depends on the number of words that are being taught in a skill.
For example, the skill Basics 1 in Spanish teaches 27 words and students can gain 96 skill points.
For each skill, there are two thresholds: (1) the learning threshold and (2) the mastered
threshold. The learning threshold is set manually for each skill. When the student has more skill
points than the learning threshold we say that the student has learned that skill. The mastered
threshold is equal to the maximum number of skill points for that skill. When the student has
reached the maximum number of skill points, we say that the student has mastered that skill.

el/el<det><def><m><sg>
hombre/hombre<n><m><sg>
la/el<det><def><f><sg>
mujer/mujer<n><f><sg>
yo/prpers<prn><tn><p1><mf><sg>
un/uno<det><ind><m><sg>
una/uno<det><ind><f><sg>
soy/ser<vbser><pri><p1><sg>
niño/niño<n><m><sg>
niña/niño<n><f><sg>
él/prpers<prn><tn><p3><m><sg>
Figure 3. Some of the words being taught in Basics 1 together with their grammatical forms
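The two-threshold rule above can be sketched in a few lines; the function name and the example learning threshold of 50 are our illustrative choices, not Duolingo's actual values.

```python
def skill_status(points, learning_threshold, max_points):
    """Classify a student's progress on a skill from their skill points:
    'mastered' once the skill's maximum points are reached, 'learned' once
    the manually set learning threshold is exceeded."""
    if points >= max_points:
        return "mastered"
    if points > learning_threshold:
        return "learned"
    return "in progress"
```

With the Basics 1 numbers from the text (96 maximum skill points) and an assumed learning threshold of 50, a student with 60 points has learned but not mastered the skill.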
The skills are arranged in a skill tree where each node is a skill (a list of words) and there
exists an edge from a skill A to a skill B if the students need to learn B before they can learn A,
as described in Figure 4. Learning a skill means learning the list of words associated with the skill.
Thus, the skill tree enforces an order on what the students can learn at any given time. Figure 6
shows an example of a skill tree: after the students have learned the skill Basics 1 they can
continue with Basics 2 or Common Phrases.
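The unlocking rule implied by the skill tree can be sketched as a check over the prerequisite edges. The representation below (a dict from each skill to the set of skills on its incoming edges) and the small example tree, adapted from the skills named in the text, are our assumptions.

```python
def unlocked_skills(prerequisites, learned):
    """Return the skills a student may start: every prerequisite skill has
    been learned, and the skill itself has not been learned yet."""
    return {skill for skill, deps in prerequisites.items()
            if skill not in learned and deps <= learned}

# Each skill maps to the set of skills that must be learned first.
SKILL_TREE = {
    "Basics 1": set(),
    "Basics 2": {"Basics 1"},
    "Common Phrases": {"Basics 1"},
    "Plurals": {"Basics 2", "Common Phrases"},
}
```

For example, with nothing learned only Basics 1 is available; after learning Basics 1, both Basics 2 and Common Phrases unlock.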
Figure 4. There is an edge from skill B to skill A if you need to learn A before you can learn B
The skill tree not only gives the learning process a clear structure, it also allows the students
to “unlock” new skills. This process of “unlocking” skills is very motivating and is inspired
by technology trees in many successful strategy games such as Civilization (Meier & Bruce,
1991). When the students log in to Duolingo, the first thing they see is the skill tree as shown in
Figure 6. We arranged the skills manually into a skill tree by using textbooks as a source of
inspiration.
Each skill has its own page in Duolingo. The skill page appears when the students click on a
skill in the skill tree. On the skill page the students can (1) do a lesson, (2) do a translation or (3)
take a test.
Figure 5. The skill page for Basics 1. The students can do a lesson (on blue background), do a translation
(on dark green background) or take a test (top right).
Figure 6. An example of a skill tree. In this skill tree, the skill Plurals can only be learned after the skills
Phrases, Food, Animals have been learned.
5 Lessons
The goal of the lessons is to teach new vocabulary and new grammar concepts. Unlike the
translations, the correct answer is always known beforehand. Also, when the students do the
lessons they do not translate content from the web.
Each lesson is a sequence of 20 challenges. A challenge is a mini learning task. There are 6
different challenge types: name, listen, translate, form, speak and judge. We will discuss the
individual challenge types in the next chapter. At the start of every lesson the students have three
hearts. Whenever they make a mistake they lose one heart. When they have no more hearts and
they make another mistake they fail the lesson. When the students complete the lesson
successfully, they receive 10 skill points.
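The hearts mechanic just described can be sketched as a small state machine. This is a minimal sketch under our own assumptions (the class and method names are ours, and we assume a wrong answer simply repeats rather than consuming a challenge).

```python
class LessonSession:
    """A lesson: a sequence of challenges, three hearts, one heart lost per
    mistake, failure on a mistake with no hearts left, and 10 skill points
    on successful completion."""

    def __init__(self, num_challenges=20):
        self.remaining = num_challenges
        self.hearts = 3
        self.failed = False

    def submit(self, answer_correct):
        """Record one answer; return the skill points earned (10 when the
        lesson is completed successfully, else 0)."""
        if self.failed:
            return 0
        if answer_correct:
            self.remaining -= 1
        elif self.hearts > 0:
            self.hearts -= 1
        else:
            self.failed = True
        return 10 if self.remaining == 0 and not self.failed else 0
```

A student can thus make at most three mistakes; the fourth mistake fails the lesson.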
5.1 Challenges
5.1.1 Name Challenge
In the name challenge the student is shown four pictures and she has to type what she sees in
the pictures. The name challenges always ask for a single noun like “boy” or “woman”. For
languages that have the concept of a grammatical gender like German and Spanish the student
also has to pick the correct article (e.g. in Spanish “el” for masculine nouns and “la” for feminine
nouns). Figure 7 shows an example of a name challenge in Spanish. With the name challenge
the students improve their writing skills.
Figure 7. In the name challenge the student has to type the word that corresponds to the four pictures.
5.1.2 Translate Challenge
In the translate challenge the student sees a sentence in a source language and she has to
translate that sentence into a target language. There are two kinds of translate challenges,
depending on which language is the source and which is the target: (1) (normal) translate challenges,
where the learning language is equal to the source language and the target language is equal to
the student’s native language and (2) reverse translate challenges where the source language is
equal to the student’s native language and the target language is equal to the learning language.
Clearly, the reverse translate challenges are much harder than the normal translate challenges
because the student has to compose correct sentences in a foreign language. Figure 8 shows an
example of a translate challenge. With the translate challenges the
students improve their ability to read, write and understand the language that they are learning.
When the students hover over a word we give them context-sensitive translations of that
word. We call these context-sensitive translations hints. Our hints are usually very accurate and
we will see their use again in the translations in chapter 6 and their capabilities and
implementation in chapter 9.
Figure 8. In the (normal) translate challenge the student has to translate a single sentence.
The student’s translation is graded automatically by comparing it with a list of correct
translations. If it matches one of the correct translations we say that the translation is correct, else
it is wrong. However, our matching is lenient in the sense that we accept certain kinds of typos
and minor mistakes (like “a”/”an”). The correct translations come from our seed set that we will
discuss in chapter 5.2.
5.1.3 Judge Challenge
In the judge challenge the student sees a sentence and has to mark the correct translations
from a list of candidate translations. More than one translation can be correct and the student has
to select all of them. Figure 9 shows an example of a judge challenge. The correct answers again
come from our seed set. With the judge challenge we can highlight small but important
differences such as the relationship between subjects and objects; thus the students improve their
reading and writing skills.
Figure 9. In the judge challenge the student has to mark the correct translations from 3 candidates.
5.1.4 Listen Challenge
In the listen challenge (see Figure 10), the student listens to a short audio clip and is asked to
type the sentence she hears. The sentence is spoken in the student’s learning language and should
also be typed in the student’s learning language. Listen challenges are among the harder
challenges in a lesson. With the listen challenge the students improve their ability to understand
the spoken language.
Currently, we take a sentence from our seed set and run it through text-to-speech synthesis
(TTS) and then present it to the student. This way, we always know the correct answer and can
grade the student’s answer perfectly.
Figure 10. In the listen challenge the student hears a short sound clip and has to type what she heard.
5.1.5 Speak Challenge
In the speak challenge (see Figure 11), the student speaks a given sentence into the
microphone and the system grades the pronunciation. The system runs the utterance through a
speech recognition system and then compares the output to the reference sentence. The grading
either returns “correct” or “wrong”. Due to imperfections of the speech recognition system, the
students, unlike in the other challenge types, always get another chance when they fail a speak
challenge. Students who do not have a microphone can turn off speak challenges altogether.
With the speak challenge we improve the student’s ability to pronounce phrases in the language
that they are learning.
Figure 11. In the speak challenge the student has to speak a short phrase into the microphone.
5.1.6 Form Challenge
In the form challenge (see Figure 12) the student picks a word from a list of words so that the
resulting sentence is correct. Only one option is correct. With the form challenge we can teach
certain small differences of words like conjugations of verbs, gender agreement of adjectives,
etc.
Figure 12. In the form challenge the student has to pick the correct word from a list.
5.2 The Seed Translations
For each language direction that we teach (e.g. “learning Spanish when English is your native
language”) we have two files with correct translations. These files are manually created. Figure
13 shows a single line from the sentence translations file from Spanish to English. The bolded
word is the surface form of the word that is being tested with the sentence in blue color. The text
in green is the lexeme together with its lexical description. In this case, the lexical description
means that “están” is the conjugated form of the verb “estar” in the 3rd person plural in the
present tense. The lexical description helps us to disambiguate words that have the same surface
form but different syntactic meaning. For example, in English the surface form “sleep” can both
be a verb in infinitive form (“He is going to sleep.”) or a conjugated verb (“We sleep.”). The
sentence in blue color is the example sentence for “están”. Note that the word “están” must
occur in the example sentence in exactly the form described by the green lexical form and its
bolded surface form.
The words surrounded by curly braces are used for the form challenge. The first word is the
correct word (in this case “están”) and the other words (estoy/está/estás/estamos) are wrong
alternatives for “están”.
The sentence in red color is the display translation. This is what we consider the best
translation of the sentence in blue. Whenever a student types a wrong translation we show the
display translation as the correct solution. If there is more than one display translation we show
all of them. The translations in brown color are our accept translations. When a student types an
accept translation we treat the student’s translation as a correct translation. However, the
students generally do not see the accept translations in the user interface.
For each direction we need about 18,000 lines so that we can cover all the words introduced
in the skill files. Our internal goal is to have about three sentences for each lexeme that we teach.
están/estar<vblex><pri><p3><pl>||Hola, ¿cómo {están//estoy/está/estás/estamos} ustedes?||Hello, how
are you all?||[Hello/Hi], how are [you/you all/y'all]?
Figure 13. A line from the sentence translations file from Spanish to English.
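The line format just described ('||'-separated fields: tested word with its lexical description, example sentence with form-challenge options in curly braces, display translation, accept-translation patterns) can be parsed as sketched below. This is our reading of the format from the example line; the real parser may differ in details.

```python
import re

def parse_seed_line(line):
    """Split one seed-translation line into its fields. The curly-brace
    group {correct//wrong1/wrong2/...} encodes the form-challenge options."""
    word_spec, sentence, display, accepts = line.split("||")
    surface, lexeme = word_spec.split("/", 1)
    m = re.search(r"\{(.+?)//(.+?)\}", sentence)
    return {
        "surface": surface,          # e.g. están
        "lexeme": lexeme,            # e.g. estar<vblex><pri><p3><pl>
        "sentence": sentence,        # example sentence, braces included
        "display": display,          # best translation, shown on mistakes
        "form_correct": m.group(1),  # the correct form-challenge option
        "form_wrong": m.group(2).split("/"),  # the wrong alternatives
        "accepts": accepts,          # accept-translation patterns
    }
```

Running it on the example line from Figure 13 recovers “están” as the tested surface form and estoy/está/estás/estamos as the wrong form-challenge alternatives.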
5.3 Generating a Lesson
When the student clicks on a lesson from the skill page Duolingo generates a personalized
lesson for that student. The generation of that lesson takes as input (1) the skill to be taught, (2)
the lesson number and (3) the student’s history on Duolingo. The algorithm needs to output a
sequence of 20 challenges. There is an easy way to get the words that are being taught in each
lesson given a skill and a lesson number: the first lesson in a skill teaches the first seven words in
the skill, the second lesson teaches the next seven words, etc. Thus, skills that have a lot of words
have more lessons.
There are various requirements for generating lessons. First and foremost is that the students
learn the words or concepts being taught in that lesson. This requires a certain amount of
repetition. At the same time, the lessons should not be too repetitive because then they would
become boring. The algorithm needs to guarantee that the lesson is neither too hard nor too easy.
In addition, we want some variety in the types of challenges that we put in the lessons.
Our algorithm works in two stages: it first computes the possible challenges it can create for
each word that it needs to teach and then uses rules to put them in a good order. In the generation
phase we take the word and find a sentence in our set of seed translations (described in chapter
5.2.) that teaches that word. Once we have the word and the sentence we create possible
challenges from those. For example, we can turn each sentence into a listen challenge or a
translate challenge. We then add these challenges to our set of possible challenges. In the second
phase we use several rules to arrange the possible challenges into a good order. The rules are
direct adaptations of our requirements. For example, there is a rule that says that there should be
no three challenges of the same type consecutively. Another rule says that every challenge can
only introduce one new word. A third rule says that every word needs to appear at least twice to
guarantee that there is enough repetition.
There are generally two different ways of applying the rules resulting in two different lesson
generators. In the first way, we generate a sequence of 20 challenges and then check whether that
sequence passes all the rules. If it does not, we reject that sequence and generate a new one. We
repeat this until we find a good sequence or give up. We call this generator an
explorative lesson generator. In the second way, we go one by one, i.e. we take the current
subsequence of challenges and find a challenge such that the new subsequence passes all the
rules. This is similar to mathematical induction. We call this an iterative lesson generator. While
we started with an explorative lesson generator, we now employ an iterative lesson generator
because it is faster. In general our lesson generator can generate a personalized lesson in under
500ms. Algorithm 1 shows the pseudo-code of the iterative lesson generator.
Input:   a skill, a lesson number, the student’s learned words, the seed translations
Output:  a sequence of 20 challenges

Algorithm:
words = skill_words[lesson_number*7 : lesson_number*7 + 7]
possible_challenges = set()
sequence = []
for word in words:
    s = find_sentence_that_teaches(word, seed_translations)
    possible_challenges.update(gen_possible_challenges(word, s))
for i in range(20):
    # find_good_challenge returns a challenge c s.t. sequence+[c] satisfies all the rules
    c = find_good_challenge(sequence, possible_challenges, students_words)
    sequence.append(c)
return sequence
Algorithm 1. Pseudo-code for generating a lesson iteratively.
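A runnable sketch of find_good_challenge and the per-step rule checks follows. The representation of a challenge as a type plus the set of words it exercises is our assumption, and only two of the rules named in the text are checked (the “every word appears at least twice” rule is a property of the finished sequence, so it is omitted from this per-step check).

```python
import random

def satisfies_rules(sequence, candidate, known_words):
    """Check the example rules from the text against sequence + [candidate]."""
    new_seq = sequence + [candidate]
    # Rule: no three consecutive challenges of the same type.
    if len(new_seq) >= 3 and len({c["type"] for c in new_seq[-3:]}) == 1:
        return False
    # Rule: every challenge introduces at most one new word.
    seen = set(known_words)
    for c in new_seq:
        if len(c["words"] - seen) > 1:
            return False
        seen |= c["words"]
    return True

def find_good_challenge(sequence, possible_challenges, known_words):
    """Return a challenge that keeps the sequence rule-conforming, or None."""
    candidates = list(possible_challenges)
    random.shuffle(candidates)  # vary the generated lessons between students
    for c in candidates:
        if satisfies_rules(sequence, c, known_words):
            return c
    return None
```

After two consecutive listen challenges, for instance, a third listen challenge is rejected while a translate challenge is accepted.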
5.4 Grading the Students’ Answers
For the lessons we know all the correct answers since all the sentences come from our seed
set of translations. The grading algorithm is different for different challenge types.
For translate challenges we first find the translation that has the smallest Damerau-
Levenshtein edit distance (Damerau, 1964) (Levenshtein, 1966) from the student’s solution. The
Damerau-Levenshtein edit distance is equal to the normal edit distance but counts swaps of
letters (e.g. “house”/“huose”) as only one edit.
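The distance just described can be transcribed directly; the sketch below implements the restricted (optimal string alignment) variant, which is the common dynamic-programming formulation of Damerau-Levenshtein.

```python
def damerau_levenshtein(a, b):
    """Edit distance where an adjacent transposition ("house"/"huose")
    counts as a single edit, alongside insertion, deletion, substitution."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

The swap example from the text costs a single edit: damerau_levenshtein("house", "huose") is 1, whereas plain Levenshtein distance would count it as 2.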
Once we have the closest solution, we then compute a full token-based alignment between
the student’s answer and the closest solution. To catch typos, each matched token in the
student’s answer is allowed to be at most one character off, as long as the typed token is not
itself a valid word in the language that she is typing. For example, “hhouse” would be accepted
for “house” but “houses” would not, because “houses” is a valid English word.
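The per-token typo rule can be sketched as follows; the vocabulary set stands in for a real dictionary of the language, and the function names are ours.

```python
def within_one_edit(a, b):
    """True if a equals b or can be turned into b with one character
    insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # exactly one substituted character
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # one inserted/deleted character somewhere in the longer string
    return any(short == long_[:i] + long_[i + 1:] for i in range(len(long_)))

def accept_token(typed, correct, vocabulary):
    """The typo rule from the text: a token one character off is accepted
    only if it is not itself a valid word of the language."""
    if typed == correct:
        return True
    return typed not in vocabulary and within_one_edit(typed, correct)
```

This reproduces the example above: “hhouse” passes for “house”, while “houses” is rejected because it is itself a valid word.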
For listen challenges we already know what the students are supposed to type so we only
need to compare the student’s answer with that single correct answer. We then proceed by
computing an alignment and catching typos like we did for the translation challenges.
For name challenges, we need to check whether the article and the typed word are correct. As
there can be multiple correct words for a given set of images, we first find the closest match and
compare against it in the same way as with translate challenges.
For form challenges and judge challenges, we only need to check if the student’s answer
matches exactly our correct answer.
For speak challenges, we run the student’s audio recording through a speech recognition
system and then compare the text with the text they were supposed to speak. If it is close enough
we call it “correct”, else we call it “wrong”. The details of the actual comparison are beyond the
scope of this thesis proposal.
5.5 Giving Detailed Feedback
Whenever possible we try to find the particular mistake that the student made and give her
useful feedback. For example, if the student types “el niña” instead of “la niña” we tell the
student that “niña” is of feminine gender and that the feminine gender requires the article “la”.
We have a similar filter for German. However, due to the grammatical case system in German,
the filter needs to take both case and gender into account. Another peculiarity in German is that
nouns must be capitalized. We built a special filter to catch this kind of mistake. Note that our
filters for German have very high accuracy but currently do not cover all possible cases because
that would require a complete grammatical understanding of the sentences (i.e. a parse tree).
Other common mistakes for which we have built specific filters are: missing a word, typing a
wrong word and typos.
Table 3 gives an overview of the feedback filters.
Filter name        Languages         Description                                          Example                Correction
gender             Spanish           Detects the wrong use of grammatical gender          El niña come.          niña is feminine
gender_case        German            Detects the wrong use of grammatical gender or case  Er hat ein Apfel.      Use “einen” for masculine nouns in the accusative case
casing             German            Detects when a noun is lowercased                    Er hat einen apfel.    Nouns like “Apfel” are capitalized in German
accent/umlaut/eñe  German, Spanish   Detects a missing accent/umlaut/eñe                  La nina come.          Pay attention to the ñ character
missing_word       German, Spanish   Detects a missing word                               Er einen Apfel.        You missed the word “hat”
wrong_word         German, Spanish   Detects a wrong word                                 Er hatte einen Apfel.  “hatte” is wrong
typo               German, Spanish   Detects a typo                                       Err hat einen Apfel.   “Err” is a typo.
Table 3. An overview of the feedback filters when the correct German sentence is “Er hat einen Apfel.”
and the correct Spanish sentence is “La niña come.”
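As an illustration, the Spanish gender filter can be sketched like this. The noun-gender lexicon and all names below are hypothetical; a real system would derive genders from the lexical descriptions in the seed files (e.g. niña/niño<n><f><sg>).

```python
# Hypothetical noun-gender lexicon, keyed by surface form.
NOUN_GENDER = {"niña": "f", "niño": "m", "mujer": "f", "hombre": "m"}
# Article/gender pairs that agree.
AGREEING = {("el", "m"), ("la", "f"), ("un", "m"), ("una", "f")}

def gender_feedback(article, noun):
    """Sketch of the 'gender' filter: return a targeted message when the
    article does not agree with the noun's gender, else None."""
    gender = NOUN_GENDER.get(noun)
    if gender is None or (article, gender) in AGREEING:
        return None
    expected = "la" if gender == "f" else "el"
    kind = "feminine" if gender == "f" else "masculine"
    return f'"{noun}" is {kind}; the {kind} gender requires the article "{expected}"'
```

For the Table 3 example, gender_feedback("el", "niña") produces a message pointing at the feminine article, while the correct "la niña" yields no feedback.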
6 Translations
The translations are the lifeblood of Duolingo. The goal of the translations is two-fold:
(1) make the students learn with real-world examples and (2) translate text from the web. Unlike
the lessons, the correct answers (i.e. the translations of sentences) are generally unknown (if they
were known we would not have to translate them in the first place). The translations, like the
lessons, are reachable from the skill pages (see Figure 5).
Before anything else, we take the corpus of documents that we want to translate (i.e. the web)
and split the documents up into sentences. Whenever a student starts a translation, they are
recommended a single sentence to translate, as seen in Figure 14. The students can then proceed
and translate that sentence or practice first. If they click on “practice first”, they will be shown a
short lesson for the highlighted words. The highlighted words are words that our model thinks
the student does not know (or does not know well enough) and thus require practice. This
practice lesson is generated with the lesson generator described in chapter 5.3.
Figure 14. The translation overview page: the student can either practice the highlighted words first and
then translate or skip practice and translate immediately.
6.1 Associating Translations with Students and Skills
As discussed above, the translations are reachable from the skill pages, but which translation
sentence s should we show to student u on skill page p? Our current algorithm for finding a
sentence looks at all the sentences that include words taught in the corresponding skill and then looks
at which one of those sentences is of the right difficulty for student u. The hard part is estimating the sentence
difficulty. Therefore, a core component of Duolingo is the sentence difficulty estimator that we
discuss in chapter 6.2.
6.2 Estimating Sentence Difficulty
Obviously there is no global sentence difficulty, i.e. the same sentence is harder for beginners
than it is for advanced students. Thus we model sentence difficulty by looking at P_correct(u, s),
the probability that student u gets sentence s correct. We look at two different kinds of
features for building our model: (1) student features and (2) sentence features.
Student features                      Sentence features
number of skill points                number of words
number of days since last activity    number of characters
native language                       number of pronouns
                                      number of nouns
                                      number of verbs

Table 4. Some student and sentence features for estimating P_correct(u, s)
Table 4 shows some of the student and sentence features that we use in our current estimator.
The estimator uses logistic regression and was trained on a set of training data. We consider a
sentence to be of the right difficulty if P_correct(u, s) is within certain bounds.
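A logistic-regression estimator over such features can be sketched as below. The weights are illustrative hand-set values, not trained ones, the feature names are assumptions drawn from Table 4, and the difficulty bounds are made up, since the actual bounds are not stated.

```python
import math

# Illustrative, hand-set weights; the real model's weights are learned.
WEIGHTS = {
    "bias": 1.0,
    "skill_points": 0.002,            # more experience -> easier
    "days_since_last_activity": -0.05,
    "num_words": -0.15,               # longer sentences -> harder
    "num_verbs": -0.10,
}

def p_correct(features):
    """P_correct(u, s) under a logistic-regression model: the sigmoid of a
    weighted sum of student and sentence features."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def right_difficulty(features, low=0.5, high=0.9):
    """A sentence is 'of the right difficulty' for a student when the
    predicted probability falls within chosen bounds (bounds assumed)."""
    return low <= p_correct(features) <= high
```

Under these weights, a short sentence for an experienced student scores a much higher predicted probability than a long, verb-heavy one.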
6.3 Translating Sentences and Rating Translations
Once the students start a translation, they see a screen as in Figure 15. The student sees the
recommended sentence and is asked to translate the sentence from the learning language (here
Spanish) to her native language (here English). When the student hovers over a word we show
context-sensitive hints for that word. Our hints are so precise that with their help even total
beginners (i.e. monolinguals) can translate most sentences correctly without any problems. Once
the sentence is entered we grade it immediately by comparing it with the correct translations
from other students for the same sentence (see chapter 6.4). If the agreement with other correct
solutions is high enough we accept the student’s answer as correct, otherwise we say it is wrong.
Figure 15. Translating a single sentence from the web. When the student hovers over a word we show
context-sensitive hints. After entering a sentence, we compute the percentage agreement between the
student’s solution and other correct solutions (here: 77% agreement).
After that, the student sees a screen as in Figure 16. In this screen the student grades the
translation of the same sentence by a different student. Because the student just translated
this sentence, its meaning is still fresh in memory and the student can tell
us whether this other translation is good or not. The ratings are necessary for grading other students
and also for compiling the final translation of a document. After the student has both translated the
sentence and rated a translation from a different student, the student can continue
with the next (or any other) sentence of the document or go back to the skill page.
Figure 16. Rating a translation from a different student. The scale is {bad, medium, very good}.
6.4
Grading Translations
When we want to grade a translation, we first compute a set of correct translations from all
the translations the students have entered, by looking at the ratings those translations have
received from other students. We do this by converting the ratings into up and down
votes: we treat bad and medium as down votes and very good as an up vote. We then view the
up and down votes for each candidate translation as draws from a binomial distribution B(n, p) and
compute the Wilson 90% confidence interval for p. Finally, we define a candidate translation as
correct if the lower bound of that interval is greater than 50%. In other words, under these assumptions the
probability that the true p is smaller than 50%, given that we say the translation is correct, is less
than 10%. When under these conditions there exists no correct translation, we define the
two currently highest-rated candidates (i.e. those with the highest lower bound for p) as the
best translations. If there is not yet any rating, we define the machine translation as the correct
translation.
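The voting rule can be sketched like this; z = 1.645 (the two-sided 90% normal quantile) and the example vote counts are illustrative.

```python
import math

# Sketch of the grading rule: convert ratings to up/down votes, compute the
# Wilson 90% confidence interval for p, and accept a candidate translation
# when the interval's lower bound exceeds 0.5.

def wilson_lower_bound(ups, downs, z=1.645):
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    denom = 1 + z * z / n
    center = phat + z * z / (2 * n)
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def is_correct(ratings):
    ups = sum(1 for r in ratings if r == "very good")
    downs = len(ratings) - ups  # "bad" and "medium" count as down votes
    return wilson_lower_bound(ups, downs) > 0.5

print(is_correct(["very good"] * 9 + ["medium"]))  # many up votes -> True
print(is_correct(["very good", "bad", "medium"]))  # mixed votes -> False
```

The Wilson interval is preferable to the raw vote fraction here because it penalizes candidates with only a handful of votes.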
Once we have the set of correct translations we compute the METEOR (Banerjee & Lavie,
2005) score between the student’s text and the set of correct translations and return that as our
grade (“agreement with other correct translations”) to the student as seen in Figure 15.
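METEOR itself matches unigrams via stems and synonyms and applies a fragmentation penalty; as a rough stand-in only, the following sketch scores the student's text against each correct translation with a plain unigram F-measure and reports the best agreement.

```python
# Simplified illustration of grading against a set of correct translations.
# This is NOT METEOR; it only mimics the "agreement" idea with exact unigrams.

def unigram_f(candidate, reference, beta=0.5):
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = len(set(cand) & set(ref))
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Weighted harmonic mean; beta=0.5 gives the plain F1 score.
    return precision * recall / (beta * precision + (1 - beta) * recall)

def grade(student_text, correct_translations):
    """Agreement with the best-matching correct translation, in [0, 1]."""
    return max(unigram_f(student_text, ref) for ref in correct_translations)

refs = ["the cat sleeps on the sofa", "the cat is sleeping on the couch"]
print(round(grade("the cat sleeps on a sofa", refs), 2))  # -> 0.83
```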
6.5
Compiling Final Translations
We say a translation of a document is done when we have a correct translation for each
sentence, i.e. when for every sentence we have a translation whose lower bound for p is larger
than 0.5. Once a document is done, we compile a final translation by going through each sentence
and picking the candidate with the highest lower bound for p as the final translation for that
sentence.
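A sketch of the compilation step; the candidate translations and lower bounds below are made up for illustration.

```python
# Compile the final document translation: for every source sentence, pick the
# candidate with the highest Wilson lower bound for p.
# `candidates` maps each source sentence to (translation, lower_bound) pairs.

def compile_translation(candidates):
    final = []
    for options in candidates.values():
        best_translation, _ = max(options, key=lambda pair: pair[1])
        final.append(best_translation)
    return " ".join(final)

doc = {
    "El gato duerme.": [("The cat sleeps.", 0.81), ("The cat is asleep.", 0.65)],
    "Es pequeño.": [("It is small.", 0.72), ("It's tiny.", 0.58)],
}
print(compile_translation(doc))  # -> The cat sleeps. It is small.
```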
7 Tests
So far we have assumed that students start with no prior knowledge of the language they
are learning. This is unrealistic, so there must be a way for more advanced students to go
through Duolingo at a faster rate. For those students, we developed tests. For each skill, the
students can take a test. If they pass the test, they immediately get all the skill points for that skill
(i.e. the skill becomes mastered) and the skills below that skill in the skill tree are unlocked.
In total, students have three tries for each skill. If they fail three times, they must take the
long route and do the lessons.
The tests are generated with a generator very similar to the lesson generator described in
chapter 5.3. Instead of picking the words consecutively as in the lessons, we pick seven
words at random and then continue the algorithm exactly as in the lessons. There are no hints in
tests.
8 Practice
Since people easily forget vocabulary, Duolingo has a feature to practice words that the
students have forgotten or are about to forget. The current algorithm finds the words that the
student has most likely forgotten by looking at when each word was last seen. It then
creates a practice session that teaches those words, using a generator similar to the lesson
generator. Duolingo advises students to practice at least once a day. When Duolingo notices a
lack of practice and the student has turned on practice reminders, Duolingo sends the student
an email reminder. Every practice session gives 10 skill points but does not unlock new skills.
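A sketch of the practice selector, under the simplifying assumption that only the time since a word was last seen signals forgetting; the words and dates are illustrative.

```python
from datetime import date, timedelta

# Pick the words seen longest ago as the ones most likely forgotten.
# `last_seen` maps word -> date the student last saw it.

def words_to_practice(last_seen, limit=7):
    return sorted(last_seen, key=lambda w: last_seen[w])[:limit]

today = date(2012, 4, 4)
last_seen = {
    "perro": today - timedelta(days=30),
    "gato": today - timedelta(days=1),
    "agua": today - timedelta(days=10),
}
print(words_to_practice(last_seen, limit=2))  # -> ['perro', 'agua']
```

A production system would likely also weight word difficulty and how often the student got each word right, not just recency.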
9 Context-Sensitive Dictionary Hints
As seen in Figure 15, Duolingo includes an advanced dictionary module for providing
context-sensitive hints for every word that the student hovers over. Among other features, our
system can translate conjugated verbs in context, common phrases of multiple words, German
compound nouns and German separable verbs. In German, there are compound nouns that
translate into multiple words in English, for example “Mannschafts|bus” (team bus). Another
peculiarity of German is separable verbs. In German, a verb can split into two parts: a main part
that stands behind the subject and a particle at the end of the sentence. We found that
students get confused by separable verbs because they do not understand the role of the little
particle at the end of the sentence. However, our advanced dictionary module can highlight
both parts when the student hovers over either part, making it clear that these words belong
together, and we can give the correct dictionary translation as well. Table 5 shows some example
sentences and hints.
Feature(s)                        Example                          First hint for blue word(s)
conjugation                       ¿Tú duermes entre ellos?         (you) sleep
multi-word hints, conjugation     De Florencia nos cuentan, …      (they/you-plural) tell (us)
multi-word hints, conjugation     Disculpe, si me duermo.          (I) sleep
multi-word hints, phrases         …, en el que hubo …              where
German compound word              Donnerstag|morgen                Thursday morning
German compound word              Flug|feld|kontrolleur            flight field inspector (airfield inspector)
German separable verbs            Er kommt heute an.               (he/she/it/you-plural) arrives

Table 5. Features of the dictionary module
10 Research Plan
Since Duolingo tries to achieve two goals at the same time, we must scientifically
evaluate it with respect to both. The first main goal of Duolingo is to translate text.
From data gathered in the private beta, we are confident that the translations produced by
Duolingo are of high quality. Nevertheless, we want to evaluate the translations produced by Duolingo
with existing metrics, both automated (BLEU (Papineni, Roukos, Ward, & Zhu, 2002),
METEOR (Banerjee & Lavie, 2005)) and manual (rating scale), and compare them against
translations produced by machine translation and by professional translators. We also want to determine
the translation capacity of Duolingo, i.e. how many sentences we can translate per second
per active user. This leads to the following two hypotheses:
Hypothesis 1. The translations produced by Duolingo are as good as those produced by
professional translators.
Hypothesis 2. The translation capacity is high enough to translate the web in reasonable
time.
The second main goal of Duolingo is to teach students a foreign language. We want to figure
out if students actually do learn a foreign language with Duolingo. Anecdotal evidence shows
that they do. However, we want to evaluate Duolingo with established tests such as the CAPE
(Computer-Adaptive Placement Exam) (Madsen, 1991) test, TOEFL (Test of English as a
Foreign Language) and ACTFL (The American Council on the Teaching of Foreign Languages)
OPIc (Oral Proficiency Interviews) and compare it against existing methods, both from the off-
and online world. This leads to our third hypothesis:
Hypothesis 3. Students learn with Duolingo as well as with comparable language learning
products.
A third goal is to derive a theoretical model of student behavior on the site. In particular, we
want to develop three models: (a) a knowledge model that predicts which challenges a student
can answer correctly at any given point in time (similar to the difficulty estimator described in
chapter 6.2), (b) a learning model that predicts how the student’s (unobservable) mental state
updates after a certain action on the site, (c) a motivational model that predicts if and when a
particular student will come back to the site. These lead to hypotheses 4-6.
Hypothesis 4. We can build a model that predicts with high accuracy which challenges a
student can answer correctly.
Hypothesis 5. We can build a model that predicts with high accuracy how the students’
actions affect their learning outcomes.
Hypothesis 6. We can build a model that predicts with high accuracy if and when students
return to the site.
Finally, we want to figure out how these three goals can help us improve the site for
both learning and translation. In particular, a long-term vision for Duolingo is to build a system
that learns how people learn. Thus, Duolingo sits at the intersection of human
learning and machine learning.
11 References
Amazon. (n.d.). Amazon Mechanical Turk. Retrieved from http://www.mturk.com
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgments. Proceedings of Workshop on Intrinsic and
Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the
Association of Computational Linguistics (ACL-2005). Ann Arbor, Michigan.
Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using
Amazon's Mechanical Turk. Proceedings of the Conference on Empirical Methods in Natural
Language Processing. Stroudsburg.
Sleeman, D., & Brown, J. S. (1982). Intelligent Tutoring Systems. Science, 456-462.
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7(3), 171.
Gamper, J., & Knapp, J. (2002). A Review of Intelligent CALL Systems. Computer Assisted
Language Learning, 329-342.
Hirsh, D., & Nation, P. (1992). What Vocabulary Size is Needed to Read Unsimplified Texts
for Pleasure. Reading in a Foreign Language, 689-696.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and
reversals. Soviet Physics Doklady, 707-710.
Madsen, H. S. (1991). Computer-adaptive testing of listening and reading comprehension.
Computer-assisted language learning and testing: Research issues and practice, 237-257.
Meier, S., & Shelley, B. (1991). Civilization. MicroProse.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic
Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association
for Computational Linguistics (ACL) (pp. 311-318). Philadelphia.
Romaine, M., & Richardson, J. (2009). Translation Industry Report 2009. Retrieved April 4,
2012, from mygengo.com: http://mygengo.com/express/report/translation-industry-2009
Shaalan, K. F. (2005, February). An Intelligent Computer Assisted Language Learning
System for Arabic Learners. Computer Assisted Language Learning, 81-108.
von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). reCAPTCHA:
Human-Based Character Recognition via Web Security Measures. Science, 1465-1468.
Zaidan, O. F., & Callison-Burch, C. (2011). Crowdsourcing Translation: Professional Quality
from Non-Professionals. Proceedings of ACL. Portland.