Duolingo- Learning a Language while Translating the Web
Severin Hacker, Carnegie Mellon University
1 Abstract
We built a new system that translates text cheaply, accurately and quickly. It achieves this by
splitting the translation work into smaller tasks and crowdsourcing those tasks to people who
want to learn a foreign language. In our system the students learn a foreign language while
simultaneously translating text. From the student’s perspective, the system looks and behaves
like a normal language learning website. Duolingo is an example of human computation, where
computer systems enable massive collaborations between humans and computers for the benefit
of humanity.
2 Introduction
Our work starts with a simple question: how can we translate the web into every major
language?
Translating text is currently expensive, slow, or inaccurate. Translation by professional human
translators is accurate but costly and slow. Current cost estimates for professional translation
range from 5 cents a word to 35 cents a word, depending on the desired quality, the language pair
and the nature of the content (Romaine & Richardson, 2009). A professional translator can only
translate around 2,500 words per day. On the other hand, translation by machines is fast and
cheap but still inaccurate. To this day there exists no way of translating text cheaply, quickly
and accurately. Duolingo aims to fix this problem.
Translation by human translators is expensive because the translators are professionally
trained and because they must understand two languages well. It is generally assumed that a
sound knowledge of the two languages is required in order to translate accurately. Unfortunately,
there are not that many people in the world who speak two languages well. Therefore our next
question is: how can we translate the web with people who speak one language only?
A very simple experiment in 2008 showed us that this, as paradoxical as it may seem, can be
done. In this experiment, we gave the participants a short fragment of a news article written in
German and we asked them to translate it into English. In addition, we helped them by giving
them multiple dictionary translations for each word in the sentence (see Figure 1). While not
every participant could translate every sentence correctly, when we compiled the best
translations for each sentence into a document, that final document was a very accurate
translation of the original. This experiment showed us that we can translate complex sentences
with people who know only a single language.
But how can we motivate millions of people to translate? How can we motivate people to do
something for free that so far was a task professionals were paid to do? The answer is: we help
the people learn a language while they translate. This is the core idea of Duolingo: you learn a
foreign language while you simultaneously translate text. This duality is also the reason for
the name "duo-lingo".
Figure 1. English speakers who do not know any German can successfully translate this German sentence
(underlined text) when each word is shown with all its possible English translations. Subjects correctly
translated the last phrase as “the straw that broke the camel’s back” even though the German equivalent is
roughly “the drop that made the barrel overflow.”
2.1 Related Work
Duolingo is part of a new research area, called human computation. The goal of human
computation is to develop theories and build computer systems that enable massive
collaborations between humans and computers for the benefit of humanity.
One example of human computation is reCAPTCHA (von Ahn, Maurer, McMillen,
Abraham, & Blum, 2008), a service that digitizes books by piggybacking on a CAPTCHA. A
CAPTCHA is a task that humans can do but machines cannot; it is a security measure to prevent
automated programs from abusing web services. A reCAPTCHA is a special CAPTCHA which
plays a double role: (1) it guarantees the website owner that the solver of the reCAPTCHA is a
human being and (2) by aggregating the human solvers’ answers it helps digitize books. With
reCAPTCHA, hundreds of millions of people help transcribe books when they sign up for new
web services.
Crowdsourcing and human computation are more or less the same concept: in crowdsourcing,
too, a task is outsourced to a large group of people. The classic example of crowdsourcing is
Amazon Mechanical Turk (Amazon). Several researchers (Zaidan & Callison-Burch, 2011),
(Callison-Burch, 2009) have used Mechanical Turk for the tasks of machine translation and
machine translation evaluation. They report that for both tasks non-experts, when combined,
achieve expert-level results.
Since Duolingo is also an online, adaptive language tutoring system, it is related to research
in Intelligent Tutoring Systems (ITS) (Sleeman, 1982). Although most ITS
were written for the context of teaching mathematics and problem-solving, there are various
systems written for the purpose of teaching a first or second natural language. For example
(Shaalan, 2005) mentions a computer-assisted language learning system for learning Arabic.
(Gamper & Knapp, 2002) give an overview of current systems and a concise classification
scheme.
3 System Overview
While students in Duolingo learn a foreign language and simultaneously translate text, not all
aspects of a language can be taught with translations. For those other aspects, we developed
tailor-made lessons. The lessons are manually created learning units that teach new grammatical
concepts and new vocabulary through interactive exercises called challenges. These challenges
cover all aspects of a language: writing, reading, listening and speaking. As a result Duolingo is
organized into two main areas: the lessons and the translations. The lessons’ purpose is to teach
the students new vocabulary and new grammatical concepts. With the translations, the students
practice the newly acquired knowledge by translating text from the real world.
While we know all the correct answers for the lessons we do not know the correct answers
for the translations beforehand. Therefore, when we grade a student’s answer we can do so
perfectly for the lessons but only approximately for the translations. All the underlying data for
the lessons is manually created whereas the content for the translations comes from the Internet.
We summarized these differences between lessons and translations in Table 1.
                          Lessons               Translations
teaches                   vocabulary, grammar   real-world usage
correct answer known      yes                   no
translating web content   no                    yes
data source               manual                web
Table 1. Comparing Lessons and Translations
In Duolingo, the lessons and translations are grouped into skills. A skill is a list of words. To
learn a skill means to learn the words in that list. Each skill has multiple lessons and multiple
translations associated with it. In the lessons that are associated with a skill we teach a subset of the
words in that skill. For example, the first lesson may teach the first seven words in that skill, the
second lesson the next seven words, etc. The translations that are associated with a specific skill
contain words that are also in that skill.
Currently, Duolingo teaches three languages: English, Spanish and German. The languages
were chosen by multiple criteria: (1) US demand, (2) world-wide demand, (3) languages spoken
by team members, (4) similarity to Western European languages. Table 2 lists the most
frequently learned languages in the US in 2006. Since in Duolingo students learn by translating
text we actually have to look at the learning directions. For example, currently, you can learn
German or Spanish if English is your native language. You can also learn English if Spanish is
your native language. However, we currently do not support the directions “learn English if
German is the language you know” nor “learn German if Spanish is the language you know”, etc.
We call the language that you are learning the learning language or the “to language”, and the
language that you are native in, the “from language”. Our goal is to add the directions French
(when you know English), English (when you know Portuguese) and English (when you know
Chinese) in 2012.
Language    % of enrollment
Spanish     52.2
French      13.1
German      6
Italian     5
Japanese    4.2
Chinese     3.3
Table 2. Most frequently learned languages in the US in 2006.
When students do a translation from the web, they always translate from their learning
language to their native language. Because every translation is thus written by a native speaker of
the target language, we expect the translations to be fluent.
Figure 2. System overview: the Lessons and Translations are built on top of the Skills.
4 Skills
The Duolingo skill system is our answer to the question: what should I learn (next)? In
Duolingo we teach 3,000 words. This is based on research that shows that if you have an active
vocabulary of 3,000 words you are fluent enough to have a basic conversation and you are able
to read a newspaper. For example (Hirsh & Nation, 1992) mention that with a vocabulary size of
2,000 words the reader will be familiar with 90-92% of the words in three novels. We chose the
3,000 words by taking the most frequently used words in the written and spoken language.
We then take those 3,000 words and split them up into distinct lists of words. Each list of
words we call a skill. Figure 3 shows some of the words being taught in the skill Basics 1 for
Spanish. The idea of the skills is to group words together that should be taught together. There
are three kinds of skills: (1) introductory skills, (2) vocabulary skills and (3) grammar skills. For
example, Basics 1 is an introductory skill. The goal of the introductory skills is to introduce very
basic words like the personal pronouns I, you, she that are necessary to get started. The goal of
the vocabulary skills is to teach new vocabulary. For example, the skill Food in Spanish teaches
various words related to food, like: wine, drink, eat. Finally, the goal of the grammar skills is to
teach new grammatical concepts like plurals or the past tense. We manually created between 40
and 50 skills for the languages we want to teach.
Each skill has a fixed number of skill points that students gain when they learn the skill. The
maximum number of skill points depends on the number of words that are being taught in a skill.
For example, the skill Basics 1 in Spanish teaches 27 words and students can gain 96 skill points.
For each skill, there are two thresholds: (1) the learning threshold and (2) the mastered
threshold. The learning threshold is set manually for each skill. When the student has more skill
points than the learning threshold we say that the student has learned that skill. The mastered
threshold is equal to the maximum number of skill points for that skill. When the student has
reached the maximum number of skill points, we say that the student has mastered that skill.

el/el<det><def><m><sg>
hombre/hombre<n><m><sg>
la/el<det><def><f><sg>
mujer/mujer<n><f><sg>
yo/prpers<prn><tn><p1><mf><sg>
un/uno<det><ind><m><sg>
una/uno<det><ind><f><sg>
soy/ser<vbser><pri><p1><sg>
niño/niño<n><m><sg>
niña/niño<n><f><sg>
él/prpers<prn><tn><p3><m><sg>
Figure 3. Some of the words being taught in Basics 1 together with their grammatical forms
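The two-threshold rule above can be sketched in a few lines; the function name and the example learning threshold of 50 are our illustrative choices, not Duolingo's actual values.

```python
def skill_status(points, learning_threshold, max_points):
    """Classify a student's progress on a skill from their skill points:
    'mastered' once the skill's maximum points are reached, 'learned' once
    the manually set learning threshold is exceeded."""
    if points >= max_points:
        return "mastered"
    if points > learning_threshold:
        return "learned"
    return "in progress"
```

With the Basics 1 numbers from the text (96 maximum skill points) and an assumed learning threshold of 50, a student with 60 points has learned but not mastered the skill.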
The skills are arranged in a skill tree where each node is a skill (a list of words) and there
exists an edge from a skill A to a skill B if the students need to learn B before they can learn A,
as described in Figure 4. Learning a skill means learning the list of words associated with the skill.
Thus, the skill tree enforces an order on what the students can learn at any given time. Figure 6
shows an example of a skill tree: after the students have learned the skill Basics 1 they can
continue with Basics 2 or Common Phrases.
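The unlocking rule implied by the skill tree can be sketched as a check over the prerequisite edges. The representation below (a dict from each skill to the set of skills on its incoming edges) and the small example tree, adapted from the skills named in the text, are our assumptions.

```python
def unlocked_skills(prerequisites, learned):
    """Return the skills a student may start: every prerequisite skill has
    been learned, and the skill itself has not been learned yet."""
    return {skill for skill, deps in prerequisites.items()
            if skill not in learned and deps <= learned}

# Each skill maps to the set of skills that must be learned first.
SKILL_TREE = {
    "Basics 1": set(),
    "Basics 2": {"Basics 1"},
    "Common Phrases": {"Basics 1"},
    "Plurals": {"Basics 2", "Common Phrases"},
}
```

For example, with nothing learned only Basics 1 is available; after learning Basics 1, both Basics 2 and Common Phrases unlock.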
Figure 4. There is an edge from skill B to skill A if you need to learn A before you can learn B
The skill tree not only gives the learning process a clear structure, it also allows the students
to “unlock” new skills. This process of “unlocking” skills is very motivating and is inspired
by technology trees in many successful strategy games such as Civilization (Meier & Bruce,
1991). When the students log in to Duolingo, the first thing they see is the skill tree as shown in
Figure 6. We arranged the skills manually into a skill tree by using textbooks as a source of
inspiration.
Each skill has its own page in Duolingo. The skill page appears when the students click on a
skill in the skill tree. On the skill page the students can (1) do a lesson, (2) do a translation or (3)
take a test.
Figure 5. The skill page for Basics 1. The students can do a lesson (on blue background), do a translation
(on dark green background) or take a test (top right).
Figure 6. An example of a skill tree. In this skill tree, the skill Plurals can only be learned after the skills
Phrases, Food, Animals have been learned.
5 Lessons
The goal of the lessons is to teach new vocabulary and new grammar concepts. Unlike the
translations, the correct answer is always known beforehand. Also, when the students do the
lessons they do not translate content from the web.
Each lesson is a sequence of 20 challenges. A challenge is a mini learning task. There are 6
different challenge types: name, listen, translate, form, speak and judge. We will discuss the
individual challenge types in the next chapter. At the start of every lesson the students have three
hearts. Whenever they make a mistake they lose one heart. When they have no more hearts and
they make another mistake they fail the lesson. When the students complete the lesson
successfully, they receive 10 skill points.
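The hearts mechanic just described can be sketched as a small state machine. This is a minimal sketch under our own assumptions (the class and method names are ours, and we assume a wrong answer simply repeats rather than consuming a challenge).

```python
class LessonSession:
    """A lesson: a sequence of challenges, three hearts, one heart lost per
    mistake, failure on a mistake with no hearts left, and 10 skill points
    on successful completion."""

    def __init__(self, num_challenges=20):
        self.remaining = num_challenges
        self.hearts = 3
        self.failed = False

    def submit(self, answer_correct):
        """Record one answer; return the skill points earned (10 when the
        lesson is completed successfully, else 0)."""
        if self.failed:
            return 0
        if answer_correct:
            self.remaining -= 1
        elif self.hearts > 0:
            self.hearts -= 1
        else:
            self.failed = True
        return 10 if self.remaining == 0 and not self.failed else 0
```

A student can thus make at most three mistakes; the fourth mistake fails the lesson.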
5.1 Challenges
5.1.1 Name Challenge
In the name challenge the student is shown four pictures and she has to type what she sees in
the pictures. The name challenges always ask for a single noun like “boy” or “woman”. For
languages that have the concept of a grammatical gender like German and Spanish the student
also has to pick the correct article (e.g. in Spanish “el” for masculine nouns and “la” for feminine
nouns). Figure 7 shows an example of a name challenge in Spanish. With the name challenge
the students improve their writing skills.
Figure 7. In the name challenge the student has to type the word that corresponds to the four pictures.
5.1.2 Translate Challenge
In the translate challenge the student sees a sentence in a source language and she has to
translate that sentence into a target language. There are two kinds of translate challenges,
depending on which language is the source and which is the target: (1) (normal) translate challenges,
where the learning language is equal to the source language and the target language is equal to
the student’s native language and (2) reverse translate challenges where the source language is
equal to the student’s native language and the target language is equal to the learning language.
Clearly, the reverse translate challenges are much harder than the normal translate challenges
because the student has to compose correct sentences in a foreign language. Figure 8 shows an
example of a translate challenge. With the translate challenges the
students improve their ability to read, write and understand the language that they are learning.
When the students hover over a word we give them context-sensitive translations of that
word. We call these context-sensitive translations hints. Our hints are usually very accurate and
we will see their use again in the translations in chapter 6 and their capabilities and
implementation in chapter 9.
Figure 8. In the (normal) translate challenge the student has to translate a single sentence.
The student’s translation is graded automatically by comparing it with a list of correct
translations. If it matches one of the correct translations we say that the translation is correct, else
it is wrong. However, our matching is lenient in the sense that we accept certain kinds of typos
and minor mistakes (like “a”/”an”). The correct translations come from our seed set that we will
discuss in chapter 5.2.
5.1.3 Judge Challenge
In the judge challenge the student sees a sentence and has to mark the correct translations
from a list of candidate translations. More than one translation can be correct and the student has
to select all of them. Figure 9 shows an example of a judge challenge. The correct answers again
come from our seed set. With the judge challenge we can highlight small but important
differences such as the relationship between subjects and objects; thus the students improve their
reading and writing skills.
Figure 9. In the judge challenge the student has to mark the correct translations from 3 candidates.
5.1.4 Listen Challenge
In the listen challenge (see Figure 10), the student listens to a short audio clip and is asked to
type the sentence she hears. The sentence is spoken in the student’s learning language and should
also be typed in the student’s learning language. Listen challenges are among the harder
challenges in a lesson. With the listen challenge the students improve their ability to understand
the spoken language.
Currently, we take a sentence from our seed set and run it through text-to-speech synthesis
(TTS) and then present it to the student. This way, we always know the correct answer and can
grade the student’s answer perfectly.
Figure 10. In the listen challenge the student hears a short sound clip and has to type what she heard.
5.1.5 Speak Challenge
In the speak challenge (see Figure 11), the student speaks a given sentence into the
microphone and the system grades the pronunciation. The system runs the utterance through a
speech recognition system and then compares the output to the reference sentence. The grading
either returns “correct” or “wrong”. Due to imperfections of the speech recognition system, the
students, unlike in the other challenge types, always get another chance when they fail a speak
challenge. Students who do not have a microphone can turn off speak challenges altogether.
With the speak challenge we improve the student’s ability to pronounce phrases in the language
that they are learning.
Figure 11. In the speak challenge the student has to speak a short phrase into the microphone.
5.1.6 Form Challenge
In the form challenge (see Figure 12) the student picks a word from a list of words so that the
resulting sentence is correct. Only one option is correct. With the form challenge we can teach
certain small differences of words like conjugations of verbs, gender agreement of adjectives,
etc.
Figure 12. In the form challenge the student has to pick the correct word from a list.
5.2 The Seed Translations
For each language direction that we teach (e.g. “learning Spanish when English is your native
language”) we have two files with correct translations. These files are manually created. Figure
13 shows a single line from the sentence translations file from Spanish to English. The bolded
word is the surface form of the word that is being tested with the sentence in blue color. The text
in green is the lexeme together with its lexical description. In this case, the lexical description
means that “están” is the conjugated form of the verb “estar” in the 3rd person plural in the
present tense. The lexical description helps us to disambiguate words that have the same surface
form but different syntactic meaning. For example, in English the surface form “sleep” can both
be a verb in infinitive form (“He is going to sleep.”) or a conjugated verb (“We sleep.”). The
sentence in blue color is the example sentence for “están”. Note that the word “están” must
occur in the example sentence in exactly the form described by the green lexical form and its
bolded surface form.
The words surrounded by curly braces are used for the form challenge. The first word is the
correct word (in this case “están”) and the other words (estoy/está/estás/estamos) are wrong
alternatives for “están”.
The sentence in red color is the display translation. This is what we consider the best
translation of the sentence in blue. Whenever a student types a wrong translation we show the
display translation as the correct solution. If there is more than one display translation we show
all of them. The translations in brown color are our accept translations. When a student types an
accept translation we treat the student’s translation as a correct translation. However, the
students generally do not see the accept translations in the user interface.
For each direction we need about 18,000 lines so that we can cover all the words introduced
in the skill files. Our internal goal is to have about three sentences for each lexeme that we teach.
están/estar<vblex><pri><p3><pl>||Hola, ¿cómo {están//estoy/está/estás/estamos} ustedes?||Hello, how
are you all?||[Hello/Hi], how are [you/you all/y'all]?
Figure 13. A line from the sentence translations file from Spanish to English.
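The line format just described ('||'-separated fields: tested word with its lexical description, example sentence with form-challenge options in curly braces, display translation, accept-translation patterns) can be parsed as sketched below. This is our reading of the format from the example line; the real parser may differ in details.

```python
import re

def parse_seed_line(line):
    """Split one seed-translation line into its fields. The curly-brace
    group {correct//wrong1/wrong2/...} encodes the form-challenge options."""
    word_spec, sentence, display, accepts = line.split("||")
    surface, lexeme = word_spec.split("/", 1)
    m = re.search(r"\{(.+?)//(.+?)\}", sentence)
    return {
        "surface": surface,          # e.g. están
        "lexeme": lexeme,            # e.g. estar<vblex><pri><p3><pl>
        "sentence": sentence,        # example sentence, braces included
        "display": display,          # best translation, shown on mistakes
        "form_correct": m.group(1),  # the correct form-challenge option
        "form_wrong": m.group(2).split("/"),  # the wrong alternatives
        "accepts": accepts,          # accept-translation patterns
    }
```

Running it on the example line from Figure 13 recovers “están” as the tested surface form and estoy/está/estás/estamos as the wrong form-challenge alternatives.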
5.3 Generating a Lesson
When the student clicks on a lesson from the skill page Duolingo generates a personalized
lesson for that student. The generation of that lesson takes as input (1) the skill to be taught, (2)
the lesson number and (3) the student’s history on Duolingo. The algorithm needs to output a
sequence of 20 challenges. There is an easy way to get the words that are being taught in each
lesson given a skill and a lesson number: the first lesson in a skill teaches the first seven words in
the skill, the second lesson teaches the next seven words, etc. Thus, skills that have a lot of words
have more lessons.
There are various requirements for generating lessons. First and foremost is that the students
learn the words or concepts being taught in that lesson. This requires a certain amount of
repetition. At the same time, the lessons should not be too repetitive because then they would
become boring. The algorithm needs to guarantee that the lesson is neither too hard nor too easy.
In addition, we want some variety in the types of challenges that we put in the lessons.
Our algorithm works in two stages: it first computes the possible challenges it can create for
each word that it needs to teach and then uses rules to put them in a good order. In the generation
phase we take the word and find a sentence in our set of seed translations (described in chapter
5.2.) that teaches that word. Once we have the word and the sentence we create possible
challenges from those. For example, we can turn each sentence into a listen challenge or a
translate challenge. We then add these challenges to our set of possible challenges. In the second
phase we use several rules to arrange the possible challenges into a good order. The rules are
direct adaptations of our requirements. For example, there is a rule that says that there should be
no three challenges of the same type consecutively. Another rule says that every challenge can
only introduce one new word. A third rule says that every word needs to appear at least twice to
guarantee that there is enough repetition.
There are generally two different ways of applying the rules resulting in two different lesson
generators. In the first way, we generate a sequence of 20 challenges and then check whether that
sequence passes all the rules. If it does not, we reject that sequence and generate a new one. We
repeat this until we find a good sequence or give up. We call this generator an
explorative lesson generator. In the second way, we go one by one, i.e. we take the current
subsequence of challenges and find a challenge such that the new subsequence passes all the
rules. This is similar to mathematical induction. We call this an iterative lesson generator. While
we started with an explorative lesson generator, we now employ an iterative lesson generator
because it is faster. In general our lesson generator can generate a personalized lesson in under
500ms. Algorithm 1 shows the pseudo-code of the iterative lesson generator.
Input:   a skill, a lesson number, the student’s learned words, the seed translations
Output:  a sequence of 20 challenges

Algorithm:
words = skill_words[lesson_number*7 : lesson_number*7 + 7]
possible_challenges = set()
sequence = []
for word in words:
    s = find_sentence_that_teaches(word, seed_translations)
    possible_challenges.update(gen_possible_challenges(word, s))
for i in range(20):
    # find_good_challenge returns a challenge c s.t. sequence+[c] satisfies all the rules
    c = find_good_challenge(sequence, possible_challenges, students_words)
    sequence.append(c)
return sequence
Algorithm 1. Pseudo-code for generating a lesson iteratively.
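A runnable sketch of find_good_challenge and the per-step rule checks follows. The representation of a challenge as a type plus the set of words it exercises is our assumption, and only two of the rules named in the text are checked (the “every word appears at least twice” rule is a property of the finished sequence, so it is omitted from this per-step check).

```python
import random

def satisfies_rules(sequence, candidate, known_words):
    """Check the example rules from the text against sequence + [candidate]."""
    new_seq = sequence + [candidate]
    # Rule: no three consecutive challenges of the same type.
    if len(new_seq) >= 3 and len({c["type"] for c in new_seq[-3:]}) == 1:
        return False
    # Rule: every challenge introduces at most one new word.
    seen = set(known_words)
    for c in new_seq:
        if len(c["words"] - seen) > 1:
            return False
        seen |= c["words"]
    return True

def find_good_challenge(sequence, possible_challenges, known_words):
    """Return a challenge that keeps the sequence rule-conforming, or None."""
    candidates = list(possible_challenges)
    random.shuffle(candidates)  # vary the generated lessons between students
    for c in candidates:
        if satisfies_rules(sequence, c, known_words):
            return c
    return None
```

After two consecutive listen challenges, for instance, a third listen challenge is rejected while a translate challenge is accepted.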
5.4 Grading the Students’ Answers
For the lessons we know all the correct answers since all the sentences come from our seed
set of translations. The grading algorithm is different for different challenge types.
For translate challenges we first find the translation that has the smallest Damerau-
Levenshtein edit distance (Damerau, 1964) (Levenshtein, 1966) from the student’s solution. The
Damerau-Levenshtein edit distance is equal to the normal edit distance but counts swaps of
letters (e.g. “house”/“huose”) as only one edit.
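The distance just described can be transcribed directly; the sketch below implements the restricted (optimal string alignment) variant, which is the common dynamic-programming formulation of Damerau-Levenshtein.

```python
def damerau_levenshtein(a, b):
    """Edit distance where an adjacent transposition ("house"/"huose")
    counts as a single edit, alongside insertion, deletion, substitution."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

The swap example from the text costs a single edit: damerau_levenshtein("house", "huose") is 1, whereas plain Levenshtein distance would count it as 2.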
Once we have the closest solution, we then compute a full token-based alignment between
the student’s answer and the closest solution. To catch typos, each matched token in the
student’s answer is allowed to be at most one character off, as long as the typed token is not
itself a valid word in the language that she is typing. For example, “hhouse” would be accepted
for “house” but “houses” would not, because “houses” is a valid English word.
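The per-token typo rule can be sketched as follows; the vocabulary set stands in for a real dictionary of the language, and the function names are ours.

```python
def within_one_edit(a, b):
    """True if a equals b or can be turned into b with one character
    insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # exactly one substituted character
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # one inserted/deleted character somewhere in the longer string
    return any(short == long_[:i] + long_[i + 1:] for i in range(len(long_)))

def accept_token(typed, correct, vocabulary):
    """The typo rule from the text: a token one character off is accepted
    only if it is not itself a valid word of the language."""
    if typed == correct:
        return True
    return typed not in vocabulary and within_one_edit(typed, correct)
```

This reproduces the example above: “hhouse” passes for “house”, while “houses” is rejected because it is itself a valid word.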
For listen challenges we already know what the students are supposed to type so we only
need to compare the student’s answer with that single correct answer. We then proceed by
computing an alignment and catching typos like we did for the translation challenges.
For name challenges, we need to check whether the article and the typed word are correct. As
there can be multiple correct words for a given set of images, we first find the closest match and
compare against it in the same way as with translate challenges.
For form challenges and judge challenges, we only need to check if the student’s answer
matches exactly our correct answer.
For speak challenges, we run the student’s audio recording through a speech recognition
system and then compare the text with the text they were supposed to speak. If it is close enough
we call it “correct”, else we call it “wrong”. The details of the actual comparison are beyond the
scope of this thesis proposal.
5.5 Giving Detailed Feedback
Whenever possible we try to find the particular mistake that the student made and give her
useful feedback. For example, if the student types “el niña” instead of “la niña” we tell the
student that “niña” is of feminine gender and that the feminine gender requires the article “la”.
We have a similar filter for German. However, due to the grammatical case system in German,
the filter needs to take both case and gender into account. Another peculiarity in German is that
nouns must be capitalized. We built a special filter to catch this kind of mistake. Note that our
filters for German have very high accuracy but currently do not cover all possible cases because
that would require a complete grammatical understanding of the sentences (i.e. a parse tree).
Other common mistakes for which we have built specific filters are: missing a word, typing a
wrong word and typos.
Table 3 gives an overview of the feedback filters.
Filter name        Languages         Description                                          Example                Correction
gender             Spanish           Detects the wrong use of grammatical gender          El niña come.          niña is feminine
gender_case        German            Detects the wrong use of grammatical gender or case  Er hat ein Apfel.      Use “einen” for masculine nouns in the accusative case
casing             German            Detects when a noun is lowercased                    Er hat einen apfel.    Nouns like “Apfel” are capitalized in German
accent/umlaut/eñe  German, Spanish   Detects a missing accent/umlaut/eñe                  La nina come.          Pay attention to the ñ character
missing_word       German, Spanish   Detects a missing word                               Er einen Apfel.        You missed the word “hat”
wrong_word         German, Spanish   Detects a wrong word                                 Er hatte einen Apfel.  “hatte” is wrong
typo               German, Spanish   Detects a typo                                       Err hat einen Apfel.   “Err” is a typo.
Table 3. An overview of the feedback filters when the correct German sentence is “Er hat einen Apfel.”
and the correct Spanish sentence is “La niña come.”
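As an illustration, the Spanish gender filter can be sketched like this. The noun-gender lexicon and all names below are hypothetical; a real system would derive genders from the lexical descriptions in the seed files (e.g. niña/niño<n><f><sg>).

```python
# Hypothetical noun-gender lexicon, keyed by surface form.
NOUN_GENDER = {"niña": "f", "niño": "m", "mujer": "f", "hombre": "m"}
# Article/gender pairs that agree.
AGREEING = {("el", "m"), ("la", "f"), ("un", "m"), ("una", "f")}

def gender_feedback(article, noun):
    """Sketch of the 'gender' filter: return a targeted message when the
    article does not agree with the noun's gender, else None."""
    gender = NOUN_GENDER.get(noun)
    if gender is None or (article, gender) in AGREEING:
        return None
    expected = "la" if gender == "f" else "el"
    kind = "feminine" if gender == "f" else "masculine"
    return f'"{noun}" is {kind}; the {kind} gender requires the article "{expected}"'
```

For the Table 3 example, gender_feedback("el", "niña") produces a message pointing at the feminine article, while the correct "la niña" yields no feedback.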
6 Translations
The translations are the lifeblood of Duolingo. The goal of the translations is two-fold:
(1) make the students learn with real-world examples and (2) translate text from the web. Unlike
the lessons, the correct answers (i.e. the translations of sentences) are generally unknown (if they
were known we would not have to translate them in the first place). The translations, like the
lessons, are reachable from the skill pages (see Figure 5).
Before anything else, we take the corpus of documents that we want to translate (i.e. the web)
and split the documents up into sentences. Whenever a student starts a translation, they are
recommended a single sentence to translate, as seen in Figure 14. The students can then proceed
and translate that sentence or practice first. If they click on “practice first”, they will be shown a
short lesson for the highlighted words. The highlighted words are words that our model thinks
the student does not know (or does not know well enough) and thus require practice. This
practice lesson is generated with the lesson generator described in chapter 5.3.
Figure 14. The translation overview page: the student can either practice the highlighted words first and
then translate or skip practice and translate immediately.
6.1 Associating Translations with Students and Skills
As discussed above, the translations are reachable from the skill pages, but which translation
sentence s should we show to student u on skill page p? Our current algorithm for finding a
sentence looks at all the sentences that include words taught in the corresponding skill and then looks
at which one of those sentences is of the right difficulty for student u. The hard part is estimating the sentence
difficulty. Therefore, a core component of Duolingo is the sentence difficulty estimator that we
discuss in chapter 6.2.
6.2 Estimating Sentence Difficulty
Obviously there is no global sentence difficulty, i.e. the same sentence is harder for beginners
than it is for advanced students. Thus we model sentence difficulty by looking at P_correct(u, s),
the probability that student u gets sentence s correct. We look at two different kinds of
features for building our model: (1) student features and (2) sentence features.
Student features                      Sentence features
number of skill points                number of words
number of days since last activity    number of characters
native language                       number of pronouns
                                      number of nouns
                                      number of verbs

Table 4. Some student and sentence features for estimating P_correct(u, s)
Table 4 shows some of the student and sentence features that we use in our current estimator.
The estimator uses logistic regression and was trained on a set of training data. We consider a
sentence to be of the right difficulty if P_correct(u, s) is within certain bounds.
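A logistic-regression estimator over such features can be sketched as below. The weights are illustrative hand-set values, not trained ones, the feature names are assumptions drawn from Table 4, and the difficulty bounds are made up, since the actual bounds are not stated.

```python
import math

# Illustrative, hand-set weights; the real model's weights are learned.
WEIGHTS = {
    "bias": 1.0,
    "skill_points": 0.002,            # more experience -> easier
    "days_since_last_activity": -0.05,
    "num_words": -0.15,               # longer sentences -> harder
    "num_verbs": -0.10,
}

def p_correct(features):
    """P_correct(u, s) under a logistic-regression model: the sigmoid of a
    weighted sum of student and sentence features."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def right_difficulty(features, low=0.5, high=0.9):
    """A sentence is 'of the right difficulty' for a student when the
    predicted probability falls within chosen bounds (bounds assumed)."""
    return low <= p_correct(features) <= high
```

Under these weights, a short sentence for an experienced student scores a much higher predicted probability than a long, verb-heavy one.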
6.3 Translating Sentences and Rating Translations
Once the students start a translation, they see a screen as in Figure 15. The student sees the
recommended sentence and is asked to translate the sentence from the learning language (here
Spanish) to her native language (here English). When the student hovers over a word we show
context-sensitive hints for that word. Our hints are so precise that with their help even total
beginners (i.e. monolinguals) can translate most sentences correctly without any problems. Once
the sentence is entered we grade it immediately by comparing it with the correct translations
from other students for the same sentence (see chapter 6.4). If the agreement with other correct
solutions is high enough we accept the student’s answer as correct, otherwise we say it is wrong.
Figure 15. Translating a single sentence from the web. When the student hovers over a word we show
context-sensitive hints. After entering a sentence, we compute the percentage agreement between the
student’s solution and other correct solutions (here: 77% agreement).
After that, the student sees a screen as in Figure 16. In this screen the student grades the
translation of the same sentence by a different student. Because the student just translated
this sentence, its meaning is still fresh in memory and the student can tell
us whether this other translation is good or not. The ratings are necessary for grading other students
and also for compiling the final translation of a document. After the student has both translated the
sentence and rated a translation from a different student, the student can continue
with the next (or any other) sentence of the document or go back to the skill page.
Figure 16. Rating a translation from a different student. The scale is {bad, medium, very good}.
6.4
Grading Translations
When we want to grade a translation, we first compute a set of correct translations from all
the translations the students have entered, by looking at the ratings those translations have
received from other students. We do this by converting the ratings into up and down
votes: we treat bad and medium as down votes and very good as an up vote. We then view the
up and down votes for each candidate translation as draws from a binomial distribution B(n, p) and
compute the Wilson 90% confidence interval for p. Finally, we define a candidate translation as
correct if the lower bound of that interval is greater than 50%. In other words, under these assumptions the
probability that the true p is smaller than 50%, given that we say the translation is correct, is less
than 10%. When under these conditions there exists no correct translation, we define the
two currently highest-rated candidates (i.e. those with the highest lower bound for p) as the
best translations. If there is not yet any rating, we define the machine translation as the correct
translation.
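The voting rule can be sketched like this; z = 1.645 (the two-sided 90% normal quantile) and the example vote counts are illustrative.

```python
import math

# Sketch of the grading rule: convert ratings to up/down votes, compute the
# Wilson 90% confidence interval for p, and accept a candidate translation
# when the interval's lower bound exceeds 0.5.

def wilson_lower_bound(ups, downs, z=1.645):
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    denom = 1 + z * z / n
    center = phat + z * z / (2 * n)
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def is_correct(ratings):
    ups = sum(1 for r in ratings if r == "very good")
    downs = len(ratings) - ups  # "bad" and "medium" count as down votes
    return wilson_lower_bound(ups, downs) > 0.5

print(is_correct(["very good"] * 9 + ["medium"]))  # many up votes -> True
print(is_correct(["very good", "bad", "medium"]))  # mixed votes -> False
```

The Wilson interval is preferable to the raw vote fraction here because it penalizes candidates with only a handful of votes.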
Once we have the set of correct translations we compute the METEOR (Banerjee & Lavie,
2005) score between the student’s text and the set of correct translations and return that as our
grade (“agreement with other correct translations”) to the student as seen in Figure 15.
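METEOR itself matches unigrams via stems and synonyms and applies a fragmentation penalty; as a rough stand-in only, the following sketch scores the student's text against each correct translation with a plain unigram F-measure and reports the best agreement.

```python
# Simplified illustration of grading against a set of correct translations.
# This is NOT METEOR; it only mimics the "agreement" idea with exact unigrams.

def unigram_f(candidate, reference, beta=0.5):
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = len(set(cand) & set(ref))
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Weighted harmonic mean; beta=0.5 gives the plain F1 score.
    return precision * recall / (beta * precision + (1 - beta) * recall)

def grade(student_text, correct_translations):
    """Agreement with the best-matching correct translation, in [0, 1]."""
    return max(unigram_f(student_text, ref) for ref in correct_translations)

refs = ["the cat sleeps on the sofa", "the cat is sleeping on the couch"]
print(round(grade("the cat sleeps on a sofa", refs), 2))  # -> 0.83
```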
6.5
Compiling Final Translations
We say a translation of a document is done when we have a correct translation for each
sentence, i.e. when for every sentence we have a translation whose lower bound for p is larger
than 0.5. Once a document is done, we compile a final translation by going through each sentence
and picking the candidate with the highest lower bound for p as the final translation for that
sentence.
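A sketch of the compilation step; the candidate translations and lower bounds below are made up for illustration.

```python
# Compile the final document translation: for every source sentence, pick the
# candidate with the highest Wilson lower bound for p.
# `candidates` maps each source sentence to (translation, lower_bound) pairs.

def compile_translation(candidates):
    final = []
    for options in candidates.values():
        best_translation, _ = max(options, key=lambda pair: pair[1])
        final.append(best_translation)
    return " ".join(final)

doc = {
    "El gato duerme.": [("The cat sleeps.", 0.81), ("The cat is asleep.", 0.65)],
    "Es pequeño.": [("It is small.", 0.72), ("It's tiny.", 0.58)],
}
print(compile_translation(doc))  # -> The cat sleeps. It is small.
```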
7 Tests
So far we have assumed that students start with no prior knowledge of the language they
are learning. This is unrealistic, so there must be a way for more advanced students to go
through Duolingo at a faster rate. For those students, we developed tests. For each skill, the
students can take a test. If they pass the test, they immediately get all the skill points for that skill
(i.e. the skill becomes mastered) and the skills below that skill in the skill tree are unlocked.
In total, students have three tries for each skill. If they fail three times, they must take the
long route and do the lessons.
The tests are generated with a generator very similar to the lesson generator described in
chapter 5.3. Instead of picking the words consecutively as in the lessons, we pick seven
words at random and then continue the algorithm exactly as in the lessons. There are no hints in
tests.
8 Practice
Since people easily forget vocabulary, Duolingo has a feature to practice words that the
students have forgotten or are about to forget. The current algorithm finds the words that the
student has most likely forgotten by looking at when each word was last seen. It then
creates a practice session that teaches those words, using a generator similar to the lesson
generator. Duolingo advises students to practice at least once a day. When Duolingo notices a
lack of practice and the student has turned on practice reminders, Duolingo sends the student
an email reminder. Every practice session gives 10 skill points but does not unlock new skills.
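A sketch of the practice selector, under the simplifying assumption that only the time since a word was last seen signals forgetting; the words and dates are illustrative.

```python
from datetime import date, timedelta

# Pick the words seen longest ago as the ones most likely forgotten.
# `last_seen` maps word -> date the student last saw it.

def words_to_practice(last_seen, limit=7):
    return sorted(last_seen, key=lambda w: last_seen[w])[:limit]

today = date(2012, 4, 4)
last_seen = {
    "perro": today - timedelta(days=30),
    "gato": today - timedelta(days=1),
    "agua": today - timedelta(days=10),
}
print(words_to_practice(last_seen, limit=2))  # -> ['perro', 'agua']
```

A production system would likely also weight word difficulty and how often the student got each word right, not just recency.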
9 Context-Sensitive Dictionary Hints
As seen in Figure 15, Duolingo includes an advanced dictionary module for providing
context-sensitive hints for every word that the student hovers over. Among other features, our
system can translate conjugated verbs in context, common phrases of multiple words, German
compound nouns and German separable verbs. In German, there are compound nouns that
translate into multiple words in English, for example “Mannschafts|bus” (team bus). Another
peculiarity of German is separable verbs. In German, a verb can split into two parts: a main part
that stands behind the subject and a particle at the end of the sentence. We found that
students get confused by separable verbs because they do not understand the role of the little
particle at the end of the sentence. However, our advanced dictionary module can highlight
both parts when the student hovers over either part, making it clear that these words belong
together, and we can give the correct dictionary translation as well. Table 5 shows some example
sentences and hints.
Feature(s)                        Example                          First hint for blue word(s)
conjugation                       ¿Tú duermes entre ellos?         (you) sleep
multi-word hints, conjugation     De Florencia nos cuentan, …      (they/you-plural) tell (us)
multi-word hints, conjugation     Disculpe, si me duermo.          (I) sleep
multi-word hints, phrases         …, en el que hubo …              where
German compound word              Donnerstag|morgen                Thursday morning
German compound word              Flug|feld|kontrolleur            flight field inspector (airfield inspector)
German separable verbs            Er kommt heute an.               (he/she/it/you-plural) arrives

Table 5. Features of the dictionary module
10 Research Plan
Since Duolingo tries to achieve two goals at the same time, we must scientifically
evaluate it with respect to both. The first main goal of Duolingo is to translate text.
From data gathered in the private beta, we are confident that the translations produced by
Duolingo are of high quality. Nevertheless, we want to evaluate the translations produced by Duolingo
with existing metrics, both automated (BLEU (Papineni, Roukos, Ward, & Zhu, 2002),
METEOR (Banerjee & Lavie, 2005)) and manual (rating scale), and compare them against
translations produced by machine translation and by professional translators. We also want to determine
the translation capacity of Duolingo, i.e. how many sentences we can translate per second
per active user. This leads to the following two hypotheses:
Hypothesis 1. The translations produced by Duolingo are as good as those produced by
professional translators.
Hypothesis 2. The translation capacity is high enough to translate the web in reasonable
time.
The second main goal of Duolingo is to teach students a foreign language. We want to figure
out if students actually do learn a foreign language with Duolingo. Anecdotal evidence shows
that they do. However, we want to evaluate Duolingo with established tests such as the CAPE
(Computer-Adaptive Placement Exam) (Madsen, 1991) test, TOEFL (Test of English as a
Foreign Language) and ACTFL (The American Council on the Teaching of Foreign Languages)
OPIc (Oral Proficiency Interviews) and compare it against existing methods, both from the off-
and online world. This leads to our third hypothesis:
Hypothesis 3. Students learn with Duolingo as well as with comparable language learning
products.
A third goal is to derive a theoretical model of student behavior on the site. In particular, we
want to develop three models: (a) a knowledge model that predicts which challenges a student
can answer correctly at any given point in time (similar to the difficulty estimator described in
chapter 6.2), (b) a learning model that predicts how the student’s (unobservable) mental state
updates after a certain action on the site, (c) a motivational model that predicts if and when a
particular student will come back to the site. These lead to hypotheses 4-6.
Hypothesis 4. We can build a model that predicts with high accuracy which challenges a
student can answer correctly.
Hypothesis 5. We can build a model that predicts with high accuracy how the students’
actions affect their learning outcomes.
Hypothesis 6. We can build a model that predicts with high accuracy if and when students
return to the site.
Finally, we want to figure out how these three goals can help us improve the site for
both learning and translation. In particular, a long-term vision for Duolingo is to build a system
that learns how people learn. Thus, Duolingo sits at the intersection of human
learning and machine learning.
11 References
Amazon. (n.d.). Amazon Mechanical Turk. Retrieved from http://www.mturk.com
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgments. Proceedings of Workshop on Intrinsic and
Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the
Association of Computational Linguistics (ACL-2005). Ann Arbor, Michigan.
Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using
Amazon's Mechanical Turk. Proceedings of the Conference on Empirical Methods in Natural
Language Processing. Stroudsburg.
Sleeman, D., & Brown, J. S. (1982). Intelligent Tutoring Systems. Science, 456-462.
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7(3), 171.
Gamper, J., & Knapp, J. (2002). A Review of Intelligent CALL Systems. Computer Assisted
Language Learning, 329-342.
Hirsh, D., & Nation, P. (1992). What Vocabulary Size is Needed to Read Unsimplified Texts
for Pleasure. Reading in a Foreign Language, 689-696.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and
reversals. Soviet Physics Doklady, 707-710.
Madsen, H. S. (1991). Computer-adaptive testing of listening and reading comprehension.
Computer-assisted language learning and testing: Research issues and practice, 237-257.
Meier, S., & Shelley, B. (1991). Civilization. MicroProse.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic
Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association
for Computational Linguistics (ACL) (pp. 311-318). Philadelphia.
Romaine, M., & Richardson, J. (2009). Translation Industry Report 2009. Retrieved April 4,
2012, from mygengo.com: http://mygengo.com/express/report/translation-industry-2009
Shaalan, K. F. (2005, February). An Intelligent Computer Assisted Language Learning
System for Arabic Learners. Computer Assisted Language Learning, 81-108.
von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). reCAPTCHA:
Human-Based Character Recognition via Web Security Measures. Science, 1465-1468.
Zaidan, O. F., & Callison-Burch, C. (2011). Crowdsourcing Translation: Professional Quality
from Non-Professionals. Proceedings of ACL. Portland.