
Introduction

Background and Motivation

Adaptive learning is a software design technique built around the maturing technology of recommender systems (Anelli et al. 2022) to provide a personalized and optimized learning experience to its users (Chen et al. 2017). The recommender system relies on a psychometric assessment relevant to the domain the software aims to teach, but no such metric exists in the context of second language (L2) teaching. Today, most recommender systems are implemented in contexts where this metric is readily available, being either attention, in the case of social media, or customer spending, in the case of marketplaces. But how can one optimize for a construct as elusive as fluency? How can one test quickly enough that the test can be retaken on a regular basis? How can one guarantee that the results are precise enough to identify the small gains made over a short period of time, and thus rate the pedagogical value of a given material?
While specialists are still debating the nature of the constructs that would constitute a definition of fluency (Winke & Brunfaut 2021), the present work does not intend to build a complete language proficiency test, nor does it even support the idea that such a test can ever be created. This does not, however, mean that progress in L2 acquisition cannot be measured efficiently. Indeed, research has found that mere vocabulary recognition tests yield results that correlate with a range of other skills usually assessed in larger language tests, at different proficiency levels (Meara & Jones 1988, Meara 1994, Lemhöfer & Broersma 2012). These tests also have the strength of taking only a few minutes to administer. The rationale behind them lies in the fact that identification is the first stage of all further degrees of acquisition. At the same time, acquiring a new word is a discrete step in a never-ending journey, even for native speakers, making it an excellent basis for psychometric measurement.
But these tests also have caveats limiting their scalability. First, they rely on lists of handcrafted pseudo-words, whose creation is time-consuming and requires expertise in the language in which the pseudo-words are supposed to be plausible. Second, once the words and pseudo-words are constituted, they are tested in preliminary studies, which is too resource-intensive for non-WEIRD environments (Henrich et al. 2010). This study aims to address these two challenges and proposes a framework to evaluate the value of the psychometric results gathered.

Aims and Objectives

The final aim of this study is to create a metric able to track the speed of progress of L2 learners at any stage of their learning journey, with the prospect that this metric could help optimize adaptive language learning systems. To this end, this dissertation develops and evaluates solutions that enable both the horizontal and vertical scalability of binary-choice vocabulary tests. Vertical scalability is defined as the ability to increase the capacity of a single test. Horizontal scalability is defined as the ability to scale up the range of languages that can be tested. To validate the achievement of these goals, the following concrete objectives are set:

  1. Vertical scalability: run a large-scale online study and obtain consistent results when repeatedly testing the same people.
  2. Horizontal scalability: achieve the above objective with a test for at least one low-resourced language.
  3. Analysis: develop a methodology to assess the relevance of the technical decisions and explore possible alternative ways to treat the results of a binary-choice vocabulary test in the context of adaptive learning solutions.

Research Question, Hypotheses and Methodology

Research Question

While the dissertation's methodology and implications span the cognitive sciences, the research question itself is confined to the fields of applied linguistics and psycholinguistics. By attempting to clarify the nature of vocabulary knowledge across a speaking population, it seeks to evaluate the relevance of the technical solutions introduced in the dissertation.

Does the difficulty distribution of word items in a speaking population support the idea of a continuous or segmented progression in vocabulary level?

This research question leads to the following two hypotheses.

Hypotheses

In the first case, we postulate that, as words have different difficulty levels, the ability to recognize some items within a given difficulty range uniformly increases the chances of recognizing other words of a similar difficulty level. This idea of a uniform progression in vocabulary knowledge, where speakers learn words in roughly the same order, analogously to a Zipfian progression, is what we call the linear progression hypothesis, or hypothesis 1.
In the second case, we assume that the speakers of a language are segmented into groups across which the chances of recognizing particular groups of items vary, even at similar vocabulary levels. In this scenario, a large portion of the tested vocabulary would seem to share the same overall chances of being recognized, but the test results would show that groups of people are only able to recognize domain-specific groups of words. Here, the description of a subject's level should reflect the diversity of the vocabulary domains known, challenging the idea that words are intrinsically easy or hard. We name this the clusters' hypothesis, or hypothesis 0, because the difficulty distribution of the vocabulary would form clusters instead of displaying a uniform progression. If this hypothesis turns out to be true, it would show the need for a multidimensional representation (Chen et al. 2017), built in a fully data-driven way, of both the expertise of a speaker and the characteristics of a word, instead of the one-dimensional, one-size-fits-all vocabulary level score used in this dissertation, in order to accurately predict the results of a task and build a recommender system relevant to vocabulary knowledge.

Methodology

To answer the research question, the test must be scaled up in order to gather data on large samples of vocabulary, if not on an entire dictionary of the language to be investigated. This will be achieved by running an open-access test in the form of a web application, on which anyone can test themselves. With the large sets of real words to be tested, equally large sets of language-specific pseudo-words have to be created. This is the first obstacle addressed in this section. A second issue discussed here is how to calibrate a measurement system for both the items and the test takers without running preliminary studies.

Generating the Pseudo-Words

A large set of real words can easily be sourced by randomly sampling K entries from any dictionary, but constituting an equally large set of orthographically and phonotactically plausible pseudo-words is more challenging. Handcrafting the pseudo-words is not an option for two reasons. First, the process would require expertise in the language in which one wants to create plausible yet non-existent words. Second, even if this expertise were available, the number of items to produce requires resources beyond what most languages around the world have at their disposal. This is why the pseudo-words must be generated programmatically. The most common method today is to treat words as Markov chains over character n-grams, usually bigrams or trigrams (New et al. 2023). This technique has the advantage of being easy to understand and to implement with limited programming skills, but it has two shortcomings that must be taken into account. First, the algorithm struggles to produce plausible very short words, fewer than three or four characters long; this can help test takers identify short real words as real, or force the test maker to remove shorter words from the real-word list. The second problem is more insidious: some languages exhibit long-distance phonotactic dependencies, where non-adjacent parts of a word interact. In Welsh, for instance, some letters or groups of letters are only "legal" in precise parts of a word, relative to the stressed syllable or to the end of the word. As n-gram Markov models cannot account for these language-specific subtleties, the pseudo-words used in the test will be generated by recurrent neural networks (RNNs) trained to predict the next character of the real words used in the test.
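To make the baseline concrete, the following is a minimal sketch of the n-gram approach described above, using character bigrams in Python; the word list, length bounds and function names are illustrative assumptions, not part of the actual test implementation.

```python
import random
from collections import defaultdict

def train_bigram_model(words):
    """Collect character-bigram transitions, using ^ and $ as word boundaries."""
    transitions = defaultdict(list)
    for word in words:
        chars = ["^"] + list(word.lower()) + ["$"]
        for current, nxt in zip(chars, chars[1:]):
            transitions[current].append(nxt)
    return transitions

def generate_pseudoword(transitions, min_len=5, max_len=12, max_tries=100):
    """Sample characters from the bigram model until the end marker is drawn."""
    for _ in range(max_tries):
        char, letters = "^", []
        while True:
            char = random.choice(transitions[char])
            if char == "$":
                break
            letters.append(char)
        if min_len <= len(letters) <= max_len:
            return "".join(letters)
    return None  # no candidate of acceptable length found

# Illustrative input: in practice, the real words sampled from a dictionary.
real_words = ["language", "window", "garden", "mountain", "yellow", "pattern"]
model = train_bigram_model(real_words)
candidate = generate_pseudoword(model)
# Reject candidates that accidentally coincide with a real word before using them.
if candidate and candidate not in set(real_words):
    print(candidate)
```

An RNN-based generator follows the same sampling loop, but replaces the bigram lookup table with a trained next-character predictor, which is what allows it to capture the long-distance dependencies mentioned above.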

Measurement Model and Item Selection

The second obstacle is the calibration of the results. Previous tests rely on preliminary studies to select items that discriminate between proficiency levels. The format of a dissertation does not allow for preliminary studies; instead, the test proposed here continuously updates the difficulty level of the word items to assess the level of the test takers, while the level of the test takers is in turn used to calibrate the difficulty level of the items. This problem of co-calibration of difficulty and skill levels has been solved independently several times in history and led to the creation of item response theory (IRT), of which the one-parameter model, also called the Rasch model, is the best known. This framework can readily be implemented to assess both the test takers' skill level and the items' difficulty level. Another divergence from the format of similar vocabulary tests is that not all the word items are intended to be tested in a single session. The test should not take more than five minutes to administer, but the pool of available items numbers in the thousands. Furthermore, if a test taker only knows a small fraction of these items, a random selection would require many answers, potentially hundreds per testing session, before gathering any statistically meaningful result. In order to stay relevant for all types of language learners, item selection has to display some degree of adaptivity through the implementation of a testing-item recommender system. To this end, the difficulty rating of the items and the skill rating of the test takers must be updated in real time. This is exactly what the Elo rating system does: it is based on the same equations as the Rasch model, although it was invented independently, and uses slightly different parameters to keep the ratings human-readable. For a comparison of the Elo rating system with other IRT algorithms, refer to Pelánek (2016).
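As a minimal sketch of how this real-time co-calibration could work, the snippet below applies the classical Elo update to one learner and one item; the K-factor, logistic scale, starting ratings and data structures are illustrative assumptions, not the parameters of the deployed test.

```python
def expected_recognition(learner_rating, item_rating, scale=400.0):
    """Elo expectation: probability that the learner recognizes the item."""
    return 1.0 / (1.0 + 10 ** ((item_rating - learner_rating) / scale))

def update_ratings(learner_rating, item_rating, recognized, k=32.0):
    """Update both ratings after one binary answer (True = word recognized)."""
    expected = expected_recognition(learner_rating, item_rating)
    outcome = 1.0 if recognized else 0.0
    delta = k * (outcome - expected)
    # The learner "wins" against the item when the word is recognized:
    # the learner's skill rating rises and the item's difficulty rating drops.
    return learner_rating + delta, item_rating - delta

def select_next_item(learner_rating, unseen_items):
    """Adaptive selection: pick the item whose expected recognition is closest to 0.5."""
    return min(
        unseen_items,
        key=lambda item: abs(expected_recognition(learner_rating, item["rating"]) - 0.5),
    )

# Illustrative session: ratings are arbitrary starting values.
learner = 1500.0
items = [{"word": "cat", "rating": 1200.0}, {"word": "obfuscate", "rating": 1800.0}]
chosen = select_next_item(learner, items)
learner, chosen["rating"] = update_ratings(learner, chosen["rating"], recognized=True)
```

Selecting items whose expected recognition probability is close to 0.5 maximizes the information gained from each answer, which is the adaptive behaviour the recommender system is meant to provide.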

Evaluating the Results

Once the results are gathered, the hypotheses are tested by calculating the mean silhouette scores of K-means clusterings for different values of K. If the silhouette scores stay low for all values of K, then hypothesis 0 is invalidated and the idea of a continuous progression in vocabulary knowledge is supported. If a strong silhouette score can be identified for some value of K, then hypothesis 1 is invalidated and the need for another, multidimensional representation of vocabulary difficulty and recognition skill is supported.
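A minimal sketch of this evaluation, assuming a response matrix with one row per test taker and relying on scikit-learn; the construction of the matrix, the range of K and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def mean_silhouette_by_k(response_matrix, k_values=range(2, 11), seed=0):
    """Return the mean silhouette score of a K-means clustering for each K."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(response_matrix)
        scores[k] = silhouette_score(response_matrix, labels)
    return scores

# Illustrative data: 200 test takers x 50 items, 1 = recognized, 0 = not recognized.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 50))
print(mean_silhouette_by_k(responses))
```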

Limitations

Several limitations must be acknowledged here. Firstly, the test is meant to be open, so no differentiation is made between native and non-native speakers, or between different backgrounds in general, although different populations may face different challenges related to their L1 when recognizing real words or pseudo-words (Meara 2012). Secondly, although the testing process itself is fast to administer, there is no guarantee that the rating difference over frequent testing sessions will capture progress made over the span of only a week, especially if the clusters' hypothesis is confirmed. In fact, the question of reliability should be addressed in light of the results. Finally, the proposed design is not truly scalable to all languages without the creation of an oral version of the test, which would bring its own problems. However, it is important to point out that none of these limitations directly conflicts with the research question being investigated. Another risk that must be acknowledged and communicated clearly to the test takers is that of a potential misinterpretation of the test results and the misuses that could follow from such misinterpretation. Once again, the test is not a full language test but primarily a vocabulary test, and while one can argue that vocabulary is the most important construct in learning a language, it is not the only one.

Discussion

Once the results have been analyzed, especially if hypothesis 1 is invalidated, the goal would naturally be to create a model able to predict as precisely as possible the chances of a word being recognized from as few test results as possible. This could be achieved with neural networks. Such a model would then make it possible to predict the next words a given learner is most likely to know, even without historical data on the progress of test takers. In this sense, a practical extension of the project would be to create a vocabulary list generator based on test results, or a learning material selector based on either generated or publicly available pedagogical resources. This shows the relevance of the test as the basis for a more complex recommender system and its direct application in an embryonic adaptive language learning system, where the results of previous recommendations inform future recommendations and optimize for long-term growth in the measured skill.

References

  • Anelli, V.W., Bellogín, A., Di Noia, T., Jannach, D., Pomo, C., 2022. Top-N Recommendation Algorithms: A Quest for the State-of-the-Art, in: Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’22. Association for Computing Machinery, New York, NY, USA, pp. 121–131. https://doi.org/10.1145/3503252.3531292

  • Chen, Y., Li, X., Liu, J., Ying, Z., 2017. Recommendation System for Adaptive Learning. Applied Psychological Measurement. https://doi.org/10.1177/0146621617697959

  • Henrich, J., Heine, S.J., Norenzayan, A., 2010. Most people are not WEIRD. Nature 466, 29. https://doi.org/10.1038/466029a

  • Lemhöfer, K., Broersma, M., 2012. Introducing LexTALE: A quick and valid Lexical Test for Advanced Learners of English. Behav Res 44, 325–343. https://doi.org/10.3758/s13428-011-0146-0

  • Meara, P., 2012. Imaginary Words, in: The Encyclopedia of Applied Linguistics. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781405198431.wbeal0524

  • Meara, P., 1994. The complexities of simple vocabulary tests. Curriculum research: Different disciplines and common goals 15–28.

  • Meara, P., Jones, G., 1988. Vocabulary size as a placement indicator.

  • New, B., Bourgin, J., Barra, J., Pallier, C., 2023. UniPseudo: A universal pseudoword generator. Quarterly Journal of Experimental Psychology 30. https://doi.org/10.1177/17470218231164373

  • Pelánek, R., 2016. Applications of the Elo rating system in adaptive educational systems. Computers & Education 98, 169–179. https://doi.org/10.1016/j.compedu.2016.03.017

  • Winke, P.M., Brunfaut, T., 2021. The Routledge Handbook of Second Language Acquisition and Language Testing, 1st ed., Routledge Handbooks. Routledge, Taylor & Francis Group, New York.