Notes on optimizing the system

June 28, 2025

This article focuses on optimizing features of the Elo rating system to improve the testing experience and the extraction of information during the calibration of the items' difficulty ratings.

1 Initial Ratings

To avoid cheating by always clicking the green button, we set the initial ratings of the distractors further below the zero line than the initial ratings of the keys sit above it. In other words, the penalty for claiming to recognize a non-word is bigger than the reward for recognizing a real word, so pretending to know every item has a net negative effect on the user's rating, provided the distractors are selected in the same proportion as the keys.
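
As a minimal sketch of why this works, assume standard Elo updates with a K-factor of 32, keys initialized at 0, and distractors at -400 (all values here are illustrative assumptions, not the production settings). An always-"yes" strategy then loses more on each distractor than it gains on each key:

    # Sketch: expected rating drift of a user who always claims to
    # recognize the item. Ratings and K are illustrative assumptions.
    def expected_score(user_rating, item_rating):
        """Standard Elo expectation of the user 'beating' the item."""
        return 1 / (1 + 10 ** ((item_rating - user_rating) / 400))

    K = 32
    user = 0
    key_rating = 0            # assumed initial rating of a real word
    distractor_rating = -400  # assumed initial rating of a non-word

    # Always answering "yes": correct on a key (score 1), wrong on a
    # distractor (score 0). With keys and distractors served in equal
    # proportion, the expected change per pair is negative.
    gain_on_key = K * (1 - expected_score(user, key_rating))               # +16.0
    loss_on_distractor = K * (0 - expected_score(user, distractor_rating)) # ~-29.1
    print(gain_on_key + loss_on_distractor)  # ~-13.1: cheating drifts downward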

2 Difficulty management during calibration

Adaptivity is a great idea in theory, that is, when abstracting away the limits in time and resources. But in practice, when the number of test takers is limited, as well as the number of test sessions they are willing to undergo, the theory hits a painful, solid wall.

2.1 Double paradox of adaptivity

Adaptivity is necessary to achieve a good experience, but the test starts off uncalibrated. First, the bad experience resulting from the initial, uncalibrated test degrades the experience, and thus the calibration needed to make it a good experience cannot go forward. This risks making the test useless. Allowing some randomness in the selection does not help either, because the loss of adaptivity means the loss of the advantages it provides in the first place. The test can thus be neither fully adaptive nor fully random once the set of items becomes too large. Second, as items start being calibrated, the uncalibrated items tend to stay "hidden" in the middle of the difficulty distribution: a truly adaptive system keeps selecting already tested items as long as the user's rating is not exactly at the level of the untested items' initial rating. This slows down calibration and increases the problem seen above.
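
A toy example makes the second point concrete, assuming a greedy nearest-rating selector (an assumption; the actual selector may differ): as soon as any calibrated item sits closer to the user's rating than the shared initial rating of the untested items, the untested items are never picked.

    # Toy illustration (assumed nearest-rating selector; values are made up).
    calibrated = [(-350, "calibrated"), (-120, "calibrated"), (80, "calibrated"),
                  (240, "calibrated"), (510, "calibrated")]
    pool = calibrated + [(0, "untested")]  # untested items share initial rating 0
    user = 150

    # A purely adaptive selector always picks the rating closest to the user's:
    rating, status = min(pool, key=lambda item: abs(item[0] - user))
    print(rating, status)  # 80 calibrated -- the untested item stays hidden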

2.1.1 Solution

We propose to initiate the test with pseudo-random difficulty ratings. We cannot guess how hard a word will be, but based on the double conjecture that the most frequent words are more likely to be recognized, and that words used more frequently tend to be shortened (e.g. schiavo tuo > ciao), we can assign random initial ratings based on the length of the words: the range from which an item's rating is picked is decided by its length. This is what we call pseudo-randomness. But this solution also means that the ratings, at least during the calibration phase, are dependent on the length of the words. The distractors (pseudo-words) therefore have to be produced and shown to the test takers with the same length distribution as the real words, to avoid any cheating.
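
A minimal sketch of this assignment follows; the length bands and rating ranges are illustrative assumptions, not the calibrated values.

    # Length-based pseudo-random initial ratings (bands are assumptions).
    import random

    RATING_RANGES = {               # word length -> (low, high) rating range
        range(1, 4):   (-300, 0),   # very short words: presumed easy
        range(4, 7):   (-150, 150),
        range(7, 10):  (0, 300),
        range(10, 40): (150, 450),  # long words: presumed hard
    }

    def initial_rating(word, rng=random):
        for lengths, (low, high) in RATING_RANGES.items():
            if len(word) in lengths:
                return rng.uniform(low, high)
        raise ValueError(f"no rating range for length {len(word)}")

    print(initial_rating("ciao"))           # drawn from (-150, 150)
    print(initial_rating("indubbiamente"))  # drawn from (150, 450)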

2.2 Distractors Normalization

In order to deter cheating by systematically recognizing every item in the hope that the rewards of the good answers compensate the penalties of the mistakes, the distractors' (non-words') initial ratings are set by the same algorithm as defined earlier, with this difference that a negative value is added. This leftward shift of the values ensures that the penalty for recognizing a non-word is stronger than the reward for recognizing a real word (note that not recognizing a non-word is not rewarded; only recognizing real words is). The only way to increase one's rating is to recognize the keys (real words). But in order for this system to work, the difference between the mean rating of the distractors and the mean rating of the keys must be added to the rating of the test taker whenever a non-word is selected. If the difference equals -400, we add -400 to the current rating of the user every time a non-word is about to be selected. As the value of this difference is expected to evolve as the items get calibrated (and also depends on the items shortlisted, see below), it is dynamically recomputed before each testing session.
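
A sketch of the offset computation, with names and values that are ours for illustration only:

    # Dynamic offset between distractor and key ratings, recomputed
    # before each session. All names and values are illustrative.
    from statistics import mean

    def distractor_offset(key_ratings, distractor_ratings):
        """Mean distractor rating minus mean key rating (negative by design)."""
        return mean(distractor_ratings) - mean(key_ratings)

    keys = [0.0, 150.0, -100.0, 250.0]
    distractors = [-400.0, -250.0, -500.0, -450.0]
    offset = distractor_offset(keys, distractors)  # -475.0 in this toy pool

    user_rating = 120.0
    # When the next item is a non-word, selection (and the Elo update)
    # works with the shifted rating so distractors are matched fairly.
    effective_rating = user_rating + offset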

3 Items Shortlisting

For performance reasons, a shortlist is preselected from the complete list of items. If the total list contains, say, around 60 000 real words and as many non-words, we want to avoid navigating the complete list twice for every round (as soon as a new item is shown to the subject, the next items are pre-selected twice, once in case of a good answer and once in case of a wrong answer, in order to limit the latency between rounds). Recalculating so many items in a browser can prove challenging on computers, let alone cell phones. This is why a shortlisting is performed before the test starts, reducing the selectable items to "only" a couple of thousand, chosen at random so that all items get an equal chance of being assessed.
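
A sketch of the double pre-selection follows; only the "pick twice per round" structure comes from the description above, while the nearest-rating selector and the K-factor are assumptions.

    # Pre-select the next item for both outcomes of the current round,
    # hiding selection latency. Selector and K are assumptions.
    def expected_score(user, item):
        return 1 / (1 + 10 ** ((item - user) / 400))

    def next_item(candidates, user_rating):
        return min(candidates, key=lambda r: abs(r - user_rating))

    def preselect_both(shortlist, user, current_item, k=32):
        candidates = [r for r in shortlist if r != current_item]
        e = expected_score(user, current_item)
        return (next_item(candidates, user + k * (1 - e)),  # if answer is right
                next_item(candidates, user + k * (0 - e)))  # if answer is wrong

    shortlist = [80, 95, 105, 115, 125, 140]  # illustrative ratings only
    on_right, on_wrong = preselect_both(shortlist, user=100, current_item=105)
    # Both candidates are ready before the user answers the current item.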

3.1 Initial algorithm

Because the initial consideration was the front-end performance of the software, the only concern was the number of items to select from, and a random selection seemed sufficient. But as test sessions started influencing the ratings of the items, this random shortlisting proved problematic: the number of assessed items being initially very low compared to the total number of items, a random selection would pick them far less often than the unassessed items. How harmful this bias towards untested items is depends on the expected ability of a random test taker vis-a-vis the average level of the items. As stated above (2), beginners and even intermediates would not be able to gain meaningful feedback from a random selection, yet they are expected to be the primary beneficiaries of such a vocabulary test. There is thus a need for an improved shortlisting algorithm.

3.2 Select by rating

From an information-theory perspective, the potential information gained from a test session depends on the ratings of the items: if the entropy of the items' rating distribution is higher, more information will be gathered, and faster. In other words, a test session with several items of the same rating will bring less information than one whose items span diverse ratings, and selecting items with the most common ratings is precisely what a random selection tends to do. If this formulation seems convoluted, it says the same, in substance, as the previous paragraph, but it also explains how to better select items: the number of items for a given rating should be limited during the preselection, after which, only if hardware limitations require it, one could allow a random selection from this more rating-wise diverse pool of items.
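
A minimal sketch of such a preselection, where the bin width and per-bin cap are assumptions chosen for illustration:

    # Entropy-friendly shortlist: cap how many items share (roughly)
    # the same rating. Bin width and cap are illustrative assumptions.
    import random
    from collections import defaultdict

    def diverse_shortlist(items, bin_width=50, per_bin=20, rng=random):
        """items: list of (item_id, rating). Returns a rating-diverse subset."""
        bins = defaultdict(list)
        for item_id, rating in items:
            bins[round(rating / bin_width)].append((item_id, rating))
        shortlist = []
        for bucket in bins.values():
            rng.shuffle(bucket)
            shortlist.extend(bucket[:per_bin])  # keep at most per_bin per bin
        return shortlist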

3.3 Conclusion

If in 2 we’ve seen how initial ratings must randomly set to somewat spreaded ranges of values, 3 informs us that this spread must be limited, so that unassessed items will tend to have redundant ratings in order to have them selected less often because this redundance is synonymous with low information.