An anecdotal critique of ClearerThinking's calibration points system. Part II
Thoughts after a conversation with the creator of ClearerThinking
Previous posts in series
Introduction
In the first post, I described why I did not like the points system in ClearerThinking’s calibrate your judgement tool, and shared it on Twitter, tagging the creator Spencer Greenberg.
Spencer Greenberg then offered to have a quick chat to discuss my points. I describe (my understanding of) his responses to my points, and end with my own thoughts.
Spencer’s responses to my points
It rewards factual knowledge, rather than one’s accurate judgement of their confidence.
Spencer said this is intentional. A good points system ought to reward both one's calibration and one's knowledge.
Without rewarding knowledge, one could achieve high scores by constructing trivially optimal answers that correspond to zero knowledge. For example, to be calibrated at the 50% level on True/False questions, you can just flip a coin.
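A minimal simulation sketches why the coin-flip strategy is perfectly calibrated despite zero knowledge (the question set and scoring here are made up for illustration, not ClearerThinking's actual implementation):

```python
import random

random.seed(0)

# Answer True/False questions by flipping a coin, always stating
# 50% confidence. With zero knowledge, the coin is right about half
# the time, so the stated confidence matches the observed hit rate.
n_questions = 10_000
hits = 0
for _ in range(n_questions):
    truth = random.choice([True, False])   # the unknown correct answer
    guess = random.choice([True, False])   # coin flip, zero knowledge
    hits += (guess == truth)

hit_rate = hits / n_questions
print(f"Stated confidence: 0.50, observed hit rate: {hit_rate:.3f}")
```

A score based purely on calibration would rate this strategy highly, which is exactly the gaming concern.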
The idea of a good points system is that there are two ways to improve your score: improve your calibration or improve your knowledge. Given that most users are unlikely to improve their knowledge of the questions in the tool, this means that improving one's score corresponds to improving one's calibration.
If you are loss averse (like most people are) and hence want to avoid negative points, you are incentivised to make your intervals larger. Alternatively, if you are a risk-taker, then you are incentivised to make your intervals smaller.
Spencer said that the greatest bias present is that people are over-confident, so this issue of risk attitudes is minor or negligible.
It is not clear if the points are correlated with the goal of being well calibrated. Am I trying to maximise the score or aim for a total of zero?
See response to first point.
I do not know how to interpret the average points per question. Is 2 out of 10 good? Bad? Normal?
Spencer said this is a limitation of the current set-up. He mentioned that there are systems that can separate the score into the contribution from calibration and the contribution from knowledge. He (or I) also suggested that your average score after your first, say, 100 questions can become a benchmark for yourself.
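One standard way to do such a separation (I do not know what system Spencer had in mind; this is just one well-known option) is the Murphy decomposition of the Brier score, which splits the score into a reliability term (calibration) and a resolution term (roughly, knowledge), plus an irreducible uncertainty term. A sketch:

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition of the Brier score.

    forecasts: stated probabilities that the answer is True;
    outcomes: 1 if the answer was True, 0 if False.
    Returns (reliability, resolution, uncertainty); the Brier score
    equals reliability - resolution + uncertainty. Lower reliability
    means better calibration; higher resolution means more knowledge.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    # Group outcomes by the stated probability.
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Toy example: six answers with stated confidences and outcomes.
forecasts = [0.9, 0.9, 0.6, 0.6, 0.6, 0.1]
outcomes = [1, 1, 1, 0, 1, 0]
rel, res, unc = brier_decomposition(forecasts, outcomes)
```

Reporting the reliability and resolution terms separately would let a user see whether a poor score comes from miscalibration or from lack of knowledge.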
The main benefit of [removing points altogether] is that you can focus on what is important, which is being well-calibrated. And the chart provided seems to be the best way to visualise that.
Spencer said that the big issue with that is that feedback is slow. Points give immediate (though admittedly noisy) feedback.
To end, Spencer said that they do think the current set-up is not ideal, but it is not clear how best to improve it. Also, the project is co-owned with another organisation, so it would be effortful to change it or to try new things.
My thoughts
I am not surprised that Spencer has clear reasons for why things were done the way they are. However, my overall opinion is still that the points are not helpful (at least for confidence intervals, which is the only task I have experience with).
First, the risk of people gaming the system if you use a score based solely on calibration seems small to me.
The kind of people who choose to use the tool want to improve themselves, and probably have enough intrinsic motivation to do the exercises in the intended spirit, as opposed to gaming the system.
It seems hard to game the system anyway. What counts as zero knowledge when the question is about pop music? Furthermore, somebody who does manage to maximise the score by 'gaming the system' is likely well calibrated.
To reduce the risk of the system being gamed, one can increase the range of styles of questions, with larger range of orders of magnitude. Some possibilities: multiplying two 3-digit numbers (or 2-digit or 4-digit or a mixture…), population of various countries/cities, distances between countries/cities, facts about world records, number of words in certain document/book, facts about resource consumption (e.g. average family/country/organisation use of X per unit time) etc. I realise some of these facts are time dependent, but I am sure there are ways around that.
Second, the current points system is not ideal.
It is hard to interpret. What is a good score or a bad score?
The score does not provide any diagnosis. Am I badly calibrated at all probabilities, just at 90%, or in the middle percentages? At the end of the day, I still have to look at the chart to work out what I need to change.
Based on all this, here are some suggestions.
Separate the points into a contribution from knowledge and a contribution from calibration.
Change the verbal feedback. Instead of relying on the user to correctly interpret the score/feedback, say what you want the user to take away from the score. E.g. '-10. Seems you were overconfident and your interval was too small' or '-0.1. Seems like your interval was well chosen given your lack of knowledge' or '9.5. Well done! You were correct to be highly confident.'
Remove the points but keep some form of verbal feedback.
Every 10 answers, show the user the calibration chart with a short comment. E.g. 'Looks like you are over-confident, but we need more data. Keep going.'
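The kind of automated comment I have in mind could be generated very simply. A sketch (the function name, threshold, and wording are mine, purely illustrative):

```python
def calibration_comment(forecasts, outcomes, threshold=0.05):
    """Compare average stated confidence with the observed hit rate
    and return a short verbal comment instead of a numeric score.

    forecasts: stated probabilities of being correct;
    outcomes: 1 if the answer was correct, 0 otherwise.
    """
    n = len(forecasts)
    if n < 10:
        return "Need more data. Keep going."
    avg_conf = sum(forecasts) / n
    hit_rate = sum(outcomes) / n
    gap = avg_conf - hit_rate
    if gap > threshold:
        return "Looks like you are over-confident."
    if gap < -threshold:
        return "Looks like you are under-confident."
    return "Your confidence matches your hit rate. Well done!"
```

For example, a user who states 90% confidence but is right only half the time would be told they look over-confident, without having to interpret any points.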
To end, my overall opinion is that a simple measurement that is easy to interpret (but which may have theoretical flaws) is fine. I presume that the main beneficiaries of the tool are people who have minimal experience of probabilistic thinking and/or who are woefully uncalibrated.