Incomplete pair comparison

One of my big academic interests is scaling perceptual phenomena. That is, we take some physical stimuli (for example, a set of sounds of varying intensity/volume) and then we want to know how loud they are perceived to be by people. This allows us to build a relationship between the physical stimulus (in this case intensity) and the perceptual response (in this case loudness). The same idea could be used to scale largeness, smallness, colourfulness, whiteness, lightness, heaviness, sweetness etc. It’s not always a -ness. But it usually is.

There are a great many techniques to scale perception. You can simply ask people to assign a number: for example, you play a sound and ask them to rate how loud it is on a scale from, say, 0 to 100. This is called Magnitude Estimation (ME). It’s a perfectly good technique but it has limitations, one of which is that it can be quite difficult for the participant. Suppose the first stimulus seems really loud and they assign it a loudness of 90, and it then turns out that all the subsequent stimuli are louder; all their later estimates will be squeezed into the 90-100 range, which is not ideal. Consequently, in the ME technique we often use so-called anchors, that is, example stimuli at each end of the scale.

An alternative technique is called paired comparison (PC). In this we might have, for example, five stimuli A, B, C, D and E and we present them in pairs and ask the participants which one is louder (or whiter or yellower etc.). The total number of paired comparisons is 10 in this case (5 × 4 / 2), which is quite manageable. From the results of these paired comparisons it is possible to estimate a scale value for each of the stimuli, where the scale values form an interval scale of loudness (or whiteness or yellowness, etc.). This is a really nice technique and there are quite a few papers that claim that PC is more reliable than ME. However, when the number of stimuli is large the number of pair comparisons becomes huge and the task is not practicable. When this happens it is possible to undertake so-called incomplete pair comparison, where we only present some of the possible pairs to the participants. The question is, however, what proportion of the pairs should be presented for the PC experiment to be reliable?

This was the question that Yuan Li and I asked each other during her doctoral research. We undertook a large-scale simulation of a PC experiment. I won’t go into the details here. The method and results have just been published in the Journal of Imaging Science and Technology (JIST). You can see the paper here.

However, I show below the key table from the research which I think might be of interest to other people who are undertaking, or planning to undertake, an incomplete PC experiment.


This table shows the number of stimuli that are being compared along the top. Down the left-hand side is the number of observers taking part. The figure in the corresponding row and column shows the per cent of pair comparisons that need to be carried out to get robust results that would be similar to those you would get if you did the full PC experiment. So, for example, if you have 20 samples and 15 participants then you need to carry out half of the possible comparisons. For 20 samples there are 190 comparisons (20 × 19 / 2), so you would need to carry out 95 of them (which could be selected randomly).
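
For anyone who wants to set this up in practice, here is a minimal Python sketch of the arithmetic above: it lists every possible pair and then draws a random subset of a given percentage. The function names `all_pairs` and `sample_pairs` are my own for illustration and are not taken from the paper. I also assume a plain uniform random draw from the full list of pairs; whether every observer should see the same subset or a fresh one is not something this sketch decides.

```python
import itertools
import math
import random


def all_pairs(n_stimuli):
    """Return every unordered pair of stimulus indices: n(n-1)/2 pairs."""
    return list(itertools.combinations(range(n_stimuli), 2))


def sample_pairs(n_stimuli, percent, seed=None):
    """Randomly choose `percent` per cent of all possible pairs.

    Hypothetical helper: a uniform draw without replacement.
    """
    pairs = all_pairs(n_stimuli)
    # Round up so we never present fewer pairs than the required share.
    n_keep = math.ceil(len(pairs) * percent / 100)
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(pairs, n_keep)


print(len(all_pairs(5)))    # 10 pairs for stimuli A to E
print(len(all_pairs(20)))   # 190 pairs for 20 samples

subset = sample_pairs(20, 50, seed=1)
print(len(subset))          # 95 pairs, i.e. half of 190
```

Note that a purely random draw like this does not guarantee that every stimulus appears in the same number of pairs.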

I should point out that there is a caveat that needs to be considered. This work is only valid if the observers can be considered to be stochastically identical. If we ask people to rate samples for loudness, or whiteness, or heaviness, for example, I think this assumption is justified. However, if we were asking people to scale how beautiful people’s faces were (an experiment reminiscent of the early Facebook experiment by Mark Zuckerberg), then observers could differ wildly in their judgements. One participant may rate as most beautiful a face that another participant rates as the least beautiful. Because of the assumptions that we made in our modelling we cannot predict the proportion of pair comparisons that would be needed in a case like this. We are thinking about it though.