Imagine that you want to compare two products A and B and you ask the opinions of 100 users via a survey. The table below shows a summary of the survey and the responses. The numbers under product A and product B show the number of people who gave each of the responses on the left-hand side.
This is known as a Likert scale and this post will give some thoughts on how to analyse these data.
The first thing worth mentioning is that there is a simple form of analysis that is relatively uncontentious. This is to say that 60% of people were very satisfied or quite satisfied with product A whereas only 45% of people were very satisfied or quite satisfied with product B. This much is simple. However, can we use this analysis to say that product A is better than product B? Note one problem straight away: 20% of people were very dissatisfied or quite dissatisfied with product A whereas only 15% of people were very dissatisfied or quite dissatisfied with product B. It seems that product A tends to polarise opinion and it is not clear what conclusions can be drawn.
However, quite often we assign numbers to the categories (such as 5 = very satisfied, 4 = quite satisfied, 3 = neutral, 2 = quite dissatisfied, and 1 = very dissatisfied). When this is done we have a number for each participant’s response, and we can average these numbers to produce the mean values shown in the table above. According to this we can say that on average the response to product A is 3.6 and to product B is 3.5. Can we now use these numbers to make the following two statements? (1) Product A is better than product B (since 3.6 is bigger than 3.5). (2) Both products A and B are well received by the participants (since 3.6 and 3.5 are both bigger than 3). What I want to do in this post is discuss the validity of these statements by considering several aspects of Likert scales.
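As a minimal sketch of the arithmetic described so far, the snippet below computes the satisfied and dissatisfied percentages and the mean score from a table of response counts. The counts are hypothetical: they are not the survey table itself, and were chosen only because they reproduce the figures quoted above (60% and 20% with a mean of 3.6 for product A; 45% and 15% with a mean of 3.5 for product B). The names `SCORES`, `counts` and `summarise` are likewise just illustrative.

```python
# Scores assigned to each category, as described in the text.
SCORES = {
    "very satisfied": 5,
    "quite satisfied": 4,
    "neutral": 3,
    "quite dissatisfied": 2,
    "very dissatisfied": 1,
}

# HYPOTHETICAL counts (not the original survey data), picked so that they
# match the percentages and means quoted in the text.
counts = {
    "A": {"very satisfied": 30, "quite satisfied": 30, "neutral": 20,
          "quite dissatisfied": 10, "very dissatisfied": 10},
    "B": {"very satisfied": 25, "quite satisfied": 20, "neutral": 40,
          "quite dissatisfied": 10, "very dissatisfied": 5},
}


def summarise(product_counts):
    """Return percentage satisfied, percentage dissatisfied and the mean score."""
    n = sum(product_counts.values())
    satisfied = product_counts["very satisfied"] + product_counts["quite satisfied"]
    dissatisfied = (product_counts["very dissatisfied"]
                    + product_counts["quite dissatisfied"])
    mean = sum(SCORES[label] * k for label, k in product_counts.items()) / n
    return {
        "satisfied_pct": 100 * satisfied / n,
        "dissatisfied_pct": 100 * dissatisfied / n,
        "mean": mean,
    }


for product, product_counts in counts.items():
    print(product, summarise(product_counts))
```

Many different tables of counts would give the same percentages and means, which is one reason why these summaries alone do not settle the comparison.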
Is it valid to average the numbers?
There is a long-running dispute about whether it is valid to average the scores to produce the mean values as in the table above. To explore this we need to introduce two types of data. The first type is called ordinal data; it records only the order in which things come. The Likert scale presented in the table above strictly produces ordinal or rank data. Imagine that three people, Alan, Brian and Clive, run a race in which Alan wins, Brian is second, and Clive is third. Knowing the order in which they finished is fine, but it doesn’t tell us whether Alan finished well ahead of the other two or whether, for example, Alan and Brian were involved in a close finish with Clive a long way behind. If, however, we know how many seconds they took to complete the race (Alan = 40 seconds, Brian = 41 seconds, and Clive = 52 seconds) we have much more information about the race: it turns out that Clive was a long way behind the other two. The race times, in seconds, are an example of the second type, interval data. With interval data the differences between the numbers are meaningful whereas with ordinal (rank) data they are not.
The problem with a Likert scale is that the scale (very satisfied, quite satisfied, neutral, quite dissatisfied, very dissatisfied, for example) produces ordinal data. We know that very satisfied is better than quite satisfied and quite satisfied is better than neutral, but is the difference between very satisfied and quite satisfied the same as the difference between quite satisfied and neutral? Why am I worrying about this? Because when we assign numbers to the scale (the 1-5 numbers) and then average the responses we are implicitly assuming that the scale items are evenly spaced. We are treating the ordinal data as interval data. How can we be sure that the participants treated the scale in this way? Would it have made a difference if we had used satisfied and dissatisfied instead of quite satisfied and quite dissatisfied respectively?
So it would seem that it is wrong to calculate means from Likert scales. A post by Achilleas Kostoulas, a PhD student at the University of Manchester, states categorically that it is wrong to compute means from Likert scale data. I choose this example because it is simply and elegantly explained, not because I necessarily agree entirely with his view. It is also worth reading the article by Elaine Allen and Christopher Seaman in Quality Progress (2007), who also take the view that Likert scale data should not be treated as interval data. Interestingly they also suggest some other techniques that don’t suffer from the ‘ordinal-data’ problem; for example, using slider bars to get a response on a continuous scale. However, before you give up detailed analyses of Likert scale data I would urge you to read the paper by Susan Jamieson called "Likert scales: how to (ab)use them" in Medical Education (2004: 38, 1212-1218). Although Susan is also, broadly speaking, against treating Likert scale data as interval data, she does present the other side of the argument.
In another paper, in Advances in Health Sciences Education, Norman (2010, 15 (5), 625-632) argues that the concerns about Likert scales are not serious and we should happily use means and other parametric statistics.
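To make the even-spacing assumption concrete, here is a small sketch that scores the same responses in two ways: with the usual 1-5 coding, and with an alternative coding in which the two "satisfied" categories sit closer together. Both codings respect the order of the categories. The counts are hypothetical (chosen only to reproduce the means of 3.6 and 3.5 quoted earlier), and the alternative spacing is an arbitrary illustration, not something measured from participants.

```python
from statistics import median

# HYPOTHETICAL counts, ordered from very dissatisfied to very satisfied.
counts_a = [10, 10, 20, 30, 30]
counts_b = [5, 10, 40, 20, 25]

equal_spacing = [1, 2, 3, 4, 5]
# Arbitrary alternative coding: same order, but the top categories are compressed.
compressed_top = [1, 2, 3, 3.25, 3.5]


def weighted_mean(counts, scores):
    """Mean score when counts[i] people gave the response scored scores[i]."""
    return sum(c * s for c, s in zip(counts, scores)) / sum(counts)


def median_rank(counts):
    """Median response expressed as a rank (1 = lowest category)."""
    ranks = [rank for rank, c in enumerate(counts, start=1) for _ in range(c)]
    return median(ranks)


for name, scores in [("equal", equal_spacing), ("compressed", compressed_top)]:
    print(name, weighted_mean(counts_a, scores), weighted_mean(counts_b, scores))

print("median ranks:", median_rank(counts_a), median_rank(counts_b))
```

With these made-up numbers, product A has the higher mean under equal spacing but the lower mean under the compressed coding, whereas the median category (quite satisfied for A, neutral for B) depends only on the order of the categories. This does not show that means are always misleading, only that a comparison of means can hinge on a spacing that the scale itself does not guarantee.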
How different do two averages need to be to show an effect?
In the table at the start of this article products A and B receive scores of 3.6 and 3.5 respectively. The paragraphs above explain that calculating these means may not be valid. However, assuming that we do calculate means in this way, how different would the mean scores for products A and B need to be for us to conclude that A was better than B? I have come across students (normally in vivas) who would simply state that A is better than B because 3.6 > 3.5. I would then ask those students whether they would still take that view if instead of 3.6 and 3.5 it was 3.51 and 3.5. What if it was 3.50001 and 3.5? Would they still maintain that A is better than B? It is clear that we need to consider variance and noise and carry out a proper statistical test to conclude whether 3.6 is significantly greater than 3.5. The test is called Student's t-test and anyone can be taught to perform one using Microsoft Excel in a matter of minutes. In the example at the start of this article it turns out that there is no statistically significant difference. We cannot conclude that product A is received better than product B.
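For readers who would rather see the calculation than a spreadsheet, the sketch below computes a two-sample t statistic using only the Python standard library. Several things here are assumptions rather than facts about the survey: the counts are hypothetical (chosen to give means of 3.6 and 3.5), the two sets of responses are treated as independent groups, and the p-value uses a normal approximation to the t distribution, which is reasonable with around 200 observations but is not exact. If the same 100 people rated both products, a paired test on the individual responses would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def expand(counts):
    """Turn counts for scores 1..5 into a flat list of individual scores."""
    return [score for score, c in enumerate(counts, start=1) for _ in range(c)]


# HYPOTHETICAL counts, ordered from score 1 (very dissatisfied) to 5.
a = expand([10, 10, 20, 30, 30])   # mean 3.6
b = expand([5, 10, 40, 20, 25])    # mean 3.5


def two_sample_t(x, y):
    """t statistic for the difference in means (unequal-variance form).

    With equal group sizes, as here, this equals the pooled-variance
    Student statistic.
    """
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


t = two_sample_t(a, b)
# Two-sided p-value via a normal approximation to the t distribution.
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
print(f"t = {t:.2f}, approximate two-sided p = {p_approx:.2f}")
```

With these made-up counts the t statistic is well below 2, so the difference between 3.6 and 3.5 is not statistically significant at the conventional 5% level, in line with the conclusion above.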
However, can we conclude that both products are received favourably? Again, we need a statistical test. It turns out that in this case both 3.6 and 3.5 are statistically significantly greater than 3, so we can at least conclude that products A and B are received favourably. However, there is the caveat that this assumes that we can treat the Likert scale data as interval data in the first place.
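The comparison against the neutral score of 3 can be sketched in the same way with a one-sample t statistic. As before, the counts are hypothetical (chosen only to give means of 3.6 and 3.5), and the one-sided p-value uses a normal approximation rather than the exact t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def expand(counts):
    """Turn counts for scores 1..5 into a flat list of individual scores."""
    return [score for score, c in enumerate(counts, start=1) for _ in range(c)]


# HYPOTHETICAL counts, ordered from score 1 (very dissatisfied) to 5.
a = expand([10, 10, 20, 30, 30])   # mean 3.6
b = expand([5, 10, 40, 20, 25])    # mean 3.5

NEUTRAL = 3  # score assigned to the neutral category


def one_sample_t(x, mu0):
    """t statistic for testing whether the mean of x differs from mu0."""
    return (mean(x) - mu0) / (stdev(x) / sqrt(len(x)))


for name, sample in [("A", a), ("B", b)]:
    t = one_sample_t(sample, NEUTRAL)
    # One-sided p-value (mean greater than neutral), normal approximation.
    p_approx = 1 - NormalDist().cdf(t)
    print(f"product {name}: t = {t:.2f}, approximate one-sided p = {p_approx:.6f}")
```

With these counts both t statistics are above 4, so both means are significantly greater than 3, subject to the caveat about treating the data as interval data.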
An interesting question is whether we should use 5-point scales at all. Would we get different results if we used a 7-, 9- or 11-point scale? I have found one website that suggests that a 7-point scale is better than a 5-point scale but not by much. A paper by Dawes in International Journal of Market Research (2008: 55 (1)) looked at 5-, 7- and 10-point scales and concluded that the results from a 10-point scale would be different from a 5- or 7-point scale (after suitable normalisation).
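Comparing scales with different numbers of points requires putting the responses on a common range first. One simple way to do this, sketched below, is a linear rescaling of a response on a 1-to-k scale; this is an illustrative choice and not necessarily the exact normalisation used in the paper, and `rescale` is a hypothetical helper.

```python
def rescale(response, points, lo=0.0, hi=1.0):
    """Linearly map a response on a 1..points scale onto the range [lo, hi]."""
    if points < 2:
        raise ValueError("a scale needs at least two points")
    if not 1 <= response <= points:
        raise ValueError("response must lie between 1 and the number of points")
    return lo + (response - 1) * (hi - lo) / (points - 1)


# The midpoints of a 5-point and a 7-point scale map to the same value.
print(rescale(3, 5), rescale(4, 7))

# Expressing 5-point responses on a 1-10 range.
print([rescale(r, 5, lo=1, hi=10) for r in range(1, 6)])
```

Note that a linear rescaling again treats the scale points as evenly spaced, so it inherits the ordinal-versus-interval concern discussed earlier.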
Odd-number scales (with a neutral point) are almost always used. However, a paper by Garland (Marketing Bulletin, 1991: 2, 66-70) suggests that using a four-point scale (and removing the neutral point) might remove the social desirability bias that comes from respondents wanting to please the interviewer. I am not sure what current thinking is on this matter though, and I would normally use odd-number scales.
I am not providing any definitive views on these points but rather raising awareness of issues. If you want to use a Likert scale then these are issues you need to familiarise yourself with.
I will confess to having treated Likert scale data as interval data and carried out parametric statistics (these are statistics that assume the data follow a distribution described by parameters such as the mean and standard deviation). However, deep down I know it is wrong. I am coming to the view that the best thing is not to use a Likert scale at all. I think people often use this sort of scale because it seems simple. There are ways to statistically analyse data like these and I would refer readers to categorical judgement, which is a well-used psychophysical technique. My colleague Ronnier Luo at Leeds University has used this technique extensively for decades. However, it is far from simple to analyse the results. I think there are better ways of obtaining information. I think using slider bars, and allowing users to indicate with the slider their view between two extremes (e.g. between very satisfied and very dissatisfied), is probably better, and I will encourage my students to use this technique in the future.