Usability testing is well suited to determining whether users experience any problems while using a product. It is also useful when it is not enough to know that a particular usability issue exists and the scope of that issue must be estimated as well: how widespread it is across different experiences or, more precisely, how many people are likely to encounter a similar issue when using the final product. Performed on a smaller sample, it can help identify the most severe usability issues; performed on a larger sample, it can provide a fine-grained evaluation with reasonable confidence levels (Tullis & Albert, 2013). The method can also reveal the types of errors that occur, how often they occur and exactly where the problem lies, and it can yield a metric for errors repeated by the same user. Usability testing works at a medium contextual level, meaning it can be performed both in a controlled environment and in the users' natural environment to observe and collect data on the interaction between the user and the product, and it can also be conducted remotely with digital tools. This makes the method especially well suited to evaluating online content and how users interact with it. Depending on the end goal of the testing, the method can be performed with different protocols to collect behavioral and attitudinal data.
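The link between sample size and confidence can be made concrete with a confidence interval around a task success rate. The sketch below uses the adjusted Wald interval, one commonly recommended option for binary usability data from small samples; it is a technique chosen for illustration, not necessarily the one used in the cited chapter. The function name and the 4/5 and 40/50 counts are hypothetical.

```python
import math


def adjusted_wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Adjusted Wald confidence interval for a binary task success rate.

    Adds z^2/2 successes and z^2 trials before applying the normal
    approximation, which is generally reported to behave better than the
    plain Wald interval for small samples.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Hypothetical data: the same 80% observed success rate at two sample sizes.
for successes, n in [(4, 5), (40, 50)]:
    low, high = adjusted_wald_interval(successes, n)
    print(f"{successes}/{n}: 95% CI {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

With these numbers the interval for 4 of 5 participants spans roughly 0.36 to 0.98, while 40 of 50 narrows it to roughly 0.67 to 0.89: a small sample can still expose a severe issue, but only a larger one pins down how common it is.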

When usability testing is used to collect quantitative data, it is suited to situations in which the success of the product depends on users completing a task or a set of tasks. In such cases a binary success metric can be applied, scoring each task as either completed or failed. In particular, this protocol helps measure the time spent completing the task and identify where errors (incorrect actions that may lead to task failure) occur, based on the amount of time the user needed to perform the prescribed task (Tullis & Albert, 2013). However, researchers must be careful when assigning tasks: they must clearly define what constitutes success in the given situation, as task success is usually defined as reaching a factually correct or clearly defined state. Without a clearly defined success state the issue becomes harder to evaluate, so task descriptions must be specific rather than vague, which also allows errors to be defined more clearly. If this protocol is combined with self-reports from the user, it can also be interesting to compare perceived success, when the user believes they performed the task successfully, with factual success, when the user actually completed the task as defined (Tullis & Albert, 2013). A clearly defined success state also has a great impact on calculating time on task when the efficiency of the product is measured as well. This metric can matter because, in most cases, a pleasant experience with a product depends on how quickly the desired task can be completed; nevertheless, participants should not be pressured to respond quickly during testing.
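As a rough illustration of binary success scoring and of comparing perceived with factual success, the sketch below assumes each attempt has already been scored against a predefined end state and paired with a yes/no self-report. The `Attempt` record, its field names, the category labels and the five attempts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    participant: str
    completed: bool         # factual success: reached the predefined end state
    believed_success: bool  # perceived success, from the participant's self-report


def success_rate(attempts: list[Attempt]) -> float:
    """Binary task success: share of attempts scored as completed rather than failed."""
    return sum(a.completed for a in attempts) / len(attempts)


def perceived_vs_factual(attempts: list[Attempt]) -> dict[str, int]:
    """Cross-tabulate perceived success against factual success."""
    counts = {
        "succeeded_and_knew_it": 0,
        "believed_success_but_failed": 0,
        "succeeded_without_realizing": 0,
        "failed_and_knew_it": 0,
    }
    for a in attempts:
        if a.completed and a.believed_success:
            counts["succeeded_and_knew_it"] += 1
        elif not a.completed and a.believed_success:
            counts["believed_success_but_failed"] += 1
        elif a.completed:
            counts["succeeded_without_realizing"] += 1
        else:
            counts["failed_and_knew_it"] += 1
    return counts


# Hypothetical results for one task.
attempts = [
    Attempt("P1", completed=True, believed_success=True),
    Attempt("P2", completed=False, believed_success=True),
    Attempt("P3", completed=True, believed_success=True),
    Attempt("P4", completed=False, believed_success=False),
    Attempt("P5", completed=True, believed_success=False),
]
print(success_rate(attempts))  # 0.6
print(perceived_vs_factual(attempts))
```

The "believed success but failed" cell is usually the most interesting one, since those participants would leave the product unaware that their goal was not reached.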

To collect attitudinal data alongside behavioral data, a think-aloud protocol can be added to the testing. This approach is widely used in other types of user research, such as interviewing techniques, as it allows participants to express their thoughts freely and honestly. However, there is evidence that in usability testing the additional cognitive load of concurrent thinking aloud makes participants less successful at their tasks and increases time on task. Additionally, when testing takes place in the participants' natural environment, researchers cannot ensure that the setting is free of interruptions. Finally, participants' comments are likely to drift toward tangential topics or into lengthy exchanges with the moderator, which also increases time on task. Several tactics are therefore suggested to minimize such noise in the data. First, researchers should establish the minimum and maximum amount of time the task can reasonably require. This is especially important when the evaluation takes place remotely. Having these bounds allows the researcher to distinguish participants who made reasonable attempts at the task from outliers: those who completed the task too quickly or took too long. Second, participants can be asked to hold any longer comments until the break between tasks. The same tactic can be used in the laddering interview technique when a participant strays from the main question. Afterwards, participants can even be presented with memory retrieval cues (e.g. recordings of their eye movements or mouse movements) to remind them of a particular action they performed during testing, and be asked to share any further comments or thoughts they had during that action (Tullis & Albert, 2013).
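The minimum/maximum time tactic amounts to a simple range check. The sketch below assumes the bounds have already been set by the researcher (how they are derived, for instance from pilot sessions, is left open here); the function name, the 15 s and 300 s bounds and the participant times are hypothetical.

```python
def classify_times(times_s: dict[str, float], min_s: float, max_s: float) -> dict[str, str]:
    """Label each participant's time on task relative to a plausible range."""
    if min_s >= max_s:
        raise ValueError("min_s must be smaller than max_s")
    labels = {}
    for participant, t in times_s.items():
        if t < min_s:
            labels[participant] = "too_fast"    # may not be a genuine attempt
        elif t > max_s:
            labels[participant] = "too_slow"    # e.g. interrupted or sidetracked
        else:
            labels[participant] = "reasonable"
    return labels


# Hypothetical remote-session times (seconds) and hypothetical bounds.
times = {"P1": 48.0, "P2": 6.5, "P3": 95.2, "P4": 410.0, "P5": 71.3}
labels = classify_times(times, min_s=15.0, max_s=300.0)
kept = [p for p, label in labels.items() if label == "reasonable"]
print(labels)
print(kept)  # ['P1', 'P3', 'P5']
```

Flagging rather than silently dropping outliers keeps the decision visible, so the researcher can still review those sessions before excluding them.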
When interviewing techniques are used, participants' responses can also be recorded and used as cues to start a more in-depth conversation about attributes the user mentioned in their answers that may require further exploration.

When researchers decide whether to include time-on-task data from successful tasks only or from all tasks, they face a trade-off. Including all tasks gives a more accurate reflection of the overall user experience, but the collected data may be highly variable. Including only successful tasks, on the other hand, results in a cleaner measure of efficiency. At the same time, analyzing only successful tasks makes the time data dependent on the task success data, whereas including all tasks keeps time on task an independent measure. A suggested solution is as follows: if the participant always decides when to give up on an unsuccessful task, include all times in the analysis; if the moderator sometimes decides when to end an unsuccessful task, use only the times for successful tasks.
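The inclusion rule can be written as a single branch on who ends unsuccessful tasks. This is a minimal sketch; the `TimedAttempt` record, the `moderator_may_end_tasks` flag and the four attempts are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TimedAttempt:
    success: bool
    time_s: float


def times_for_analysis(attempts: list[TimedAttempt], moderator_may_end_tasks: bool) -> list[float]:
    """Select time-on-task values according to who ends unsuccessful tasks."""
    if moderator_may_end_tasks:
        # Times of failures the moderator cut short reflect the moderator's
        # decision rather than the participant's behavior: keep successes only.
        return [a.time_s for a in attempts if a.success]
    # Participants chose when to give up, so every recorded time is kept.
    return [a.time_s for a in attempts]


# Hypothetical attempts for one task.
attempts = [
    TimedAttempt(True, 60.0),
    TimedAttempt(False, 180.0),
    TimedAttempt(True, 90.0),
    TimedAttempt(True, 75.0),
]
print(mean(times_for_analysis(attempts, moderator_may_end_tasks=False)))  # 101.25
print(mean(times_for_analysis(attempts, moderator_may_end_tasks=True)))   # 75.0
```

The gap between the two means shows why the choice matters: a single long failed attempt pulls the all-tasks average well above the successful-tasks average.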

Reference:

Tullis, T., & Albert, B. (2013). Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics (2nd ed.). Chapter 4: Performance metrics. Morgan Kaufmann.