What is Mean Opinion Score?

Mean Opinion Score (MOS) is a quantitative assessment of the overall quality of an event or experience based on human judgment. (This story was originally posted on bettermos.com.)

Akash Raj
4 min read · Apr 15, 2023

In today’s digital age, with the increasing reliance on communication services like voice calls, video streams, and virtual meetings, ensuring high-quality audio and video experiences has become paramount. But how do we measure the perceived quality of these multimedia services in a consistent, comparable way? Enter Mean Opinion Score (MOS), a widely used and standardized method that provides insights into how humans perceive the quality of audio and visual content.

MOS is a subjective scoring system that has been widely adopted in the field of telecommunications and multimedia to assess the quality of communication services. It relies on human perception and judgment to evaluate the perceived quality of audio and video signals. MOS has become an essential tool for researchers, engineers, and service providers to quantify and benchmark the performance of multimedia systems, codecs, and network protocols.

  1. Many fields. MOS is used in telecommunications to evaluate voice calls, video conferences, and other multimedia communication services, ensuring they meet quality standards.
  2. Optimization. MOS is important in multimedia development for assessing the impact of coding techniques, transmission protocols, and system optimizations.
  3. Human in the loop. MOS provides a human-centric evaluation of audio and video quality, capturing subjective factors that automated metrics may miss, making it a valuable tool for quality assessment.

In this article, we will explore the concept, significance, calculation, factors, scales, limitations, and applications of MOS, offering a comprehensive understanding of its importance in today’s digital landscape.

How is mean opinion score determined?

The mathematical basis for MOS calculation involves taking the average of the subjective ratings given by a group of human evaluators. The evaluators are usually asked to rate the audio or video signal on a scale of 1 to 5 based on its perceived quality. The MOS is then calculated as the arithmetic mean of these ratings.

“MOS serves as a metascore that is calculated by averaging the scores of various individual components that collectively assess the quality of a session.”

In practice, the evaluators are carefully selected and follow specific evaluation guidelines. Their scores are compiled and averaged to obtain the final MOS, which represents the overall perceived quality of the multimedia service being evaluated.
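The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a standardized procedure: the function name `mean_opinion_score` and the sample ratings are made up for the example, and the 95% confidence interval uses a simple normal approximation, which is an assumption added here because a mean alone hides how much the evaluators disagree.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of ratings on the 1-5 scale.

    half_width is a normal-approximation confidence interval
    (z=1.96 corresponds to roughly 95%); this is an assumption of the sketch.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")

    mos = statistics.fmean(ratings)  # arithmetic mean of the ratings
    if len(ratings) < 2:
        return mos, float("nan")  # spread is undefined for one rating

    # Standard error of the mean, scaled by z.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight evaluators for one audio clip.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 4.00 +/- 0.52
```

With few evaluators the interval is wide, which is one reason subjective tests need a sufficiently large panel.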

The Absolute Category Rating (ACR) scale is widely used in MOS evaluations.

Several types of listening tests that were commonly used in telephony in the 1990s were standardized in ITU-T P.800 [1]. Subjective listening tests are generally regarded as the most reliable way of assessing audio quality. The ACR scale ranges from 5 to 1, with corresponding levels of Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1).

Human evaluators tend to avoid perfect ratings in MOS evaluations, so an average of 5 is rarely reached in practice. An excellent quality target is typically considered to be around 4.3–4.5 on the MOS scale. On the lower end, call or video quality is generally considered unacceptable when the MOS drops below approximately 3.5.
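As a rough illustration, these rules of thumb can be turned into a small helper. The function name `quality_band`, the label "acceptable" for the middle range, and the exact cut-off values are assumptions made for this sketch; the text above only gives approximate thresholds.

```python
def quality_band(mos, excellent_from=4.3, unacceptable_below=3.5):
    """Map a MOS value to a coarse quality band.

    The default thresholds mirror the approximate values quoted in the
    article; the band names are hypothetical labels for this example.
    """
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie between 1 and 5")
    if mos >= excellent_from:
        return "excellent"
    if mos < unacceptable_below:
        return "unacceptable"
    return "acceptable"


for score in (4.4, 3.9, 3.2):
    print(score, quality_band(score))
```

Since the thresholds are approximate, they are exposed as parameters so they can be tuned per service.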

crowdMOS

“… propose a cost-effective and convenient measure called crowdMOS, obtained by having internet users participate in a MOS-like listening study. Workers listen and rate sentences at their leisure, using their own hardware, in an environment of their choice. Since these individuals cannot be supervised, we propose methods for detecting and discarding inaccurate scores …” [2]

Typically, subjective quality measures require that a) there are enough listeners of sufficient diversity to deliver statistically significant results, b) experiments are conducted in a controlled environment with specific equipment, and c) every subject receives the same stimulus and instructions [3, 4].

In crowdMOS, the authors propose a class of listening tests obtained by relaxing requirement b). Instead of running a controlled experiment, the test is outsourced to workers from an internet crowd. Because of this relaxation, such a study can easily draw on a larger and more diverse pool of listeners than a typical MOS test.
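Because crowd workers are unsupervised, unreliable ratings need to be filtered before averaging. The sketch below shows one plausible screening heuristic, not necessarily the exact procedure from the crowdMOS paper: a worker is kept only if their ratings correlate reasonably well with the average of everyone else's ratings on the same clips. The function names, the data layout, and the 0.25 threshold are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def screen_workers(scores, min_corr=0.25):
    """Keep workers whose ratings agree with the rest of the crowd.

    scores maps worker id -> {clip id: rating}. min_corr is an
    illustrative threshold, not a recommended value.
    """
    kept = {}
    for worker, own in scores.items():
        xs, ys = [], []
        for clip, rating in own.items():
            # Ratings of the same clip by all other workers.
            others = [s[clip] for w, s in scores.items()
                      if w != worker and clip in s]
            if others:
                xs.append(rating)
                ys.append(sum(others) / len(others))
        if len(xs) >= 2 and pearson(xs, ys) >= min_corr:
            kept[worker] = own
    return kept


def per_clip_mos(scores):
    """Average the remaining ratings per clip."""
    by_clip = {}
    for own in scores.values():
        for clip, rating in own.items():
            by_clip.setdefault(clip, []).append(rating)
    return {clip: sum(r) / len(r) for clip, r in by_clip.items()}


# Hypothetical data: worker w4 rates the clips in the opposite order.
scores = {
    "w1": {"a": 5, "b": 4, "c": 2, "d": 1},
    "w2": {"a": 4, "b": 4, "c": 2, "d": 2},
    "w3": {"a": 5, "b": 3, "c": 3, "d": 1},
    "w4": {"a": 1, "b": 2, "c": 5, "d": 5},
}
kept = screen_workers(scores)
print(sorted(kept))        # ['w1', 'w2', 'w3']
print(per_clip_mos(kept))
```

A single screening pass like this is deliberately simple; a real study would likely combine it with other checks, such as control clips with known quality.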

Many companies offer tools and platforms to facilitate or mediate the process of crowdsourcing for MOS evaluations. One of the most well-known platforms for crowdsourcing is Amazon Mechanical Turk. While these solutions exist, they are not always pleasant to work with: they involve a lot of time spent on repetitive tasks. That’s where BetterMOS comes in. BetterMOS offers a platform to create beautiful and engaging listening tests. It also provides a simple and intuitive dashboard to manage and analyze the results, covering both subjective results and deep learning metrics such as MOSNet [5] and NISQA [6].

References

  1. Methods for subjective determination of transmission quality. ITU-T Recommendation P.800, Aug. 1996.
  2. crowdMOS: An approach for crowdsourcing mean opinion score studies. Ribeiro et al. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
  3. Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. ITU-R Recommendation BS.1116-1, Oct. 1997.
  4. Method for the subjective assessment of intermediate quality level of coding systems. ITU-R Recommendation BS.1534-1, Jan. 2001.
  5. MOSNet: Deep Learning based Objective Assessment for Voice Conversion. Lo et al. arXiv preprint arXiv:1904.08352, 2019.
