2 | Method
This chapter consists of three sections. First, the chosen methods are motivated; secondly, possible but not chosen methods are highlighted; and lastly, the execution of the motivated approach is described.
2.1 Chosen methods
To answer RQ1 a quasi-experiment is chosen. Wohlin et al. (2012) describe this as an empirical method where the assignment of treatments to subjects or objects cannot be based on randomization. It is also defined as a methodical and systematic study conducted under strict and regulated conditions: the essential elements of the method are identified and intentionally altered or controlled while all other factors of the quasi-experiment are kept unchanged. This method is well suited for the present scenario, since the only factors that vary are the treatments (codecs/bitrates).
A controlled experiment is utilized to address RQ2. This method is based on randomization (in contrast to the quasi variant), where different treatments are applied to different subjects (Wohlin et al. 2012). Several tests and methods are available for comparing sound quality; however, listening trials are regarded as the most reliable practice for audio quality assessment. The MUSHRA test (ITU-R 2014) requires that the subjects listen to all the sound samples (treatments) in a randomized order, which makes it appropriate for this particular experiment, and it is typically utilized for the evaluation of low-bitrate codecs (Zielinski et al. 2008).
The last method is a survey, chosen in order to answer RQ3. A survey is useful when people’s understanding, points of view or conduct are of interest. Common tools for this are interviews and questionnaires, which can be utilized to make comparisons or draw direct conclusions; the results can be either qualitative or quantitative (Wohlin et al. 2012). IJsselsteijn et al. (2013) have developed a questionnaire to analyze thoughts, feelings and the overall experience of a gaming scenario. This served as an inspiration, since RQ3 targets individuals’ attitudes towards auditory quality in scenarios where low latency is crucial.
2.2 Alternative methods
An alternative way to approach the aim of this research could be a case study, defined as an observational study that investigates one or a set of circumstances within its real-life context (Wohlin et al. 2012). This could be done by observing musicians or gamers who use True Wireless Stereo (TWS) earbuds in a real-time interactive scenario. A great advantage of this method is that the audio codec operates in its true environment, where the actual hardware is utilized in combination with the full Bluetooth stack, which would probably generate results more representative of real-life situations. However, the major drawback is the difficulty of separating the hardware from the codecs, making the method more suitable for a product comparison and not a good option for the intent of comparing codecs.
A systematic literature review could also be a path towards the goal of this study. Wohlin et al. (2012) define the purpose of this method as answering a research question based on all available evidence related to it. In the last decade there has been an enormous increase in the market for TWS earbuds, which has spurred a technical evolution in this field that is still ongoing. In parallel, there is a growing amount of data and opinions in articles concerning which brand or codec outperforms the others. Even though this is all interesting, the lack of scientific research becomes increasingly prominent as well. A literature review is therefore problematic, in the sense that the available articles are more concerned with marketing and opinions, with (sometimes) hyperbolic language, which makes it hard to draw factual conclusions and, more importantly, makes a scientific approach impossible.
2.3 Implementation
The following sections outline the practical execution of the experiments and the survey, i.e., a detailed explanation of how this project was carried out.
Software
In order to perform the experiments, three separate open-source libraries from GitHub were utilized: one for each codec, and a third for the MUSHRA test:
•liblc3 by asoulier (n.d.)
•libopenaptx by pali (n.d.)
•webMUSHRA by fzalkow (n.d.)
The first two libraries enabled encoding and decoding of audio files, whilst the last one was an application for a MUSHRA test, following its rules and guidelines.
Audio samples
Initially a decision had to be made on what kind of audio samples would be suitable for recognizing audio deficiencies at lower bitrates. Hoene & Hyder (2010) describe how the SBC codec has a hard time compressing sounds with a higher tonality, such as the sound of a flute. This was recognized with the chosen codecs of this study as well. After listening to different songs and a pure sine wave (both at multiple bitrates), going from 20 to 12,544 Hz, it was concluded that a symphonic classical piece, consisting of strings and flutes, would be the most relevant sample to use, since it was easiest to detect shortcomings in such examples. By contrast, it was hard to distinguish imperfections in a full-band pop setting with a plethora of instruments and audio details going on at the same time. Furthermore, it was recognized that Bluetooth SIG (n.d.) used a similar musical piece on their website when evaluating different compression rates for SBC and LC3 in a Bluetooth Audio Codec Demonstration.
All uncompressed audio files in this project were in the Waveform Audio File Format (WAV) and had a sample rate of 48 kHz with a bit resolution of 24. This was merely because both libraries supported this format, thus enabling a direct comparison, even though a sample rate of 44.1 kHz with a bit resolution of 16 would have been sufficient. The samples were created and bounced from Logic Pro (v. 10.7.7) on a Mac and then exported from Audacity (v. 3.2) on a PC, since that was the only way to make both libraries accept the files.
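These format parameters can be verified directly from a file's header. A rough sketch in C, assuming the canonical PCM layout where the "fmt " chunk follows immediately after the RIFF header (which is what most DAW exports produce; the helper names are ours):

```c
#include <stdint.h>

/* Read the sample rate and bit depth from a canonical PCM WAV header,
 * i.e., one where the "fmt " chunk starts at byte 12. Files with
 * extra chunks before "fmt " would require a full chunk scan. */
static uint32_t wav_sample_rate(const uint8_t *hdr)
{
    /* little-endian 32-bit value at byte offset 24 */
    return (uint32_t)hdr[24] | (uint32_t)hdr[25] << 8
         | (uint32_t)hdr[26] << 16 | (uint32_t)hdr[27] << 24;
}

static uint16_t wav_bits_per_sample(const uint8_t *hdr)
{
    /* little-endian 16-bit value at byte offset 34 */
    return (uint16_t)(hdr[34] | hdr[35] << 8);
}
```

For the files used here, these helpers would return 48000 and 24 respectively.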
For the quasi-experiment the approach was to encode and decode a whole song, in order to give the coders a substantial workload and let them show their true colors. It was also a way to scale up the execution time to values more relatable for humans to grasp. As Jacob has a documented history of working with music, he could contribute one of his songs (Milton 2020), with a length of 4 minutes and 12 seconds. This particular piece has a dynamic and wide spectrum of audio information, which was chosen to push the limits of the codecs.
In preparation for the next experiment a 12-second classical arrangement was recorded by Jacob, where the length of the audio file was decided by the upper limit of the MUSHRA application.
Quasi-experiment
As a first step the test environment had to be set up which included both hardware and software. A PC was utilized with the following components:
•AMD Ryzen 5 2600X
•Samsung 970 EVO NVMe™ M.2 SSD
•DDR4 16 GB
•Windows 10 Pro
•VirtualBox (v. 6.1.34)
–Two cores
–Dynamic storage
–6039 MB of RAM
–Ubuntu (v. 20.04.3 LTS)
One critical aspect when performing the experiment was to minimize internal processes as well as any external interference. In order to achieve this, all unnecessary background processes were terminated and the internet connection was deactivated.
A timer was implemented for the codec libraries in order to measure the execution time, as depicted in Figure 2.1, which also required including the header file "time.h". The programming language was C in both cases, making it possible to utilize the same solution and thus maintain a fair comparison between the independent variables.
Figure 2.1: Timer
Furthermore, a bash script was designed to execute the encode/decode sequence of the audio sample 100 times. This was done for both libraries/codecs and multiple bitrates.
Figure 2.2: Bash script
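A script along these lines could look as follows; the encoder/decoder invocations are hypothetical placeholders for the actual library calls, not the exact script used:

```shell
#!/bin/bash
# Repeat the encode/decode sequence 100 times for one codec/bitrate
# combination. The commented-out calls are placeholders for the
# actual library invocations.
N=100
BITRATE=128000
for i in $(seq 1 "$N"); do
    : # ./encode input.wav tmp.bin "$BITRATE"
    : # ./decode tmp.bin output.wav
done
echo "completed $N runs"
```

Running the loop once per codec/bitrate combination keeps the measurement conditions identical between treatments.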
For LC3 the bitrate could be chosen freely between 16 and 320 kbps, and the following were selected:
•64 kbps
•96 kbps
•128 kbps
•192 kbps
•256 kbps
In the case of aptX there were two compression ratios to choose from, namely 6:1 and 4:1. With a source file of 2304 kbps this resulted in these two bitrates:
•384 kbps
•576 kbps
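The source bitrate of 2304 kbps follows directly from the file format (48,000 samples per second × 24 bits × 2 channels), and dividing by the two compression ratios yields the bitrates above. As a quick sanity check:

```c
#include <stdio.h>

/* Sanity check of the aptX bitrates: the uncompressed WAV source is
 * 48 kHz * 24 bit * 2 channels = 2304 kbps; the 6:1 and 4:1 ratios
 * reduce this to 384 and 576 kbps respectively. */
static int source_bitrate_kbps(void)
{
    return 48000 * 24 * 2 / 1000;  /* 2304 */
}

static int compressed_kbps(int source_kbps, int ratio)
{
    return source_kbps / ratio;
}
```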
The LC3 bitrates were decided by two factors, the first being pure listening: at the bottom level it was decided that 64 kbps was bad enough, while at the higher bitrates it was hard to discern differences from 128 kbps and upwards. The second factor, serving as a sort of confirmation, was the fact that Bluetooth SIG (n.d.) used the same bitrates for their Bluetooth Audio Codec Demonstration.
Participants
To conduct a MUSHRA audio test, appropriate subjects must be found. According to ITU-R (2014), it is essential when performing a MUSHRA test to choose assessors who are experienced at listening to sound in a critical way. This is of more relevance the higher the audio quality to be tested, whereas the low-bitrate audio compared in this research is focused on the opposite side of that spectrum, i.e., deficiencies that most people should be able to detect. Nevertheless, to evaluate whether subjects qualify as experienced, ITU-R (2014) presents two different methods: a pre-screening of assessors, or a combination of the former and a post-screening method. Performing a pre-screening of assessors would have required the subjects to participate on more than one occasion, which probably would have made them think twice before volunteering. It would also have taken more time, which was not available. So, to validate the assessors’ capabilities of hearing audio impairments in this experiment, the post-screening method was performed.
For the survey, which aims to answer RQ3, the population of interest was people with experience of utilizing audio devices in an interactive scenario. This could include almost anyone, since Voice over Internet Protocol (VoIP) (Zoom, Discord, FaceTime etc.) is widely used today. However, the goal was also to target two groups that could bring interesting views when it comes to testing the limits of this technology: gamers and musicians.
The audio test and the survey are tightly connected, as the survey questions evaluate people’s attitude toward the audio samples presented in the MUSHRA test. This means that the subjects were expected to perform them in combination, first the audio test and then the survey, so the participants had to fit the profile for both.
In order to recruit subjects for all groups, an invitation was sent to the music and audio department at the University of Skövde, where both teachers and students were invited. In addition, invitations were sent to hand-picked people with several years of experience of gaming, playing music and utilizing VoIP.
MUSHRA
Once again, a test environment had to be prepared, consisting of relevant hardware and software. For this experiment a MacBook Pro 2017 was utilized in combination with a pair of Sennheiser HD 650 headphones. These are mostly used by professionals when mixing and mastering in a studio setting; hence, they are able to showcase a detailed audio spectrum, in contrast to standard commercial headphones (Hines et al. 2014). Two programs were installed: XAMPP (v. 8.2.0-0) and the MUSHRA application. The former is needed to create a web server that can host the latter, since the test runs in a web browser.
The MUSHRA application is designed to run a listening assessment test according to the rules described by ITU-R (2014). This is done by providing an intuitive Graphical User Interface (see Appendix A) where a set of audio samples can be listened to and rated concurrently. The samples are compared to an uncompressed (high-quality) reference track. Once the play button is pressed, the user can freely switch between samples while the audio continues uninterrupted at the same position. It is also possible to loop a specific section. The application automatically randomizes the order of the samples before each new run and thus establishes a double-blind scenario, i.e., both the participant and the conductor are unaware of the order of the samples beforehand. In addition, a hidden copy of the reference as well as two anchors (degraded samples) are created and hidden amongst the others. The idea is that any subject who gives these samples odd ratings should be excluded in the post-screening.
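As an illustration, one commonly cited post-screening criterion from the MUSHRA recommendation is to exclude an assessor who rates the hidden reference below 90 in more than 15% of the trials. A sketch of such a rule (the exact criterion applied in a given study may differ):

```c
#include <stddef.h>

/* Post-screening sketch: returns 1 (exclude) if the assessor rated
 * the hidden reference below 90 in more than 15% of the trials,
 * otherwise 0. The thresholds illustrate one commonly cited MUSHRA
 * criterion; they are an assumption, not the application's code. */
static int exclude_assessor(const int *hidden_ref_scores, size_t n)
{
    size_t low = 0;
    for (size_t i = 0; i < n; i++)
        if (hidden_ref_scores[i] < 90)
            low++;
    return low * 100 > n * 15;  /* strictly more than 15% of trials */
}
```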
When performing the test, each participant started by accepting a form of consent (Appendix C). After this, the subjects were instructed to rate how the samples fared in comparison to the reference track on a scale from 0 to 100, where the reference was explained to represent a score of 100 points. The rating scale was outlined as follows:
•0 - 20 Bad
•20 - 40 Poor
•40 - 60 Fair
•60 - 80 Good
•80 - 100 Excellent
They were also told to set a comfortable volume at the beginning and not to change it during the test. The tests were performed in quiet locations with no time constraints. Below are some pictures (Figure 2.3 & 2.4) of the studio at the University of Skövde where the students and teachers of the music and audio department performed their tests.
Figure 2.3: Booth to fill out form of consent
Figure 2.4: Shielded location for the test
Survey
To set up a questionnaire for the survey, a web application called Jotform was used (Appendix B).
After an initial introduction text, the survey consisted of two parts, the first being a profiling section to get a deeper understanding of the individual’s previous experience in three different scenarios:
•VoIP
•Playing games
•Playing music
After a set of questions, the participant was asked to pick one of these scenarios as a base point before continuing with the questionnaire. In the second part the subject had to circle back to the samples used in the MUSHRA test, listening again if they wished to do so. With these in mind they were asked to score, from 0 to 5, how satisfied they were with the samples they had rated within a specific interval of the rating scale utilized in the MUSHRA application. This was first done for the chosen interactive scenario and then for a non-interactive scenario. A question for a specific interval in the interactive scenario was phrased like this:
•For this scenario, how satisfied would you be (zero being the lowest) with the samples you rated... 80-99?
For the non-interactive scenario the questions were outlined like this:
•In a non-interactive scenario, such as watching a movie or listening to music, how satisfied would you be with the samples you rated. . . 80-99?
At the end of both the interactive and the non-interactive section an open-ended question was asked where the individual could elaborate freely if they had some further thoughts or comments.
