XAudio2 - Peak and RMS

Introduction

Warning: Unfortunately, the server software interprets the multiply sign as a control char. Therefore, throughout this tutorial all multiplication signs are replaced by the degree sign. So, wherever you see this it means 'multiply'.

The XAudio2 object provides access to some of the features of Microsoft's XAudio2 API. Some of its expressions are hard to understand without a little background on sound, so that is where we start. I try to keep everything as simple as possible, so professional sound engineers may find some of the explanations unscientific. Note that the expressions I focus on are only available for sound, not music. You need to use the .ogg or .wav format to be able to use the sound architecture.

A sound is analog by nature. A computer is a digital medium.

This is important to point out, because it helps in understanding a lot of the vocabulary used when dealing with sound on a computer. We need to understand the difference between an analog signal and its digital representation. An analog signal is a continuous signal with an infinite amount of resolution. A digital signal is a sequence of discrete values. It has a limited amount of resolution.

If an analog signal is represented by a digital medium, we only get a certain subset of the original. The process of scanning an analog signal with a finite resolution is called sampling. If we sample a drawn line with a resolution of 72 pixels per inch, then we get 72 pixels per inch. The information between those points on the line is lost.

The image shows an analog and a digital line at 100% resolution and at 40%. Of course, the analog line has an infinite amount of resolution, so we get as much information at 40% as at 100%. The digital line is already just an approximation at 100% and loses information the closer we look.

For XAudio2 we are always talking about the digital representation of a sound.

Ups and Downs

A sound in an analog environment is a continuous signal with alternating current and voltage. The signal drives a magnet that swings back and forth. A thin, cone-shaped membrane attached to it produces air vibrations, which we sense as sound. The alternating voltage is often represented by a coordinate system with 0V centered, +V above and -V below. This is also true for the sound's waveforms.

The image shows a typical waveform (zoomed in to see more details) in its digital representation. The voltage is shown in another scale, dB or decibel. It describes the same signal strength, but dB is much more understandable. The coordinate system has its centered line (here -inf dB, corresponds to 0V) and the parts above (the positive voltage) and below (the negative voltage) that line. You can literally see the swinging of the magnet.

You also see little blue dots. Those are the points that were sampled from the original analog sound. All the values that the points represent are sent to the sound card when playing a sound.

A standard in modern hardware is 44.1 kHz, which is the number of sample points sent per second: 44100 sample points per second. That really is a lot of information, even on today's hardware. The visual representation can't keep up with that rate. Therefore, waveforms are buffered (a part is sent to the sound card, while the next part is loaded to a section in memory, where it waits for sending). This gives us a little more time to investigate the sound data. But still, you will never catch all sample points. To show the value of each sample point while playing the sound would require a frame rate of 44100 fps.
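To put a number on that gap, here is a tiny Python sketch of the arithmetic. The frame rate of 60 is an assumption for illustration, nothing XAudio2 prescribes.

```python
SAMPLE_RATE = 44100   # sample points per second, as stated above
FRAME_RATE = 60       # assumed frame rate of the app

# How many sample points go by between two frames of the app.
samples_per_frame = SAMPLE_RATE / FRAME_RATE
print(samples_per_frame)
```

At these rates, hundreds of sample points pass between two frames, which is why looking at a single sample per frame tells us very little.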

If we looked into the sound wave at time A (represented by the first red line in the image above), we would receive a value of -inf dB (equals silence). But if for some reason the system isn't fast enough and looks into the sound wave with a little delay, then we would receive the value at time B (the second red line), which is close to -6 dB.

The solution to this problem is to always calculate the highest value from the buffer. Now at time A or B we receive -6 dB (given the buffer is wide enough), so we don't need to worry that much about timing accuracy. That highest value is called the peak. It is not an absolute peak, but only valid for the time at which it is requested.
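The difference between reading a single sample and reading the peak of a buffer can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names `make_sine` and `buffer_peak`, the buffer length and the test tone are all made up, and XAudio2's real buffering works internally and is not exposed like this.

```python
import math

SAMPLE_RATE = 44100  # sample points per second, as in the text


def make_sine(freq_hz, amplitude, n_samples, sample_rate=SAMPLE_RATE):
    """Return n_samples of a sine wave as floats in the range -1..1."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def buffer_peak(buffer):
    """Highest absolute sample value in the buffer (0 for an empty buffer)."""
    return max((abs(s) for s in buffer), default=0.0)


# One buffer of 441 samples (10 ms) holding a 441 Hz sine at half amplitude.
buffer = make_sine(441.0, 0.5, 441)

single_sample_a = abs(buffer[0])    # "time A": a zero crossing, reads as silence
single_sample_b = abs(buffer[20])   # "time B": a slightly later sample
peak = buffer_peak(buffer)          # the same answer no matter when we ask

print(single_sample_a, single_sample_b, peak)
```

The two single-sample readings disagree, while the buffer peak stays at the half amplitude (about -6 dB) for as long as this buffer is the one being inspected.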

Peaking in XAudio2

Unfortunately, Construct uses the word "channel" for both a track and the left/right channel of a track. I will refer to Construct's sound channels as tracks, and the left/right channels as channels.

The XAudio2 object has an expression "Get peak level", or XAudio2.PeakLevel(channel). It returns the current peak of the master track. It is a stereo track, so you put the number 1 or 2 in brackets, which refers to the left or right channel of the track. The peak is post-fader, which means it returns the signal with the current volume (both master and sound track) and pan setting applied. If you set the volume to silence, the peak will also return silence, even if the heaviest sounds are playing.

The value returned by .PeakLevel is expressed as a so-called voltage ratio. You learned about alternating voltage in waveforms above. The ratio is expressed as

V/V0

In a digital environment, V0 always refers to 1V, because everything is converted from alternating to direct voltage and also normalized. The range of the normalized direct voltage is 0V to 1V. The voltage ratio returned by .PeakLevel is a float and ranges from 0 to 1, where zero means no signal strength and one means full signal strength.

It's too bad that this voltage ratio is hard to interpret. Luckily, some intelligent people found a scale that represents voltage ratio in a much more understandable way: the dBFS or decibel Full-Scale. There are two variations on Full-Scale: Full-Scale Square Wave and Full-Scale Sine Wave. XAudio2 uses Full-Scale Square Wave.

With dBFS, the full signal strength is 0 dB and no strength is -inf dB. That might sound even more complicated, but it isn't. The signal power doubles with every 3 dB, the voltage ratio doubles with every 6 dB, and we perceive a sound as twice as loud with roughly every 10 dB. The last one varies, but as an approximation it is valid. For example, a voltage ratio of 0.125 corresponds to -18 dB, 0.25 corresponds to -12 dB, 0.5 to -6 dB, etc. It is much easier to work with dB than with the voltage ratio.

The formula to convert the voltage ratio of .PeakLevel to dBFS is

20 ° log10(.PeakLevel)
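If you want to check the numbers outside of Construct, the same conversion can be written as a small Python sketch. The function name `ratio_to_dbfs` is my own, and mapping a ratio of 0 to negative infinity is a choice made here so that the logarithm never sees 0.

```python
import math


def ratio_to_dbfs(ratio):
    """Convert a voltage ratio (0..1) to dBFS; 0 maps to negative infinity."""
    if ratio <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(ratio)


# The example ratios from the text, each one half of the previous.
for ratio in (1.0, 0.5, 0.25, 0.125):
    print(ratio, round(ratio_to_dbfs(ratio), 1))
```

Every halving of the voltage ratio lowers the result by about 6 dB, which matches the -6, -12 and -18 dB examples above.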

If you have a text object named tPeak and play a sound on a track, then

+ Always

-> tPeak: Set text to

FormatDecimal(20 ° log10(XAudio2.PeakLevel(1)), 1) & "; " &

FormatDecimal(20 ° log10(XAudio2.PeakLevel(2)), 1)

will show you the current peak of both channels of the track, updated at your app's frame rate (e.g. 60 times per second). But remember, .PeakLevel returns the peak of the master track, so if you play two tracks in parallel, the peak reflects both sounds playing together. There is no track peak in XAudio2.

Updating the peak values 60 times per second is not very pleasing to the eye. A better way is to buffer the peaks and only show them at a lower rate. This way, you will not lose any peak information while the values are not being shown, but always show the highest peak so far:

+ Value 'peakL' less than XAudio2.PeakLevel(1)

-> Set 'peakL' to XAudio2.PeakLevel(1)

+ Value 'peakR' less than XAudio2.PeakLevel(2)

-> Set 'peakR' to XAudio2.PeakLevel(2)

+ Every 250 milliseconds

-> tPeak: Set text to

FormatDecimal(20 ° log10('peakL'), 1) & "; " &

FormatDecimal(20 ° log10('peakR'), 1)

-> Set 'peakL' to 0

-> Set 'peakR' to 0
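The same hold-and-reset logic, written as a Python sketch for readers who prefer to see it outside the event sheet. The class `PeakHold` and the list of frame readings are hypothetical stand-ins for the private variable and for XAudio2.PeakLevel(1).

```python
import math


class PeakHold:
    """Remember the highest level seen since the last read-out."""

    def __init__(self):
        self.held = 0.0

    def update(self, level):
        # Same test as the event: only overwrite when the new level is higher.
        if self.held < level:
            self.held = level

    def read_and_reset(self):
        # Same as the timed event: report the held value, then start over at 0.
        value, self.held = self.held, 0.0
        return value


def to_dbfs(ratio):
    """20 * log10(ratio), with 0 mapped to negative infinity."""
    return 20.0 * math.log10(ratio) if ratio > 0.0 else float("-inf")


# Hypothetical per-frame readings standing in for XAudio2.PeakLevel(1).
frames = [0.10, 0.50, 0.25, 0.05]

hold = PeakHold()
for level in frames:
    hold.update(level)

shown = hold.read_and_reset()
print(round(to_dbfs(shown), 1))
```

The displayed value comes from the loudest frame of the interval, even though that frame was not the last one before the read-out.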

If you want to get the correlation of the two channels (meaning whether the sound tends to the left or the right side), make sure neither of the two will ever be 0, because neither a division by 0 nor log10(0) gives a usable value. For example, by setting 0.0001 as the lower bound, the maximum difference between the two channels will be shown as +80 dB (to the left side) or -80 dB (to the right side):

+ Value 'peakL' less than XAudio2.PeakLevel(1)

-> Set 'peakL' to XAudio2.PeakLevel(1)

+ Value 'peakR' less than XAudio2.PeakLevel(2)

-> Set 'peakR' to XAudio2.PeakLevel(2)

+ Every 250 milliseconds

-> tPeak: Set text to FormatDecimal(20 ° log10(max('peakL', 0.0001) / max('peakR', 0.0001)), 1)

-> Set 'peakL' to 0

-> Set 'peakR' to 0
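To see where the +80 and -80 dB limits come from, here is the same expression as a Python sketch. The function name `balance_db` is an assumption of mine; only the 0.0001 lower bound is taken from the text.

```python
import math

FLOOR = 0.0001  # lower bound from the text


def balance_db(peak_left, peak_right, floor=FLOOR):
    """Left/right level difference in dB: positive leans left, negative leans right."""
    left = max(peak_left, floor)
    right = max(peak_right, floor)
    return 20.0 * math.log10(left / right)


print(balance_db(1.0, 0.0))   # everything on the left channel
print(balance_db(0.0, 1.0))   # everything on the right channel
print(balance_db(0.5, 0.5))   # both channels equally loud
```

With the lower bound in place, the result stays between about -80 and +80 dB, and equal channels give 0 dB.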

But remember: the peak level is post-fader!

You may also want to find the power difference (I like to call it the "pumping") for a certain period of time. Just use two variables, instead of one. For example, to get the power difference in the left channel for a period of 250 ms:

+ Start of layout

-> Set 'lowL' to 1

+ Value 'highL' less than XAudio2.PeakLevel(1)

-> Set 'highL' to XAudio2.PeakLevel(1)

+ Value 'lowL' greater than XAudio2.PeakLevel(1)

-> Set 'lowL' to XAudio2.PeakLevel(1)

+ Every 250 milliseconds

-> tPeak: Set text to FormatDecimal(20 ° log10(max('highL', 0.0001) / max('lowL', 0.0001)), 1)

-> Set 'highL' to 0

-> Set 'lowL' to 1
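Here is the high/low tracking once more as a Python sketch. The class `LevelRange` and the frame readings are hypothetical; the reset values (0 for the high mark, 1 for the low mark) and the 0.0001 lower bound follow the events above.

```python
import math

FLOOR = 0.0001  # lower bound, as in the events above


class LevelRange:
    """Track the highest and lowest peak level within one interval."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.high = 0.0   # any real reading is at least this high
        self.low = 1.0    # any real reading is at most this high

    def update(self, level):
        if self.high < level:
            self.high = level
        if self.low > level:
            self.low = level

    def spread_db(self):
        """Distance between the high and low mark in dB (the "pumping")."""
        return 20.0 * math.log10(max(self.high, FLOOR) / max(self.low, FLOOR))


# Hypothetical readings of XAudio2.PeakLevel(1) within one 250 ms interval.
tracker = LevelRange()
for level in (0.8, 0.1, 0.4):
    tracker.update(level)

spread = tracker.spread_db()
print(round(spread, 1))
tracker.reset()
```

Starting the low mark at 1 matters: if it started at 0, no reading could ever be lower and the spread would always run up against the lower bound.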

Let your imagination run wild and you will see the potential of all this. Imagine a rotary sound playing and setting the y-position of an enemy in a horizontal shooter according to the correlation of the channels! Imagine a simple drum loop with high dynamics that lets the bunny in a platform game jump with the "pumping"! I'm sure you will find thousands of other possibilities.

Root mean what?!

In the same way as the peak is reported, XAudio2 also offers the RMS value. Wait, RMS? What's that? RMS stands for Root Mean Square, and the name perfectly describes how it is evaluated. Although it gets a little more complicated when it comes to infinite, continuous streams of data, the simplest form of getting the RMS is

sqrt((x1^2 + x2^2 + ... + xn^2) / n)

This is the same waveform as in the last image, but this time the resolution is much lower. There are only 7 sample points. While we can easily see that the highest peak is close to 0 dB, we can't see the effective level of that sound. A peak may only be there for a fraction of a second, which doesn't mean we perceive this sound as playing at 0 dB. Let's try our RMS formula (I am estimating the voltage ratios from what I see):

RMS = sqrt((0.0^2 + 0.75^2 + 0.9^2 + 0.125^2 + 0.7^2 + 0.95^2 + 0.25^2) / 7) ~= 0.64 ~=> -3.9 dB

A perfect full-scale sine wave would have an RMS level of -3 dB, so I'm not too bad with my guessing of the 7 values. So, we now know that although the sound may have louder peaks, the overall perceived level is -3.9 dB. There is a dynamic in the sound, and the range of that dynamic is peak - rms = 0 - -3.9 = 3.9 dB.

If you already know Dancer, create a sine wave with a sound editor (e.g. Audacity or Wavosaur) and play it in Dancer. It will show a peak of 0 dB and an RMS of -3 dB (well, there is rounding going on in Dancer, so it might also be -2.9 or -3.1 dB).
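Both numbers are easy to re-check with a short Python sketch. The helper names `rms` and `to_dbfs` are my own. The seven values are the estimates from the text, and the sine wave is one generated cycle at full scale, not actual XAudio2 output.

```python
import math


def rms(samples):
    """Root mean square: square each sample, average, take the square root."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def to_dbfs(ratio):
    """20 * log10(ratio), with 0 mapped to negative infinity."""
    return 20.0 * math.log10(ratio) if ratio > 0.0 else float("-inf")


# The seven estimated sample values from the text.
samples = [0.0, 0.75, 0.9, 0.125, 0.7, 0.95, 0.25]
level = rms(samples)
print(round(level, 2), round(to_dbfs(level), 1))

# One full cycle of a full-scale sine wave, for comparison.
sine = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]
sine_db = to_dbfs(rms(sine))
print(round(sine_db, 1))
```

The first line reproduces the roughly 0.64 and -3.9 dB from the hand calculation, and the sine wave lands at about -3 dB.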

The XAudio2 object returns the RMS value just like peak as a voltage ratio with V0 = 1, so you can use just the same formula for converting to dBFS:

20 ° log10(.RMSLevel)

Also, you have to specify the left or right channel in brackets, 1 for left, 2 for right.

Dynamic, baby!

With both PeakLevel and RMSLevel, you are able to calculate the dynamic range at any given time:

+ Value 'peakL' less than XAudio2.PeakLevel(1)

-> Set 'peakL' to XAudio2.PeakLevel(1)

+ Value 'peakR' less than XAudio2.PeakLevel(2)

-> Set 'peakR' to XAudio2.PeakLevel(2)

+ Value 'rmsL' less than XAudio2.RMSLevel(1)

-> Set 'rmsL' to XAudio2.RMSLevel(1)

+ Value 'rmsR' less than XAudio2.RMSLevel(2)

-> Set 'rmsR' to XAudio2.RMSLevel(2)

+ Every 250 milliseconds

-> tPeak: Set text to FormatDecimal((20 ° log10(max('peakL', 'peakR'))) -

(20 ° log10(min('rmsL', 'rmsR'))), 1)

-> Set 'peakL' to 0

-> Set 'peakR' to 0

-> Set 'rmsL' to 0

-> Set 'rmsR' to 0
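The final expression (highest peak in dB minus lowest RMS in dB) looks like this as a Python sketch. The function names and the example levels are made up for illustration. The 0.0001 lower bound is an assumption I add here so that a silent channel cannot send 0 into log10; the events above do not include it.

```python
import math

FLOOR = 0.0001  # assumed lower bound so log10 never sees 0


def to_dbfs(ratio, floor=FLOOR):
    """20 * log10(ratio), with the ratio clamped to the lower bound."""
    return 20.0 * math.log10(max(ratio, floor))


def dynamic_range_db(peak_l, peak_r, rms_l, rms_r):
    """Highest peak minus lowest RMS, both in dBFS, as in the timed event."""
    loudest_peak = max(peak_l, peak_r)
    quietest_rms = min(rms_l, rms_r)
    return to_dbfs(loudest_peak) - to_dbfs(quietest_rms)


# Hypothetical held values for one 250 ms interval.
print(round(dynamic_range_db(1.0, 0.9, 0.7, 0.64), 1))
```

With a full-scale peak and an RMS of 0.64, this gives the same 3.9 dB as the worked example in the RMS section.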

Using scales

There are different ways to work with the dB values. Dancer uses a linear scale, spreading the dB values evenly over the range. Another way is a curved scale, where you have more space (or more intervals) the closer the level gets to 0 dB, and less space towards -inf dB. This is what you will see most often in audio workstations or DAWs, where there is limited space for a meter. It can easily be achieved by using the following trick:

1) Make your dB range relative to 90. That is, factor = 90 / myRange. For example, Dancer uses a range of 60 dB, 90 / 60 = 1.5

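2) Apply this to the sin expression, sin('peakL' ° factor). Here 'peakL' has to hold the level in dB (from 0 down to -myRange), not the voltage ratio, and sin has to work in degrees, which is why the range is scaled to 90. This will return a value from 0 (=highest peak) to -1 (=lowest peak)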

3) Multiply this with the value you need: sin(...) ° area.

For example, to draw the position of the peak within 200 pixels, it's

angle = 'peakL' ° factor

position = sin(angle) ° 200

If you take the abs() of that, 0 dB will be drawn at 0 and your lowest dB at 200.
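Here is the whole trick as a Python sketch, in case you want to try out different ranges first. The function name `meter_position` and the clamping step are my own additions. Python's sin() takes radians, so the angle is converted, on the assumption that the sin expression in the steps above works in degrees.

```python
import math


def meter_position(level_db, range_db=60.0, area=200.0):
    """Map a dB value (0 down to -range_db) to a curved position in 0..area."""
    # Clamp to the meter range so louder or quieter values stay on the scale.
    level_db = max(-range_db, min(0.0, level_db))
    factor = 90.0 / range_db          # step 1: make the range relative to 90
    angle_deg = level_db * factor     # step 2: 0 down to -90 degrees
    # Step 3, with abs() so that 0 dB is drawn at 0 and the lowest dB at `area`.
    return abs(math.sin(math.radians(angle_deg))) * area


for db in (0, -6, -20, -40, -60):
    print(db, round(meter_position(db), 1))
```

The top third of the range (0 to -20 dB) takes up half of the 200 pixels, which is the extra space near 0 dB that the curved scale is meant to give.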

Surprise, surprise

For all of you who were interested enough to follow this rather long tutorial, I have a present: the complete source for the left part of Dancer's window, meaning the full functionality but without the dancer. Download it here: drm.rar
