Efficient and Perceptually Plausible 3-D Sound For Virtual Reality



>>Yeah. So hello everyone here at the Redmond lab. It's my pleasure to introduce Fabian Brinkmann. Is it Dr. Fabian Brinkmann, may I call you that?>>Is it safe for you to say that? I'm not sure if I'm already a doctor.>>Yes, Dr. Fabian Brinkmann, who did his PhD at TU Berlin, the Technische Universität Berlin, on virtual acoustics for audio and virtual reality, and he will present his internship talk today about efficient and perceptually plausible 3-D sound for virtual reality. With that, the stage is yours, Fabian.>>Thank you, Sebastian, for the introduction. Let's get right to it and see what's hiding behind that awfully long and
complicated title. So when we are in virtual reality, we want to provide perceptually
plausible audio for that. So it means it should match the listener’s expectations
towards what they’re seeing. For example, if there’s
a speaker in the room, in the virtual or augmented reality, the audio should match the location of that speaker, and it also should match the acoustics of the environment which we're in. If we succeed in doing that, this will improve the presence and immersion of the simulation; if we fail, it can degrade them to a certain degree. The problem with that is
most often on those devices, the graphics rendering consumes most of the compute and physically
correct or physically good room acoustical simulation
is also very expensive. So there’s a conflict
there that we need to moderate and let’s try to
find a way around that. This is what my talk will be about. So before we get to it, I want to briefly introduce spatial hearing because that will be relevant for the talk. The picture here shows that: the person on the shore is trying to analyze the scene on the lake, for example, to find out where the objects are located, but she's doing that not by looking at the water, but only by observing the movements of the waves at the two membranes that you see here, and that is exactly what our auditory system is doing when analyzing the sound
pressure signals at our ears. This might look a bit like this. For example, if we have a sound source coming from the left, we will observe these pressure signals at our ears, and because our ears are spatially separated, we will have a so-called interaural time difference, and there's also a head shadow which causes an interaural level difference. These are the two main cues for estimating the left-right source position. However, those cues aren't of much help if a sound source is to our front, because then these interaural differences, the differences between the ears, will be too small to evaluate. In that case, we have to look into the spectrum, where we find so-called monaural spectral cues. These are dips and peaks in the spectrum which are specific to the height of a sound source. So these are learned patterns that we can exploit to estimate the height, so the up-down direction, of a sound source.
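As a rough illustration of these two cues, they could be estimated from a pair of ear signals along the following lines; the ±1 ms lag limit and the broadband level ratio are illustrative choices of mine, not specifics from the talk.

```python
import numpy as np

def interaural_cues(left, right, fs):
    """Rough ITD/ILD estimates from two ear signals (illustrative sketch)."""
    # ITD: lag of the cross-correlation maximum, restricted to +-1 ms
    max_lag = int(1e-3 * fs)
    xcorr = np.correlate(left, right, mode="full")
    center = len(right) - 1                      # index of zero lag
    window = xcorr[center - max_lag:center + max_lag + 1]
    itd = (np.argmax(window) - max_lag) / fs     # in seconds
    # ILD: broadband level difference in dB
    ild = 10 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    return itd, ild
```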
So this shows that everything we need for spatial hearing is contained in our ear signals, and these signals we call head-related impulse responses if we're in the time domain, or head-related transfer functions if we're in the spectral domain. I will use those two terms interchangeably because, for us in this talk, the difference doesn't matter. So how can we use these transfer functions
for acoustic simulation? We will most likely
start with a room model, which contains the scene geometry, but also information about the acoustic properties
of the surfaces, and using that together
with a model of our sources and receivers,
and most important, an algorithm that computes the
sound transmission in this room, we can simulate the
so-called impulse response, or room impulse response. This impulse response is a mathematical construct that contains all the information about the room, so we see the direct sound here and some reflections. Because we simulated it, they also have directional information attached to them, so we call it the spatial room impulse response, which means we know where the direct sound is coming from, and the other reflections too, and we can use that together with our head-related
transfer functions to make the simulation
audible for the listener. However, as I said, if you want to do that physically correct, that is often very expensive, and we won't have the computing power to do that, in some applications at least. So a way around that would be to try to derive a parametric time representation of what we simulated. This would mean we can simulate offline and then try to derive this representation. A very common representation would be to have a direct sound, or first sound, which has a time of arrival, a direction of arrival, and a level, followed by some early reflections with the same information, and then at some point, we will arrive at the so-called perceptual mixing time. This is the line here. After that time, we are not able to perceive any directional information. So once we are there, we can have a very simple model for the late reverberation, which might be as simple as a decaying noise, and this might be that part of the impulse response. The advantages are: it is relatively
cheap to render in real time, yet it can be very plausible perceptually. The parameters are of low memory cost to store for real-time rendering, and it also allows aesthetic modifications. So we can at any time change, for example, the level or the decay rate of the late reverberation if we want to do some artistic changes, say for game developers or audio producers.
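To make this concrete, here is a minimal sketch of how such a parametric room impulse response could be synthesized; the sampling rate, arrival times, gains, mixing time, and decay below are hypothetical placeholders, and the directional rendering with head-related impulse responses is left out for brevity.

```python
import numpy as np

fs = 48000
h = np.zeros(int(0.3 * fs))                      # 300 ms parametric impulse response

# hypothetical parameters: (time of arrival in s, linear gain)
direct = (0.005, 1.0)
reflections = [(0.012, 0.5), (0.019, 0.35), (0.027, 0.3)]
for toa, gain in [direct] + reflections:
    h[int(toa * fs)] += gain

# late reverberation: exponentially decaying noise after the mixing time
mixing_time, rt60 = 0.04, 0.5                    # seconds, hypothetical values
t = np.arange(len(h)) / fs
tail = np.random.randn(len(h)) * 10 ** (-3 * (t - mixing_time) / rt60)
tail[t < mixing_time] = 0
h += 0.1 * tail                                  # arbitrary tail level
```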
So this sets the scene for what I want to do. I want to have a parametric encoding of spatial impulse responses, and I will especially focus on including early reflections in this pipeline. However, the encoding,
which can be offline, has some tricks to it, because whether or not a reflection might be audible depends on the environment, and it depends on the audio content that we're playing. At the same time, we want to have a scalable solution. So for example, if our application tells us "be as realistic as you can," we want to be able to provide that; if it tells us "our guy is playing a really fancy game, we don't have much compute left, here is what you have, deal with it," we want to account for that too. We want to have a smooth spatial distribution of the parameters, meaning if I extract them at two positions in space that are close together, the parameters should be similar. If they're not, as I walk through a simulation and we render it, there might be weird jumps, for example in the loudness. Then ideally, but I won't focus on that, the encoding should be of low cost, because in one of the applications we have in mind there will be a massive parallelization of encoding, lots of positions in space at the same time, and it should also be of low memory, which means it shouldn't rely on too many future or past points, because otherwise we would have to store too much in memory, which is also a drawback. Then the decoding should be in real-time and, I'm
repeating the title here, it should be efficient and
perceptually plausible. So these are our goals. Let’s try to have a look at the underlying perceptual
phenomena that we can exploit. So we want to get an idea of what reflections
might be relevant and how we might be able to extract them and this will lead us to
the so-called precedence effect. This effect describes
how the perception of one sound changes if a second sound which is spatially separated
plays a delayed copy of that. So if they are playing at the same time, there is a phenomenon called summing localization, and there will be a phantom source between those two loudspeakers. This is what we usually exploit in stereo listening, known for quite a long time. As the delay between those sources increases, summing localization breaks down. If it increases more, we're in the zone of localization dominance, which means we perceive the sound as coming from the direction of our first speaker. This however does not mean that this guy here will be inaudible; it still can have audible contributions. For example, it can change the timbre of this first sound here, and it might change the perception of the source width. So localization is only one aspect we have to deal with. These first two zones are what we usually deal with
in room acoustics. We won’t go into the echo region, where we hear two
separate sound sources. What’s of course also
important in room acoustics is that as the sound travels through time and space, so to say, and bounces off the walls, the energy gets reduced due to air damping and absorption on the walls. So we really have delayed and reduced copies of our original sound. What you see here are so-called masking thresholds, which depend on time and level, and whenever a reflection is below this line, it will be inaudible. So it has no effect on what we perceive, which means we could as well discard it. Depending on what you look at, these curves have different shapes. If you're in an anechoic environment, it's just a simple line. If you add reverberation to the equation, this changes. There are two curves that have been proposed; I don't want to go too much into detail. I will go with this V-shaped curve. It shows an increased sensitivity after the direct sound, and then as the reflections and reverberation kick in, this sensitivity decreases again and reaches a saturation. So this is what we want to mimic. Besides the dependency on
the level there’s also a dependency on the
spatial separation, which means a reflection has
a higher potential to be audible if it’s spatially
separated from the first sound. This is an effect that
can be up to 10 or 15 dB. So we know what we want
to do and we might have an idea of what we have to
consider and how we might do it. So let’s try something. Before we can do that we
have to get some data to work on which is why I generated a database consisting of nine rooms. They have three different
reverberation times from rather dry to rather wet and
three different volumes. To cover a large range of coastal conditions that we
might encounter in real life. I use two different simulation
methods to get my database. The first one is an image
source model which calculates the exact direct sound and early reflections and I
use the image source model. It’s the black contribution here of the room impulse response up to 1.5 times the perceptual mixing time. For this room it’s about here, and then after that I used a simple decaying noise
for the late bilberation. I did that because my algorithm focuses on the early reflections
and I wanted to make sure that in my renderings that I will later compared
to this reference all the differences
that you hear will be related to the other reflections
and what I do to them. This is another way of looking
at this impulse response. This now additionally shows
you the spatial information. So here we still have the time axis. Each dot here is one reflection. On this side, the color will show you the amplitude of the reflection, and in this plot, it will show you the left-right direction; zero means it's coming from the front. So this is the direct sound, 90 degrees means it comes from the left, and minus 90 means it comes from the right. The bottom plot here shows you the polar angle, which is the up-down direction. Zero is from the front, minus 90 is below, and this would be above, behind, and below again, so it goes around us. This will be the representation that I will be using for most of the rest of the talk. The second simulation method I used is called Triton, which is developed here
at Microsoft research. It is a wave based
simulation and due to its computational complexity
and memory consumption, I simulated only the small room, so 200 cubic meters, and had an upper cut-off frequency of eight kilohertz. Triton simulates the pressure impulse response. We also want to have directional information, and we can obtain that from the so-called sound intensity. You can think of that as a Cartesian vector with x, y, z information that will give you information about the direction of arrival of each sample in your impulse response. I won't go into details here. What is important is that we have to apply a low-pass to the direction information, and we will see why on the next slide. Keep in mind I applied the low-pass at two kilohertz.
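One possible way to turn such an intensity vector into a per-sample direction of arrival, including the 2 kHz low-pass, is sketched below; the function name, the filter order, and the sign convention are my assumptions and not taken from Triton itself.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def direction_from_intensity(ix, iy, iz, fs, f_cut=2000.0):
    """Per-sample direction of arrival from a Cartesian intensity vector (sketch).
    Note: whether the vector points towards the source or along the propagation
    direction depends on the convention, so the sign may need to be flipped."""
    sos = butter(4, f_cut, btype="low", fs=fs, output="sos")
    ix, iy, iz = (sosfiltfilt(sos, v) for v in (ix, iy, iz))
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation
```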
So here we see the spatial room impulse response simulated with Triton. The green dots are the Triton impulse response, and the red dots are the positions obtained with the image source model, so with the first method, which doesn't have a band limitation. If we look at the first part, the green dots and the red circles
seem to agree pretty well. However, what we already
see there is a spread here. So ideally, there should be
one dark green dot here, but we see some spread, and this is due to the band
limitation of the simulation. A band limitation means that our time signals will have pre- and post-ringing to them. There's basically no good way around it, and because we derive our spatial information, so our direction information, also from those pressure responses, it means that there is ringing and oscillation in the spatial information, which causes this spread here.>>With that explanation for the areas where there's a correspondence between the red and green, what about that very thin green dot up at the top? Above 45 degrees there's no corresponding red dot. Do you know where that might come from? Is that an artifact of the simulation or [inaudible]?>>Yeah, that might be related to noise in the simulation, and keep in mind this is very low. So it's maybe between minus 20 and minus 30 dB down. But I think that's noise, so that mechanism doesn't account for it. It does however account for this: you see here we have something coming from the front and from the back, because the direction information is only contained in the z direction
and if that oscillates, it changes between front and back.>>So how did you obtain
those green dots? Did you get that from Triton or did you extract that
yourself by picking or?>>No. So this is the
raw data obtained from Triton combining the
pressure impulse response with the direction information. So no processing yet. If we go further into time, we see that this widening gets worse and worse, and we obtain lines rather than points. The reason for that is the low-pass that we had to apply. We had a 2 kHz low-pass, which means we can only account for up to two reflections arriving per millisecond. For rectangular rooms, we can calculate the so-called reflection density, which will show us exactly at what point in time we exceed this limit, and for a room of this volume we see that at about 25 to 30 milliseconds this limit is exceeded, and it's pretty much at that point in time where our image here gets very blurry and messy. So we have to keep in mind that we might get valid results only in this region here.
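This back-of-the-envelope check is easy to reproduce with the classic image-source reflection-density estimate for a shoebox room; the code below just plugs in the 200 cubic meter volume and the two-reflections-per-millisecond limit mentioned in the talk.

```python
import numpy as np

c = 343.0       # speed of sound in m/s
V = 200.0       # room volume in m^3 (the small room from the talk)
limit = 2000.0  # reflections per second, i.e. 2 per millisecond (2 kHz low-pass)

# image-source reflection density for a shoebox room: dN/dt = 4*pi*c^3*t^2 / V
t = np.sqrt(limit * V / (4 * np.pi * c ** 3))
print(f"density limit exceeded after {t * 1e3:.1f} ms")   # roughly 28 ms
```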
>>Shouldn't we [inaudible] also look at the time and the frequency together, if they are in 2-D?>>Can you say it again?>>I'm also considering the frequency domain, how it works in different ways.>>So for now, because
I was focusing on the early reflections I have no frequency dependency
in my simulation. But there would be ways
of considering that. So this is what I propose
for the encoding and we will start with estimating
the direct sound. This means we're estimating the time of arrival, amplitude, and direction of arrival, and this here will be our left-right angle, the lateral angle, and this the up-down angle, or polar angle. Once we have that, we will initiate a masking threshold starting at that direct sound, and every contribution that exceeds that masking threshold we will consider a potentially audible reflection in the first step, and if that happens, we will assign the same parameters to those contributions. This will give us a list of reflections that might be longer than we want, or than we can achieve during rendering. So there should be an algorithm to select a fixed number of reflections from the list we obtained in the first place, and in the last step, we of course have to estimate the late reverberation. I'm doing that based on the residual energy, which means the energy of the impulse response excluding the contributions that we consider to be early reflections. The difference is: these guys we detect here, we will render them directly using head-related impulse responses, which means we have to do convolution, which is rather expensive, whereas for the late reverberation there are computationally more efficient ways of doing that. So let's look at the first step. How do we estimate
the time of arrival. I was using an onset
detector that was previously suggested here
and I won’t go into detail, but it was shown to
be quite robust and provide spatially smooth estimates
for the time of arrival. Then for the amplitude, I simply did a root-mean-square average. So here we have our pressure signal, and what's important are the time bounds used for that. I start half a millisecond before the maximum associated with this onset here, and I go up to one millisecond afterwards, which accounts for summing localization. So if I say reflection now, it's not a physical reflection; it will be an audible event, and it can have a certain extent in time, and this is the window I chose. The one millisecond is like a rule of thumb. It's about the time that summing localization works for most signals. It's a rather conservative estimate. This is how I estimate the direction of arrival, and I found a very complicated way to write it down. What it basically tells you: it's a weighted average. So I average the left-right angle, the lateral angle, with the squared impulse response as weights. I use the same time bounds, and I of course have to normalize for what I have here as weights, and this guy here has a range between plus and minus 90, so I can average it directly. The polar angle goes around the clock, so to say, so I have to apply a spherical averaging, which is why I use the pressure impulse response for the weights, take the absolute value with the angle in here, and then get the angle back out again, everything again within the same time bounds.
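A compact sketch of this per-reflection parameter estimation, i.e. the 0.5 ms pre / 1 ms post window, the RMS level, the weighted lateral average, and a circular average for the polar angle; the variable names and the circular-mean formulation are my own reading of the description, not the author's code.

```python
import numpy as np

def reflection_parameters(p, lat_deg, pol_deg, onset_idx, fs):
    """Estimate level and direction of one detected reflection (sketch)."""
    # window from 0.5 ms before to 1 ms after the peak associated with the onset
    peak = onset_idx + np.argmax(np.abs(p[onset_idx:onset_idx + int(1e-3 * fs)]))
    lo = max(peak - int(0.5e-3 * fs), 0)
    hi = peak + int(1.0e-3 * fs)
    w = p[lo:hi] ** 2                                  # squared pressure as weights
    level = np.sqrt(np.mean(p[lo:hi] ** 2))            # RMS amplitude
    lateral = np.sum(w * lat_deg[lo:hi]) / np.sum(w)   # plain weighted mean, range +-90 deg
    ang = np.deg2rad(pol_deg[lo:hi])                   # polar angle wraps around ...
    polar = np.rad2deg(np.angle(np.sum(w * np.exp(1j * ang))))   # ... so use a circular mean
    return level, lateral, polar
```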
>>I got one question.>>Yes.>>So for encoding, don't you need the ground truth for these [inaudible]? How do you get these angles?>>So the ground truth is the image source model in my case. So I generated spatial impulse responses to work on. For the image source model, I have the exact locations; they're just part of the model, and for the Triton simulations this is what I derived from the intensity. So I assume that these angles have to be available for the algorithm to work, but the use cases I have in mind usually have them. There are other use cases where they are very tricky to get.>>Do you do that integration because it can happen that you still have multiple reflections in that window? Because ideally you assume you have only one reflection.>>Yes, but as soon as we have a band limitation we will have a spread, so we don't have ideal impulses. With the image source model we have them, with the other we wouldn't. So this is one thing to account for, and then also, as we move further in time into the impulse response, there will be reflections that are very close together and some of them might get grouped. This is what you had
in mind, right? Yes.>>Does your front-back confusion normally happen at later points in time?>>So the funny thing about this is, I don't care about front-back confusion here. The reason being, I looked into the number of sources that we can perceive at the same time, and there has been a study that showed that we are rather good at perceiving sources at the same time if they have a lateral separation, and we are very bad at perceiving sources at the same time if they're on the same cone of confusion. So this just averages front and back, but it might have some funny effects if you're further in time away from the direct sound. We will see it in a second. So now let's get to the implementation
of the masking threshold. As I said, I wanted
to model this curve here, and I did so by having a slope, so a decay rate of one dB per millisecond, and an offset of minus 10 dB, so the curve starts at 10 dB below the level of the direct sound. This V shape here I tried to approach by adding 35 percent, and this number is just trial and error, 35 percent of the energy of all the detected reflections to this threshold curve, and this looks like that. So here we have our impulse response, and this is our threshold curve, and you can see there is a reflection that peaks above the threshold, so it's audible. That is why it has a red dot, and once we pass this reflection, 35 percent of its energy is added to the threshold, which is why it goes up. At least this is a very discrete thing, but it approaches this V shape that I wanted to model. Now you also see, and please ignore this red line, this is the echo threshold; there's an idea to that but I didn't use it so far. You also see this is not a line, it's an area, and this shows you the spatial dependency. So if we look at the masking threshold with respect to the lateral angle, this is our direction of the direct sound, and as the reflection moves away in space from the direct sound, our threshold drops and we become more sensitive to it, and I chose a depth of 10 dB here.
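Putting those numbers together, the threshold and the detection loop could look roughly like this; how exactly the 35 percent of reflection energy is accumulated and how the 10 dB lateral release is interpolated are my assumptions about the description, not the author's implementation.

```python
import numpy as np

def detect_reflections(levels_db, toas_ms, lats_deg, direct):
    """Masking-threshold test (sketch). direct: dict with keys
    'level_db', 'toa_ms', 'lat_deg' describing the direct sound."""
    audible = []
    masker_energy = 0.0                                  # accumulated 35% energies (linear power)
    for lvl, t, lat in sorted(zip(levels_db, toas_ms, lats_deg), key=lambda x: x[1]):
        # base threshold: -10 dB re the direct sound, decaying 1 dB per millisecond
        base = direct["level_db"] - 10.0 - 1.0 * (t - direct["toa_ms"])
        # raise the threshold by the accumulated reflection energy (the V shape)
        thr = 10 * np.log10(10 ** (base / 10) + masker_energy)
        # spatial release: up to 10 dB lower threshold for laterally separated reflections
        thr -= 10.0 * min(abs(lat - direct["lat_deg"]), 90.0) / 90.0
        if lvl > thr:
            audible.append({"level_db": lvl, "toa_ms": t,
                            "lat_deg": lat, "exceed_db": lvl - thr})
            masker_energy += 0.35 * 10 ** (lvl / 10)     # add 35% of its energy
    return audible
```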
So these are the results of the first step, so of throwing a spatial room impulse response in and trying to detect reflections. This is an impulse response generated with the image source model. The blue dots show you all the contributions that are not considered to be audible, and the red dots are the contributions that are considered audible, and the circles around them give you the direction that I assigned to them. You see in the early part there's almost a one-to-one correspondence between those guys, and here you see that at some point the reflections in the middle don't get detected anymore, and that is due to the spatial dependency of the masking threshold, which we can discuss at the end. As you go further in time, and this is what you asked earlier, there might be more than one red dot associated with a reflection that we assigned, because they are closely linked in space and time. This is how it looks if we throw the Triton simulation into the algorithm, so the band-limited simulation, and you see that if you try to compare left and right in the first part of it, there seems to be a good correspondence. Then, as we move in time, we get those problems related to low-passing the direction information, so what we get there might be a bit less reliable. You also see what you addressed, like the front-back thing. So for example, this guy here has lots of contributions from a wide angular range regarding the up-down position. So we now have a set of potentially
audible early reflections. I tuned the parameters for a couple of things. So the idea is there shouldn't be too many reflections detected after the mixing time, which is this line here. I wanted to detect the floor reflection because it was deemed important in lots of earlier studies. I wanted to detect at least 10 reflections because it's the first two-digit number and people don't ask questions if you do that. So it's brute force, but I also did informal listening and that seems to be a reasonable limit. So now the next step would be to pick a fixed number of reflections
from the set we ended up with. So in this plot, I only show you the reflections
that are potentially audible. So everything else is
already out of the equation. I tried three very simple
methods for selecting 1, 2, 3, or 4 reflections. So the first idea, also called "first", is very convenient: it will simply select the first reflections, in this case the first four reflections. What might be critical with that: the good thing is it selects the floor reflection; the bad thing in this case is that it will select three reflections to the left, which might cause an imbalance in the rendering if there is no reflection from the right included. We also see here that it picks a reflection with a relatively low level, although shortly after there's a reflection with a larger level. So, maybe not a good idea. The next thought was to pick the reflections that peak above the masking threshold the most. So I check how far each one is above the masking threshold and pick the four that are the highest above it. The good thing: we get a more balanced left-right contribution here. The bad thing: we don't pick the floor reflection. Next iteration, again very simple: pick the loudest reflections, and yes, we get the floor reflection, and yes, there seems to be a rather balanced selection in the left-right sense. I checked that for more than only this picture I showed you, and this seems to be working in most of the cases. So just from looking at it, "loudest" would be what I prefer at this point.
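The three selection strategies are simple enough to state as code; the dictionary keys below match the hypothetical field names from the detection sketch above.

```python
def select_reflections(reflections, n, method="loudest"):
    """Pick n reflections with one of the three strategies from the talk (sketch)."""
    if method == "first":
        ordered = sorted(reflections, key=lambda r: r["toa_ms"])       # earliest arrivals
    elif method == "exceed":
        ordered = sorted(reflections, key=lambda r: -r["exceed_db"])   # furthest above threshold
    elif method == "loudest":
        ordered = sorted(reflections, key=lambda r: -r["level_db"])    # highest level
    else:
        raise ValueError(f"unknown method: {method}")
    return ordered[:n]
```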
The last thing I had to consider was modeling the late reverberation. I said earlier I was doing that based on the residual energy. So these guys here were selected as audible reflections. The gray thing here, sorry for not mentioning it, is the room impulse response, and the remaining gray part is the residual energy. I calculated the block-wise RMS power of that energy, which is the dashed yellow line here. Then, starting at the point right after the last reflection I included, I fitted a line to it. This is the log, the decibel (logarithmic) representation, so I can fit a line through it. If we look at the impulse response in the non-logarithmic domain, it will be an exponentially decaying curve. This is how the left channel of the parametric binaural impulse response looks: we have the direct sound, a couple of reflections, and then we have decaying noise. However, you see that there is a reflection here that was discarded, and we lose the energy of that reflection if we do that. So a second approach I took is to fit a second line here, so that if I integrate the energy from here to here, it will have the same energy as the reflection I discard. This is how this looks, and I will refer to this as a single and a double ramp.
large set of conditions, I rendered parametric
impulse responses using no reflections which
means only direct sound 1, 2, 3, 4 and so on up
to 15 reflections. I used all my three selection
methods first, exceed, and loudest and I used both reverberation methods
so single and double ramp. I also did that for
different audio content. I think I had speech, castanets, and noise in there. Then, I compared the parametric
impulse responses to the reference and the reference was given from the image source model decors
that has no band limitation. I’m already in quite far in time, so please ignore this. I will only tell you the output. So I looked at the energy mismatch
between the reference and the parametric
representation and it seems that the algorithm preserves
the energy quite good. In extreme cases, they might be slightly audible differences
but I think this is really neglectible for most
application in most cases. There’s only at least
from this rough analysis, only minor differences
across the algorithms. This also holds for the
spectral differences which will cause timbre differences
between or to our reference. They will also be slightly
audible in some cases and there’s minor differences between the different
parameterizations. Same holds for the
interaural level difference. This will to some degree influence
the left-right direction. So there might be
some unwanted things going on there but they
are again very subtle. I also looked at
differences between or in the intraoral cross-correlation
and what this will tell us. So the integral cross-correlation
or our auditory system uses it to get an estimate of the
source width and of the envelopment to detect if there
are reverberation everywhere. Again, the intraoral cross-correlation
is matched pretty good in this case but there will be audible
differences in extreme cases. As everywhere difference between
algorithm were rather small. So I followed up with a
Perceptual Evaluation and this will give us more
interesting results, I had a good sample size
I would say and lots of experience subjects which
means either producing, doing music or used to evaluating
music, evaluating 3D audio. So I hope to get good
and reliable answers for them and for the
listening test as always, you have to narrow
down your conditions because you can’t ask subjects
to rate for 10 hours. So I ended up using no reflection, one and six reflections in
the parametric rendering, I had an additional control
condition termed 6ISM, and what this is from
the image source model, I just selected the six
first-order reflections. So the four reflections
from each wall, the floor and the ceiling reflection, and just pretended that those
reflections were detected by the algorithm that I have and then they were applied to
the same parametric rendering. I only used the loudest method for selecting because in informal
listening to all of them, I found that this
showed best results, same for the doublet
slope method that show up slightly better results than
only using a single slope. I did the aviation for
running speech signal, an echoic speech convolved
with my impulse responses, and most important, all the
parametric representations that ran through my algorithm were rated against the
reference that we have. So these are the results. At first, I asked my subjects
to rate the overall difference, so just listen for any differences they could hear, and they did that in all the nine rooms: large, medium, small, and the different degrees of reverberation. This is why there are nine boxes here, and here are my rendering methods. There are some interesting trends. So we see that in general the differences between the parametric audio and the reference decrease with the number of reflections; you always see this slope. And maybe what I should have started with: differences in most cases are rather small. So zero means no difference and one means a very large difference, and we see that in most cases we are below this 0.5 line here; only in one case do we exceed it by quite a bit. Then on average, so
these lines here show you the group median and these
are the confidence intervals, 95 percent confidence intervals. On average, the 6ISM method
is always rated zero, which means no difference
was perceived. In some cases, my suggested
detection algorithm is very close, in some cases, it is not. The reason being might
be the following: my algorithm doesn't detect the ceiling reflection because it wasn't geared towards it, and this might be an important reflection which we should consider. The lucky thing is that it can easily be tuned to detect this reflection by fiddling around with the parameters a bit. Then we see that the differences decrease with the reverberation time. So if we are in an environment with very low reverberation, differences are large; if we are in an environment with very large reverberation, differences tend to get smaller, because what we perceive is dominated by the late reverberation. The effect of volume, so of the room size, is a bit hard to assess; we don't see a clear trend here,
and these results are very fresh. They are not even a day
old so I didn’t run proper statistical analysis
on them that might reveal an effect of
the room. We will see.>>Question.>>Yes.>>So I saw the listening test software earlier. I recall a question about the difference, where the user had something like a minus-two-to-two scale. So how did you calibrate across the different subjects? Did you leave everyone on the scale that they used, or did you normalize, like set their max to one or 0.5 or something, so that they are normalized in this way?>>No, there's no data normalization. So for the difference, the scale went from 0 to 2, so "no" to "very large" difference, and I just normalized here to one, so I divided the ratings by two; this is more convenient. Apart from that, there
is no normalization. What I tried to do in the training, I played two audio contents. One showing very large
differences and one showing very small differences
to prime the subjects. But apart from that, I did not tell them this is large, you should rate it with
the maximum differences. I left it up to them
and of course this will to some degree
increase the variance here. But there are statistical methods for the analysis that can at least
partially take that into account. I assume that it usually has an effect and we will
see that hopefully. So the next thing I did is asking the subjects for a detailed
qualitative evaluation, so for differences with
respect to timbre, tone color, left-right direction, up-down direction, perceived distance and everything with
respect to the reference. I did that only for one
room because otherwise I had to lock them in there for five hours and I did
not want to do that. But from informal listening, sorry, first things first. What we see here: for almost each and every one of these qualities, the differences get smaller as we add more reflections. You see that here. So this is again a very nice result, and again the 6ISM method is on median, or on average, always rated zero, but in most qualities my method is closer here than in the overall difference, which is why the confidence intervals of these guys here overlap with zero. There are generally small differences with respect to all the tested qualities. And finally, the loudness doesn't seem to show a difference. I included this only for validation: for the listening test I did a loudness equalization just based on the root-mean-square power, and admittedly the level differences were larger than I wanted them to be. But even without that, the loudness wouldn't have played a big role; I did the equalization to make sure. Yes?>>For externalization, it
seems to be no difference. So your reference was the binaural room impulse response from the start?>>Yes.>>From what room impulse response did you then extract the parameters for your algorithm?>>From the same.>>Just from one channel?>>Sorry, yes. I should have mentioned that. That's a good question. So I have the spatial room impulse response that falls out of the image source model, and to get my reference, I just convolved each of those, the direct sound and all reflections, with a head-related impulse response. Then the late reverberation is a two-channel reverberation with the binaural coherence that you would expect in a diffuse sound field. So we have an impression of how this works, let's sum up. There is now an end-to-end
system for parametric audio. There have been many before and there will be many after; this one includes early reflections. It was evaluated for two different simulation models, the image source model and the Triton simulation, and the results, at least from looking at them, appear to match very well. I did not include the Triton simulation in the listening test for a couple of reasons, also because of time. As I said, the detected reflections agree across those models. The early reflections seem to be most important in dry rooms, maybe in large rooms, but that was the effect that I couldn't really show, on the raw data at least. The differences decrease with an increasing number of reflections that we include, and six early reflections seem to be sufficient in most cases. The floor reflection seems to be important; the ceiling reflection might be as well. So what did I contribute? We have now this algorithm that can be tuned, and maybe also applied to different problems, for detecting and selecting
early reflections. We have this double sloped
parametric late reverberation, which seems to be slightly favorable compared to a
single-sloped reverberation. It doesn't take too many early reflections to trick our brain. Six reflections is not much considering that there are, depending on the room, hundreds of them before we get to the mixing time, so hundreds of reflections that are in the region where you might still be able to perceive directional information. The way I think about it is the following. If we are in a room, what our brain, the central auditory system, essentially does is try to suppress the reflections, because we don't like them. If we didn't suppress them, we would have a hard time understanding each other, we wouldn't really enjoy listening to music in rooms; everything would be worse. And then what the algorithm does is try to provide the most important reflections, those that the brain doesn't really succeed in canceling out, and these reflections provide the sense of source width, distance, or the qualities that we saw. For the other reflections, you could see it as just providing the brain with the cues it would end up with anyhow, which is the correct energy at the correct time, and the correct cross-correlation, for example. The pipeline I did might be included in the Triton workflow with some work, especially on the
real-time low memory things. But I think in general
that would be possible. Currently, Triton, as far as I know, uses a finite impulse response for the late reverberation that has some generic peaks in there to mimic early reflections. This approach here with the two slopes, I think, should be implementable as an infinite impulse response, so being less computationally demanding and hopefully of higher perceptual quality. I think there's some evidence for that, although admittedly the differences might be small. But considering that it doesn't cost too much, it might be worth a shot. For the outlook, of course, we want to model non-empty rooms. We are rarely in empty rooms, and this is an example of that. I simulated this room in Triton, and these are the
reflections I detect. Again, I think at least
for the early part, it looks somewhat sane, but there has to be more evaluation to show that. We wanted to have a spatially smooth distribution. So if I jump back and forth, you see impulse responses generated at different distances in the room. If you look at how the reflections change (they are one meter apart, relatively close to the source), there can be some changes,
dots here move in a way that might be perceptually smooth
if you imagine you have more measurement positions or
simulation positions in between. But also, sorry, if you look at this guy here, this is a reflection that becomes audible if I change the position. So due to the definition of a threshold, and I don't really think there's a way around it, there will be reflections that appear and disappear depending on your position. Because they appear and disappear at the threshold, so at the point where the contribution isn't so important, my first idea would be that we should not be able to detect the deleting and inserting of those reflections, but that, of course, has to be proven. We want, of course, to apply this to outdoor
environments as well. This poses the problem that we
have to simulate a larger volume, which means we have a lower
simulation frequency, because otherwise our memory would explode. However, the good thing is, in these environments, we can also expect sparser reflections, or more spacing between them. So a lower cut-off frequency might not be such a drawback there. What the results also tell us: differences were largest when there's low reverberation, and this is usually the case in outdoor environments. So it might be important here to try to get our early reflections correct. Then the parametric rendering: it was a very basic thing, and there might be things that can be improved. It might be a good idea to go for the first-order reflections, because the 6ISM method performed best. But then, as we enter a room that has a desk, has a chair, has people, who knows if that still holds? So this also deserves more attention. We want the encoding to be of short memory, but I think it can be done. For now, my algorithm just ran for 200 milliseconds, which will hardly be necessary; I just ran it that far because I had the time. So we might want to find a measure that tells us, I'm done, basically, stop doing this, you have enough reflections, and there might be approaches to that. It might also be interesting
to not only consider the direction of a reflection but also consider its directional spread. For example, if I duck behind my desk and then talk, you can still hear me. But now the sound is coming from around this thing and I might sound a bit wider; I might also stand behind a large wall. So including the directional variance, just by means of calculating a spatial variance, and then considering it in the rendering, which might be done by not using a single-point HRTF, but a pre-computed set of HRTFs that try to mimic a certain spread. Of course, it's not frequency-dependent yet, and it doesn't have directional late reverberation; in outdoor environments it might be more reverberant coming from an open door than from the forest on the other side. But for all these things, there are already methods that could be used. So thanks, Ivan, for having me, and Hannes for mentoring
and critical feedback. I hope you have a good wedding. They're not marrying each other, but they're both at the same place. Then Nikunj for feedback and help on using Triton, and the same for the entire Acoustics Team for feedback and help there. Thanks to the Acoustics Group and all the interns for the good time I had. And yes, thank you for participating in the listening test. I know it was hard, but I hope you agree that you made a valuable contribution to this.>>So I just wanted to share some related experiences we
have that might help you think through additional ways
to talk about your results. In working with users in the real world who are using the spatial audio that we're building, we found that the perception of the height of an object in
an augmented reality or virtual reality space is
difficult to tell sometimes. But then, we narrowed it in on
specifically not simply the, say, precisions like plus or minus
5-10 degrees or something, but even something much more
basic which was the ability to tell the difference between something above the horizon and
below the horizon. We frequently encountered really just this basic feedback that
said, everything sounds up. Nothing sound below. I can’t even just
hear anything below. So in order to
investigate that further, we did do a small pilot study where we did a two choice
discrimination task, is this sound source above or below? So in your results, I saw that, and you
mentioned that, yes, the effect on the vertical precision or accuracy might have been
listed as relatively small, but I wonder if you zero in on specifically two choice
discrimination between up or down you may find more
dramatic difference, the ability even to just
hear anything below, at least we found out with a
few subjects and I have been curious whether a large
study might even that out.>>Yeah, that’s a
really important point. So I did not do that in my study because it’s all rated
relative to the reference. So we only see that there’s almost no mismatch in the up-down directions as soon
as we have one reflection. In theory, the sound was coming
from directly in front of you. So zero degree elevation. But it’s still might
have been the case that subjects perceived everything
is coming from 10 degrees up. So I don’t know that. In this case, the possibility
that we have at least, in virtual reality, we have the
visual representation which helps. Apart from that, it gets
difficult to achieve it, I think, in a physical correct way. Even if we have highly
individualized HOTFs, this might still be a
problem if we don’t see anything because then we
consider heights out of you. It might be there, I don’t see it. It can’t be there because
I’m standing on the floor. I think that might be
an effect in play here and I think two guys from
the Graz University, Mathias Funk and France Sata, at a small German Conference, they had one paper where
they tried to go around that by simply applying a high
shelf filter to the HOTF. So reduce high frequency content if something is above and increase
it if something is above, sorry, reduce it if it’s
below increase it if it’s above, work quite well. I think that’s also a nice
way of thinking about it. We as engineers and we tried to understand everything,
we have to do this. I’m very prone to that. But if we go to user interaction, the way to a good solution
might be sometimes shorter if we just go for effects.>>Yeah. So I found localization
is mainly only spectrally. So that’s the [inaudible]
that has this in this book and it basically does a region
around 12 kilowatts I think. So just put a filter increasing 12 kilowatts
and you will perceive. So it’s above, it’s very easy. Humans are just really bad, so you have a really large
confusion between up and down localization because I think
we’ve just tried to do that. Usually, it’s a very rare situation that you have something
coming from the floor. So usually, these sources are above. It’s rare that you stand on a chair or something and some
source comes from below. So it’s a rare event
that we perceive. So I think we just have
trained to all these events.>>From a VRAR point of view, I would say that it’s
a feature not the fact that you can put some visual things. There’s the priors if you had something on top
but at the same time the variance is so big that anything
can actually be aware of it. So that’s something that’s more audio visual
perception [inaudible].>>A very stupid way to solve it is to ask the
subjects to roll their head. I guess that’s it.
Thanks for listening.
