When PEARL was asked to create a sound for the acoustic vehicle alert system (AVAS) that is being required for e-scooters in London, it was clear that the project would be an example of the way we approach our work at PEARL. The task was not to create just any sound, but one that would work well with how human hearing works and what it does best. This post is about how we set out to investigate this problem, as an illustration of our approach to our research. We also research the mixed use of urban space and the challenges this poses for people with different capabilities - but that is a story for another occasion!
The human hearing system evolved for the savanna, which, as you can see from the image above, is characterised by big wide horizons, little vertical structure - and a very low background sound level (about 20 dBA). In that environment, Homo sapiens succeeded if they could identify a tiny sound - say, the snap of a twig - at a considerable distance, located somewhere around them and more or less at ground level. That snap might indicate the presence of a predator or a meal - either way, survival depended on detecting it in time to take the necessary action. The hearing system that succeeded is geared to this challenge. It was not necessary to have a hearing system that could hear very high or very low frequencies, or very loud noises (because there weren't any).
So we listened to the sound of a twig snapping. What does it consist of? Well, it starts with a short high-frequency 'crack' sort of sound, lasting just a few milliseconds, followed by a slightly longer broadband sound covering multiple frequencies, and then it stops. It is not very loud either. Yet it could signify a meal, or becoming a meal, so it is very important. The brain takes this tiny signal and passes it through a special region called the superior colliculus, where datastreams from both vision and hearing are processed together - this is how you know where to look when you hear a sound that is outside the visual field - and then it passes into the cortices to sort out what kind of sound it is (and whether its source looks like a dinner or a diner). This system works extremely well in the context of a fairly constant and quiet background, where such a tiny sound would be quite distinctive. How does this play out in the context of an e-scooter in a busy 21st century city?
The first difference is that constant background sound. In most cities, including London, this is a low-ish rumble with a basic overall sound level of around 65 dBA, so it is much louder than the savanna. Its frequency range is quite consistent, though: between around 20 Hz and 2,000 Hz, with a peak loudness at around 1,000 Hz. This meant that we needed a sound that could be heard against that 65 dBA background and that took this loudness distribution into account.
If you want to attract the attention of someone without making too much noise, you might make a sound a bit like "Psst". This sound has similar characteristics to the snapping twig: a short, well-defined start, the "P" (we call the start an "onset"); a broadband sound that lasts a little longer, the "ss" (called the "body"); and a well-defined end, the "t" (called the "offset"). If you wanted to repeat the sound, there would be a short gap as you re-form the "P" to start the repeat. What about making a sound with similar characteristics?
We tried a large number of different sounds with different characteristics, including these, using the ability of a person to locate the sound as a proxy that would indicate to us how well a person could distinguish the sound. The "Psst"-like sounds worked better than the others, but for a really improved performance we added a new idea. Two ideas came together for this.
First, the human hearing system is very good at locating sounds by evaluating the difference in arrival time of a sound at each ear. It is possible to trick this capability though. If we play two sounds simultaneously, with the listener positioned exactly the same distance from the two sound sources, they will believe the sound is located halfway between the two sources (this is the basis for stereo audio systems). If one sound is played 0.1 milliseconds before the other, the person will think the sound is coming from a point very slightly nearer to the source that sounded first. As the difference is increased, up to about 1 millisecond, the perceived location moves towards the source that played first. At 1 millisecond difference most people will say that the sound is coming from that sound source. They will continue to believe that until there is a 3 or 4 millisecond gap between the two sounds, when they will realise that there are in fact two sounds (until this point, they will only 'hear' one sound).
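The relationship between delay and perceived location described above can be written down as a small sketch. This is only an illustration of the description, not a model we fitted: the function name `perceived_source`, the straight-line movement between 0 and 1 millisecond, and the choice of 3 milliseconds (rather than 4) as the point where two sounds are heard are all assumptions made for the example.

```python
def perceived_source(delay_ms, fusion_limit_ms=3.0):
    """Sketch of where a listener places two identical sounds played delay_ms apart.

    Returns (position, fused):
      position 0.0 = halfway between the two sources,
      position 1.0 = at the source that played first,
      position None = the listener hears two separate sounds.
    The linear ramp and the 3 ms limit are assumptions for illustration.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be zero or positive")
    if delay_ms >= fusion_limit_ms:
        # Beyond the limit the listener realises there are two sounds.
        return None, False
    # Up to about 1 ms the location slides towards the leading source,
    # then stays there until the two sounds separate.
    position = min(delay_ms / 1.0, 1.0)
    return position, True


for delay in (0.0, 0.1, 1.0, 2.0, 4.0):
    print(delay, perceived_source(delay))
```

With these assumptions a 2 millisecond delay still counts as a single fused sound located at the leading source, which is the window the next idea makes use of.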
Secondly, we realised that the "Psst" sound includes a gap between the "P" and the "ss", as air is allowed to escape from the mouth when the lips part to produce the "P" sound. As it is not possible to do that and produce an "ss" sound at the same time, there is a very tiny gap of silence.
So we put a 2 millisecond gap between the onset and the body of the sound. When we did this, the detection rate of the sound was much higher.
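To make the structure concrete, here is a minimal sketch of one such pulse built as a list of audio samples: an onset, 2 milliseconds of silence, then a body. Only the 2 millisecond gap comes from our work as described here. The sample rate, the 3 millisecond onset, the 40 millisecond body, the 4,000 Hz onset tone, and the use of plain unfiltered noise for the body are placeholder assumptions, and `make_pulse` is a hypothetical name, not the actual sound design.

```python
import math
import random

SAMPLE_RATE = 48_000  # samples per second (assumed)


def ms_to_samples(ms, rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * rate / 1000)


def make_pulse(onset_ms=3.0, gap_ms=2.0, body_ms=40.0, onset_hz=4000.0, seed=0):
    """Build one onset + silent gap + body pulse as a list of floats in [-1, 1].

    All durations except the 2 ms gap are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_onset = ms_to_samples(onset_ms)
    n_gap = ms_to_samples(gap_ms)
    n_body = ms_to_samples(body_ms)

    # Onset: a short high-frequency burst (the "P").
    onset = [math.sin(2 * math.pi * onset_hz * i / SAMPLE_RATE) for i in range(n_onset)]
    # Gap: pure silence between onset and body.
    gap = [0.0] * n_gap
    # Body: broadband noise (the "ss"); unfiltered here for simplicity.
    body = [rng.uniform(-0.5, 0.5) for _ in range(n_body)]
    return onset + gap + body


pulse = make_pulse()
print(len(pulse), "samples,", 1000 * len(pulse) / SAMPLE_RATE, "ms")
```

At 48,000 samples per second the 2 millisecond gap is only 96 samples long, which gives a sense of how small the silence is.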
We also used the loudness distribution of the urban background to our advantage. We could set the frequency range for the body of the sound (the "ss" bit) so that it would lie in a part of the frequency spectrum where the background sound was less loud. The human hearing system can detect sounds across the whole of its audible spectrum (between 20 Hz and 20,000 Hz in a healthy young person, maybe 20 Hz to 8,000 Hz in an older person) all the time (this is how we could be alerted to twigs snapping in good time), so parts of the spectrum that are not occupied by the background noise could be available for the scooter sound. This means that the sound does not have to be so loud. In fact we found that people immersed in the urban background noise (you can see the example of the acoustic 'dome' we used to simulate actual urban noise in 3 dimensions in the photograph on the right above) could hear the e-scooter sound even when it was playing more quietly than the average background sound level.
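The idea of looking for the quieter parts of the background spectrum can be sketched in a few lines. The band levels below are placeholder values chosen only to mirror the shape described above (loudest around 1,000 Hz, falling away above 2,000 Hz); they are not measurements from our experiments, and `quiet_bands` and the 10 dB margin are hypothetical choices for illustration.

```python
def quiet_bands(background_db, margin_db=10.0):
    """Return the band centres whose level is at least margin_db below the loudest band."""
    peak = max(background_db.values())
    return [band for band, level in background_db.items() if level <= peak - margin_db]


# Band centre frequency (Hz) -> background level (dB).
# PLACEHOLDER numbers for illustration only, not measured data.
background = {
    125: 55.0,
    500: 60.0,
    1000: 62.0,
    2000: 57.0,
    4000: 46.0,
    8000: 40.0,
}

print(quiet_bands(background))  # candidate bands for the body of the sound
```

With these placeholder levels the candidates are the bands above the urban rumble, which is where a body sound could sit without needing to be loud.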
Because the start of the sound is so important for its detection, we need to have lots of onsets, so the sound has to be repeated. If the repetitions are too slow, the sound would not act as an alert (think of how relaxing the "tick tock" of a grandfather clock sounds). If they are too fast, it would raise anxiety too much. So we set the rate at 4 times per second - approximately the resonant frequency of the human nervous system - so that it would feel important but not worrying. The people who participated in these experiments had a wide range of different sensory capabilities, and we examined both their reported sensations and how their body and brain responded to the sounds.
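As a small worked example of that repetition rate: at 4 repeats per second a new onset arrives every 250 milliseconds. The helper below simply lists the onset times for a given duration; the name `onset_times_ms` is hypothetical and the sketch says nothing about how the real device schedules its sound.

```python
def onset_times_ms(duration_s, repeats_per_second=4.0):
    """List the start time (in ms) of each repetition within duration_s seconds."""
    period_ms = 1000.0 / repeats_per_second
    count = int(duration_s * repeats_per_second)
    return [i * period_ms for i in range(count)]


print(onset_times_ms(1.0))  # [0.0, 250.0, 500.0, 750.0]
```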
So this is how we "synchronised the vibes" around an alert system for an e-scooter. Now it is being tested to see if it can be produced in the harsh world of real challenges in urban areas. It is illustrative of the approach we take at PEARL - we start with the person and how they engage with the world and then work out how we can make that as sympathetic as possible.