Exercise Session 1
Listening Tests & Basic Speech Processing
The goal of this session is to make you familiar with some aspects of speech acoustics (loudness, timbre, echoes, frequency sensitivity) with the aid of some listening tests and some signal processing tools. We also introduce some basic aspects of speech analysis, such as digitizing, framing and short-time frequency analysis. In this session the focus is on intuition. A more rigorous mathematical analysis and more advanced speech analysis tools will be presented in the next session.
Some listening tests and demonstrations are presented with a number of questions. There are two kinds of questions: basic questions and more advanced (technical) questions. The basic questions are preceded by a filled disc, while the advanced questions are denoted with a square. The advanced questions are "optional" and some require an engineering background. Read them carefully and take the necessary time to write down your answers to these questions.
The first part of this exercise session consists of some listening tests. They are selected from an audio CD with auditory demonstrations created by the Institute for Perception Research (IPO) in Eindhoven, The Netherlands.
Make sure that the audio device is configured properly before you start the demonstrations. Please ask for help if problems occur.
1. The full dynamic range of the auditory system is about 120 dB. However, we need much less for day-to-day use. From the above examples, what level difference can you have between a soft and a loud speech signal such that both are comfortable and understandable without excessive effort from the listener?
Psychoacoustics refers to the interpretation of sounds by the human hearing system. Hearing involves a behavioral response to the physical attributes of sound, including intensity, frequency, and time-based characteristics, that permit the auditory system to find cues that determine the distance, loudness, pitch, and tone of many individual sounds simultaneously. The perception of sound or music may differ between individuals. In this exercise you explore three features: loudness, pitch and timbre. In addition to some listening tasks, you do experiments with some speech and signal tools to discover which signal characteristics are responsible for which percepts.
Loudness is the subjective perception of the intensity of a sound, in terms of which sounds may be ordered on a scale extending from quiet to loud. In Exercise 1, you heard sound files with different intensities, expressed in dB. The intensity refers to the energy in the sound (or the amplitude of the sound) and is described by the sound pressure level (SPL, generally in dB). The SPL is a purely physical quantity.
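As a small illustration of the dB scale used for SPL, the sketch below converts an RMS sound pressure to dB SPL and expresses an amplitude ratio in dB. It assumes the usual reference pressure of 20 micropascal in air. The function names are hypothetical and are not part of the course tools.

```python
import math

# Assumed reference pressure for dB SPL in air: 20 micropascal (RMS).
P_REF_PA = 20e-6


def spl_db(pressure_rms_pa):
    """Sound pressure level in dB SPL for an RMS pressure given in pascal."""
    return 20.0 * math.log10(pressure_rms_pa / P_REF_PA)


def level_difference_db(amplitude_a, amplitude_b):
    """Level difference in dB between two amplitudes (or RMS pressures)."""
    return 20.0 * math.log10(amplitude_a / amplitude_b)


# A pressure equal to the reference gives 0 dB SPL;
# a pressure 100 times larger gives 40 dB SPL.
print(round(spl_db(20e-6), 2))
print(round(spl_db(2e-3), 2))
# Doubling the amplitude adds about 6 dB.
print(round(level_difference_db(2.0, 1.0), 2))
```

Because the scale is logarithmic, every factor of 10 in pressure adds 20 dB, and every doubling adds about 6 dB.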
For pure tones (sinewaves) the relationship between both units is given by the curves of equal loudness. By definition, the equal-loudness curve for x phons connects all sounds with different frequencies that are perceived equally loud as a sound of x dB SPL at 1000 Hz (1000 Hz was chosen to be the reference frequency). In this tool you can construct some equal-loudness curves yourself.
Pitch is the percept that lets us rank signals on a musical scale. For pure tones, pitch corresponds very well to the frequency of the tone. For complex tones, pitch typically corresponds to the main periodicity observed in the signal. As this may not be unambiguously resolved, the ear has a preference for a pitch (interpretation) in the range of the human voice.
Type "timedom" to launch the demo. When the window appears, use the load menu to create signals. The opening page offers five types of signal: tone, noise, impulses, harmonics, and resonances. For each type you can set some parameters to obtain different shapes. You can also add signals together using the "add" button.
Timbre can be defined as "that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar...". According to this definition, timbre is the subjective correlate of all those sound properties that do not directly influence pitch or loudness. These properties include the sound’s spectral power distribution, its temporal envelope, rate and depth of amplitude or frequency modulation, and the degree of non-harmonicity of its partials. The timbre of a sound therefore depends on many physical variables.
In the following demonstrations, one can hear the influence of spectral make-up on the perceived timbre of sounds of musical instruments.
Make your own signal. Launch "timedom" again and go to the signal creation page.
First make a toy signal consisting of a 300 Hz tone and listen to it. Then add a resonance with the same frequency and a bandwidth of 20 Hz. Listen again. The resonance roughly plays the role of an attack.
Down-sampling: Go to the signal and "create" the tone. Set the duration to 1000 ms and the sampling frequency to 8000 Hz, so that there are 8 samples in each ms. Try pure tones of 3000 Hz, 4000 Hz and 5000 Hz. The top-right figure shows the spectrum of the signal. Create a signal with five harmonics and a fundamental frequency of 150 Hz. Set the duration to 100 ms, and select "done".
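The effect of the sampling frequency on the three test tones can be previewed with a short sketch. It computes the frequency at which a real tone appears after sampling (folding around half the sampling frequency) and checks that a 5000 Hz and a 3000 Hz cosine give the same samples at 8000 Hz. The function names are hypothetical, and the tool's own signal generator may differ in phase conventions.

```python
import math


def apparent_frequency(f_hz, fs_hz):
    """Frequency in the range 0..fs/2 at which a real tone of f_hz
    appears after sampling at fs_hz (frequencies above fs/2 fold back)."""
    f_mod = f_hz % fs_hz
    return min(f_mod, fs_hz - f_mod)


def sample_cosine(f_hz, fs_hz, n_samples):
    """First n_samples samples of a unit-amplitude cosine of f_hz."""
    return [math.cos(2.0 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]


fs = 8000
for f in (3000, 4000, 5000):
    print(f, "Hz tone appears at", apparent_frequency(f, fs), "Hz")

# The sampled 5000 Hz and 3000 Hz cosines are numerically identical at fs = 8000 Hz.
a = sample_cosine(5000, fs, 16)
b = sample_cosine(3000, fs, 16)
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-9)
```

A tone exactly at half the sampling frequency (4000 Hz here) is a borderline case: whether it survives depends on its phase relative to the sampling instants.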
The last listening test, i.e. the Audiometer, and the whole second part are based on a set of speech analysis tools (MAD: MATLAB Auditory Demonstrations) developed in MATLAB by the Speech and Hearing Research group at the University of Sheffield (United Kingdom).
You need to open MATLAB with this command: "~spchstud/SP". For the demonstrations in MATLAB, instructions will be given in the exercise session on how to set up these tools. Execute the commands that will be proposed in the text in the given MATLAB window.
Audio Configuration
Exercise 1: The Decibel Scale
Things to investigate
2. Explain the difference in acoustic behaviour between an anechoic room and the computer lab that you are in. Would you dare to estimate the SPL differences when someone is talking to you from 25, 50, 100 and 200 cm in this room?
3. What is the theoretical dynamic range that can be captured using 16-bit quantization, as on a standard CD recording?
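For question 3, one common way to estimate the dynamic range of a uniform quantizer is the ratio between the full-scale range and a single quantization step, expressed in dB. The sketch below follows that definition only. Other definitions (for example the signal-to-quantization-noise ratio of a full-scale sine wave) give slightly different figures. The function name is hypothetical.

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale (2**bits steps) to one quantization step, in dB.

    This is an assumed definition for illustration; it amounts to
    roughly 6.02 dB per bit.
    """
    return 20.0 * math.log10(2 ** bits)


for bits in (8, 12, 16):
    print(bits, "bits:", round(dynamic_range_db(bits), 2), "dB")
```

Compare the value for 16 bits with the roughly 120 dB range of the auditory system mentioned in question 1.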
Exercise 2: Psychoacoustics
Introduction
Part one: Loudness
Introduction
This demonstration discusses the relation between the intensity and the loudness of a sound. How loud a sound is perceived by a human, i.e. its loudness, is expressed in phons. The phon is a psychophysical unit.
Part two: Pitch
Introduction
The signal appears in the top-left panel, and the top-right panel shows its spectrum, which will be discussed later on. Set the sampling frequency to 8000 Hz.
1- Make a pure tone with following features:
Frequency=200Hz, Phase=0, and Amplitude=70dB
Set the "duration" to 1000 ms and add the signal to the panel. You can listen to your signal by clicking anywhere on the signal.
2- Make the resonance signal with the same frequency and amplitude as before. Set the bandwidth to 1 Hz and listen to the signal with the same duration. Redo this with increasing bandwidths of 2.5 and 10 Hz. What do you observe?
Note that you hear the same frequency and your perception is the pitch.
Do you perceive different pitches? Which one is higher? You may repeat the listening task several times in quick succession.
3- Go back to the first signal. Now add, in sequence, a number of additional frequencies: e.g. 400 Hz, 300 Hz and 250 Hz. Choose somewhat different amplitudes in the range of 50 to 80 dB. - Listen to them. How does the pitch change? - What is the period observed with each of these signals? (This is easier to observe if you create the same signals again, but for a shorter time window, e.g. 100 ms.) - Is there any difference in the pitch you observe when you change the amplitudes of the respective components?
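To check the periods you read off the waveform, note that a sum of sinusoids with integer frequencies repeats at the greatest common divisor of those frequencies. The sketch below applies this to the components suggested above, added one at a time. It describes only the waveform period. The perceived pitch need not follow it exactly and may depend on the amplitudes. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def common_fundamental_hz(freqs_hz):
    """Greatest common divisor of integer component frequencies (Hz)."""
    return reduce(gcd, freqs_hz)


def period_ms(freqs_hz):
    """Repetition period of the summed waveform in milliseconds."""
    return 1000.0 / common_fundamental_hz(freqs_hz)


components = []
for f in (200, 400, 300, 250):
    components.append(f)
    print(components, "->", common_fundamental_hz(components), "Hz,",
          period_ms(components), "ms")
```

Adding 400 Hz to 200 Hz leaves the period unchanged, while adding 300 Hz and then 250 Hz makes the waveform repeat more slowly.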
Part three: Timbre
Introduction
The concept of timbre plays a very important role in the orchestration of traditional music and in the composition of computer music. There is, however, no satisfying comprehensive theory of timbre perception. Neither is there a uniform nomenclature to designate or classify timbre. This poses considerable problems in communicating or teaching the skills of orchestration and computer score writing to student-composers.
Demonstration
Make five harmonics for 251 Hz (duration = 1000 ms). Listen to the signal again. Increase the number of harmonics and listen again. Use the full range, going from 1 to 20 harmonics.
B) Effects of Tone Envelope on Timbre
Demonstration
C) Effects of Reverberation
Demonstration
Exercise 3: Digitized Speech
Things to investigate
1- Listen to each signal. In which case(s) do you think we are losing the original information?
2- Keep the 4 kHz signal and change the sampling frequency to 8192 Hz. Listen again. Where is the peak in the frequency diagram?
Exercise 4: Time Domain processing
Things to investigate
1. Select 3 sizes and one shift: try window sizes of 20, 30, and 100 with a shift of 10. How does the structure of the plots change when the window size decreases or increases? Why? Compare the rectangular window with the Hamming window.
2. For the Hamming window, select "1 size, 3 shifts". Keep the size at 30 and use shifts of 2, 15, and 30. What is the influence of the window shift on the shape and information content of the energy plots?
3. Load the speech signal 2.wav. What do you notice about the energy and average magnitude plots? Do they reveal any information on the underlying phonemes?
4. Generally, the short-term energy is part of the preprocessing parameter set of a recognizer. Which values for window shift and size do you suggest? Why would you use overlapping frames?
5. Some speech events are very short (e.g. plosive bursts). Can the window size be larger than those shortest events? Don't we lose information?
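The framing experiments above can be reproduced outside the tool with a minimal sketch of short-time energy and average magnitude. It is an illustration under stated assumptions, not the MAD implementation. Window size and shift are given in samples, the window is applied to the signal before squaring, and the magnitude is a windowed sum of absolute values. All function names are hypothetical.

```python
import math


def rectangular(n):
    """Rectangular window of n samples."""
    return [1.0] * n


def hamming(n):
    """Hamming window of n samples (0.54 - 0.46 cos form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_features(signal, size, shift, window=hamming):
    """Return (energy, magnitude) lists, one value per frame.

    Frames start every `shift` samples and are `size` samples long;
    frames that would run past the end of the signal are dropped.
    """
    w = window(size)
    energy, magnitude = [], []
    for start in range(0, len(signal) - size + 1, shift):
        frame = [signal[start + i] * w[i] for i in range(size)]
        energy.append(sum(x * x for x in frame))
        magnitude.append(sum(abs(x) for x in frame))
    return energy, magnitude


# Toy signal: 50 samples of silence followed by 50 samples of a constant.
toy = [0.0] * 50 + [1.0] * 50
for size in (20, 30):
    e, m = short_time_features(toy, size, 10, window=rectangular)
    print(size, [round(v, 1) for v in e])
```

With a shift smaller than the window size the frames overlap, so the onset at sample 50 is spread over several frames. Larger windows smooth the contour more and blur short events.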