Speech Analysis Tutorial


Phonetics is one of the linguistic sciences. It is concerned with the sounds produced by the human vocal organs and, more specifically, with the sounds used in human speech. One important aspect of phonetic research is the instrumental analysis of speech, often referred to as experimental or instrumental phonetics.

The instrumental analysis is performed using one or more of the available instruments. These include X-ray photography and film, air-flow tubes, electromyography (EMG), spectrographs, mingographs, laryngographs, etc. The aim of most of these methods is to visualize the speech signal in some way, capturing some aspect of it on paper or on a computer screen. Today the computer is the most readily available and most widely used tool. With a computer the analysis process is much simpler and usually faster than with other tools; however, it does not necessarily produce a result of higher quality.

In this tutorial we will look at, and try to explain, the most common ways of analyzing and visualizing speech. The computer used is a Sun SPARCStation workstation with a Loughborough C30 board, and the analysis software is the ESPS/Waves+ package. Speech signals are recorded with a microphone and then digitized in the computer at a 16 kHz sample frequency. NOTE: The sounds in this demo are downsampled to 8 bits, 8000 Hz. You may want to widen your browser's window to view the images.
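The downsampling mentioned in the note can be illustrated with a short sketch. This is a minimal example, not the actual processing used for the demo: it assumes plain decimation and 8-bit requantization, whereas a real pipeline would low-pass filter before decimating to avoid aliasing.

```python
import numpy as np

def downsample_8bit(signal_16k, factor=2):
    """Crudely convert a 16 kHz signal to 8 kHz, 8 bits.
    (A real pipeline would low-pass filter first to avoid aliasing.)"""
    decimated = signal_16k[::factor]              # keep every 2nd sample
    peak = np.max(np.abs(decimated))              # scale to the full 8-bit range
    return np.round(decimated / peak * 127).astype(np.int8)

fs = 16000
t = np.arange(fs) / fs                            # one second of signal
tone = np.sin(2 * np.pi * 440 * t)                # 440 Hz test tone
low_res = downsample_8bit(tone)
print(len(low_res), low_res.dtype)                # 8000 samples, int8
```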


Physically, the speech signal (indeed, all sound) is a series of pressure changes in the medium between the sound source and the listener. The most common representation of the speech signal is the oscillogram, often called the waveform. Here the horizontal axis is time, running from left to right, and the curve shows how the pressure in the signal increases and decreases. The utterance used for demonstration is "phonetician", American English, spoken by an adult male. The utterance has been transcribed using the IPA phonetic alphabet, the most commonly used system. The signal has also been segmented, so that each phoneme in the transcription is aligned with its corresponding sound event. Note that the nine vertical lines are not part of the speech signal; they are the segmentation points.
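The relation between segmentation points and samples can be sketched as follows. The boundary times are hypothetical values chosen only for illustration; the idea is simply that a segment given in seconds maps to a sample range via the sample frequency.

```python
import numpy as np

fs = 16000                                    # sample frequency, Hz
boundaries = [0.00, 0.08, 0.15, 0.27]         # hypothetical segmentation points (s)

def phoneme_slice(signal, start_s, end_s, fs):
    """Return the samples lying between two segmentation points."""
    return signal[round(start_s * fs):round(end_s * fs)]

signal = np.zeros(round(0.3 * fs))            # stand-in for the recording
seg = phoneme_slice(signal, boundaries[1], boundaries[2], fs)
print(len(seg))                               # 0.07 s at 16 kHz = 1120 samples
```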


Another representation of the speech signal is the one produced by a pitch analysis. Speech is normally viewed as a physical process consisting of two parts: a sound source (the vocal cords) and a filter (the tongue, lips, teeth, etc.). The pitch analysis tries to capture the fundamental frequency of the sound source by analyzing the final speech utterance. The fundamental frequency is the dominant frequency of the sound produced by the vocal cords. This analysis is quite difficult to perform. There are several problems in deciding which parts of the speech signal are voiced and which are not. It is also difficult to untangle the speech signal, determining which oscillations originate from the sound source and which are introduced by the filtering in the mouth. Several algorithms have been developed, but none has been found that is efficient and correct in all situations. The fundamental frequency is the strongest correlate of how the listener perceives the speaker's intonation and stress.
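One classic family of pitch algorithms is based on autocorrelation: a voiced frame correlates strongly with itself shifted by one fundamental period. The sketch below is a bare-bones version; the 0.3 voicing threshold and the search range are illustrative assumptions, not a recommended design, and real algorithms need much more care.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch estimate for one frame.
    Returns None when the best peak is weak (frame likely unvoiced)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo = int(fs / fmax)                   # shortest period searched
    lag_hi = int(fs / fmin)                   # longest period searched
    peak = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
    if ac[peak] < 0.3 * ac[0]:                # crude voicing threshold
        return None
    return fs / peak

fs = 16000
t = np.arange(int(0.04 * fs)) / fs            # one 40 ms frame
frame = np.sin(2 * np.pi * 120 * t)           # stand-in for a voiced frame
print(estimate_f0(frame, fs))                 # close to 120 Hz
```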

In the picture, the fundamental frequency (often called F0, consistent with the names of the formants F1, F2, etc.) is plotted against time. The F0 curve is visible only where the speech is voiced, i.e. where the vocal cords vibrate. The values of F0 lie between 100 and 150 Hz, which is typical for a male speaker. The usual F0 range is 80-200 Hz for males and 150-350 Hz for females; naturally, there is great variation in these figures.


According to Fourier theory, any periodic waveform can be described as the sum of a number of simple sine waves, each with a particular amplitude, frequency and phase. The spectrum gives a picture of the distribution of frequency and amplitude at one moment in time. Note that this picture has no time scale: the horizontal axis represents frequency, and the vertical axis amplitude. If we want to plot the spectrum as a function of time, we need a way of representing a three-dimensional diagram. One such representation is the spectrogram.
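The spectrum of a short stretch of signal can be computed with the discrete Fourier transform. The sketch below uses a synthetic two-component signal in place of a real vowel frame; the 500 and 1500 Hz components and the Hann window are illustrative choices.

```python
import numpy as np

fs = 16000
t = np.arange(1024) / fs
# Synthetic frame: a strong 500 Hz component plus a weaker one at 1500 Hz.
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

window = np.hanning(len(x))                   # taper to reduce spectral leakage
spectrum = np.abs(np.fft.rfft(x * window))    # amplitude per frequency bin
freqs = np.fft.rfftfreq(len(x), d=1 / fs)     # frequency axis, 0..fs/2

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)                                # the strongest component, 500 Hz
```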

The picture shows the spectrum 0.15 seconds into the utterance, at the beginning of the "o" vowel.


In the spectrogram, time is the horizontal axis and frequency the vertical axis. The third dimension, amplitude, is represented by shades of darkness. Think of the spectrogram as a series of spectra placed in a row and viewed "from above", with the peaks of each spectrum appearing as dark spots.
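A spectrogram can be sketched as a short-time Fourier transform: the signal is cut into overlapping windowed frames, and one spectrum is computed per frame. The frame length and hop size below are illustrative choices.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram: one column of |FFT| per windowed frame."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    cols = [np.abs(np.fft.rfft(x[i * hop:i * hop + frame_len] * win))
            for i in range(n_frames)]
    return np.array(cols).T                   # rows: frequency, columns: time

fs = 8000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)   # one second of 1 kHz tone
S = spectrogram(x)
print(S.shape)                                # (129, 61): 129 bins x 61 frames
```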

From the picture it is obvious how different the speech sounds are from a spectral point of view. In the unvoiced fricatives, the energy is concentrated at high frequencies and is quite disorganized (noise-like) in appearance. In other unvoiced sounds, e.g. the plosives, much of the speech sound actually consists of silence until strong energy appears across many frequency bands, as an "explosion". The voiced sounds appear more organized: the spectral peaks (dark spots) form horizontal bands across the spectrogram. These bands mark frequencies at which the shape of the mouth gives resonance to the sound. They are called formants, and are numbered from the bottom up as F1, F2, F3, etc. The formant positions differ between sounds and can often be predicted for each phoneme.
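Formant positions are often estimated with linear predictive coding (LPC): the filter of the source-filter model is approximated by an all-pole filter, and the pole angles give candidate formant frequencies. The sketch below uses the autocorrelation method on a synthetic signal with a single known resonance; the model order and the 90 Hz cutoff for near-DC roots are illustrative assumptions.

```python
import numpy as np

def lpc_formants(x, fs, order=8):
    """Rough formant candidates from LPC (autocorrelation method)."""
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Solve the normal equations R a = -r[1:] for the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))
    roots = [z for z in np.roots(a) if z.imag > 0]   # one per conjugate pair
    freqs = sorted(np.angle(roots) * fs / (2 * np.pi))
    return [f for f in freqs if f > 90]              # drop near-DC roots

fs = 8000
n = np.arange(400)
x = 0.98 ** n * np.cos(2 * np.pi * 700 * n / fs)     # one damped 700 Hz resonance
print(lpc_formants(x, fs, order=2))                  # one value near 700 Hz
```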

Press each phoneme in the image or in the transcription to hear the corresponding sound.


The waterfall spectrogram is another way of viewing the three-dimensional plot of time, frequency and amplitude. The plot is viewed diagonally, with time and frequency along the bottom axes and the amplitude visible for each spectrum. The formants can be seen as sequences of peaks across consecutive spectra.


Normal writing (orthography) is one way of transcribing speech. For many years, however, phoneticians have used other alphabets for transcribing speech, alphabets that try to maintain a close relation between the printed characters and the actual sounds. The most widely used is the IPA (International Phonetic Alphabet) system, maintained by the International Phonetic Association. It originates from the late 19th century, and the latest revision was in 1989. The IPA system tries to provide a character for each phoneme in every human language, plus diacritic marks for variations on these.

A transcription may be more or less narrow. It may record how a word is generally uttered in a particular language, or it may try to capture the actual variation in how a person said something on one particular occasion. Which level the researcher chooses to work at depends on the aim of the research.

Press each phoneme in the image or in the transcription to hear the corresponding sound.


Listen to the whole utterance, "phonetician".


This page is very much inspired by the Speech Visualisation Tutorial, Leeds University.



Marcus Filipsson, 951211