Hisako Murakawa and Stephen Lambacher
Japanese learners of English often experience great difficulty in pronouncing an acceptable English [r] sound. A computer visual display can be effective as a teaching and learning tool to help Japanese improve their pronunciation of English [r]. A visual display familiarizes learners with the frequency levels of sounds by enabling them to associate the patterns on the screen with the sounds they are producing. This paper reports on the results of a test administered to 40 Japanese university students studying how to pronounce an acceptable American English [r] sound using Electronic Visual Feedback (EVF). Post-test results show general improvement in production of American English [r] after students were instructed and exposed to cross-linguistic data of the target sound using EVF.
Trying to differentiate between [r] and [l], let alone pronounce an acceptable English [r] sound, can be a nightmare for a majority of Japanese learners. One problem for Japanese learners in acquiring a contrast such as [r] and [l] is L1 interference (Flege, 1995; Lively et al., 1993; Best and Strange, 1992). In the Japanese sound system, the [r] and [l] sounds are not distinguished as separate phonemes as they are in English.
A computer visual display (hereafter referred to as Electronic Visual Feedback or EVF) can be effectively used as a tool to help L2 learners improve their pronunciation of the English [r] sound. Researchers agree that a combination of auditory and visual feedback can be effective in teaching sound segmentals, suprasegmentals, and other aspects of pronunciation (de Bot, 1981; Molholt, 1988; Anderson-Hsieh, 1992). Computer programs can provide the language learner with real-time information about the salient acoustic properties of pronunciation. By showing the exact features that need changing, a visual display can instantly provide an objective measurement by which students and teachers can evaluate students' pronunciation errors and progress. The main advantage of EVF is that it allows students to visualize their pronunciation by associating the patterns on the screen with the sounds they are producing.
This paper focuses on the use of EVF in helping Japanese learners improve their pronunciation of American English [r] (hereafter referred to as AE [r]). The post-test results show general improvement in production of [r] after students were instructed and exposed to EVF training. This paper is also a follow-up report to a previous article by Murakawa (1995) that describes the main features of the Language Media Laboratory (LML) for teaching English pronunciation.
The formants (F1, F2, F3) of Japanese production of the target sounds were examined and compared. Formant frequencies were calculated from an FFT frequency response with a preemphasized, low-pass filter. A sampling frequency of 19 kHz with a frequency range between 0 Hz and 5,000 Hz was used.
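As a rough illustration of this kind of analysis, the sketch below applies first-order pre-emphasis to one frame of samples and computes a magnitude spectrum up to 5,000 Hz at a 19 kHz sampling rate. It is not the sound analyzer's actual algorithm: the pre-emphasis coefficient (0.97), the Hann window, the function names, and the synthetic test tone are all assumptions, the low-pass filtering step is represented only by discarding bins above 5,000 Hz, and a plain DFT stands in for the FFT (the values are the same, only slower to compute).

```python
import cmath
import math

SAMPLE_RATE_HZ = 19_000  # sampling frequency given in the text
MAX_FREQ_HZ = 5_000      # upper limit of the analysed range given in the text


def preemphasize(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1].

    The coefficient 0.97 is an assumed, commonly used value.
    """
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def magnitude_spectrum(samples, sample_rate=SAMPLE_RATE_HZ, max_freq=MAX_FREQ_HZ):
    """Return (frequency_hz, magnitude) pairs from 0 Hz up to max_freq."""
    n = len(samples)
    # Hann window (an assumption) to reduce leakage between neighbouring bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if freq > max_freq:
            break  # keep only the 0-5,000 Hz range
        value = sum(
            windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        spectrum.append((freq, abs(value)))
    return spectrum


def strongest_peak_hz(spectrum):
    """Frequency of the bin with the largest magnitude."""
    return max(spectrum, key=lambda pair: pair[1])[0]


# Synthetic 10 ms frame containing a single 1500 Hz tone (not real speech).
frame = [math.sin(2 * math.pi * 1500 * i / SAMPLE_RATE_HZ) for i in range(190)]
peak = strongest_peak_hz(magnitude_spectrum(preemphasize(frame)))
print(f"strongest component near {peak:.0f} Hz")
```

Real formant measurement needs more than locating the single strongest bin, since speech has several resonances and a harmonic structure; the sketch only shows how the sampling rate, pre-emphasis, and frequency range fit together.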
The sound analyzer can perform an acoustic analysis of speech with functions for viewing amplitude, intonation, and pitch, and for viewing and measuring duration and frequency. The main component is the spectrogram, which was the main teaching tool used in EVF instruction. Using a spectrographic display, students recorded and visualized their pronunciation, learning to interpret the basic acoustic patterns of the [r] sound within various contexts.
Cross-linguistic (English and Japanese) data samples containing AE [r] were presented for students to visualize the acoustic differences between the sounds on their monitors as produced by both native speakers and Japanese speakers, and to compare their own production with the teacher's. Native speaker recordings of words and sentences were provided for students to practice recording and analyzing through sound files copied and stored into the computer network system, and from the teacher who could model any sound, word, or sentence and electronically send it to each workstation.
Teacher feedback and instruction of the [r] sound was provided in the classroom via a headphone/mic system while students practiced recording and analyzing the target sound. During the training period, students were allowed to visit the laboratory during after-school hours to independently practice their pronunciation of the target sound using the sound analyzer. See Murakawa (1995) for a detailed description of how the Language Media Laboratory (LML) training system is used for teaching English pronunciation.
According to Flege et al. (1995), Japanese [r] is often articulated between an English [l], [r], [d], and sometimes [w]. When Japanese speakers produce an AE [r] as [l], the tongue tip touches the roof of the mouth. In an accurate AE [r], however, the tongue does not make complete contact with the roof of the mouth but produces a narrow passage resulting in consonantal obstruction (Chomsky and Halle, 1968). Vance (1987) describes Japanese [r] as an apico-alveolar tap that is palatal when pronounced with [i] and [y]. Jones (1967) refers to Japanese [r] as a flap sound.
AE [r] is characterized by a low third formant (F3) frequency below 2000 Hz, which makes it easy for students to identify on a spectrographic display. Flege et al. (1995) report the following formant frequencies for [r] (measured at the point of onset of one or more formants, in Hz) by native speakers of English: F1 456; F2 1121; F3 1750. The formant values of Japanese speakers are higher: F1 470; F2 1311; F3 2261.
AE [r] has a low F3 value, which has an effect on neighboring vowels. The lower F3 also results in F1 and F2 being lowered. In word-initial position AE [r] is not influenced much by the following vowel, and the F3 reaches approximately the same value for all vowels in approximately the same amount of time (Olive, 1994).
Figure 1 shows a spectrogram of the word read as recorded by a native speaker of English. Notice how the first three formants are parallel to one another in a low position (F1 341; F2 1101; F3 1550). This formant structure is typical of AE [r] when it is articulated with rounded lips and a narrow groove (see Murakawa, 1992).
The tongue position is very important when determining the acoustic properties of AE [r]. AE [r] is produced with a curved tongue that is raised toward and touching the alveolar ridge only at the sides of the tongue. These articulatory influences are clearly visible on a spectrographic display.
The F3 value of AE [r] is low, closely positioned to F2, and rises as a result of the vowel transition (see Figure 1).
Although the F1 and F2 values of [r] are similar to those of [l], the F3 value of [l] is about 1 kHz higher (Kent, 1994). [r] also differs in that it has a short steady state and a long transition, whereas [l] has just the opposite structure.
Figure 1. A spectrogram of the word "read" as recorded by a native speaker of English.
A Japanese speaker typically articulates AE [r] with a slightly flat tongue and the tip touching on or near the alveolar ridge or in a prepalatal position. This alters the formant structure of AE [r] and results in it more closely resembling an [l] or flap sound.
Three errors of AE [r] produced by Japanese speakers are illustrated in Figures 2 through 4. A common error made by Japanese is the flap sound. (See Figure 2.) A flap occurs when the tongue tip is briefly held against the prepalatal area and is suddenly released. There is a clearly distinct fundamental frequency line at the baseline which indicates the presence of voicing just before the tongue begins to release. Notice also the clearly visible plosive line that occurs vertically right before the vowel transition.
The three formants of [l] are shown in Figure 3. The first two formants occur parallel on the baseline (F1 417; F2 987) while the F3 value is much higher (3417).
[w] can also be substituted for AE [r] by Japanese speakers. Notice in Figure 4 how the F3 value of AE [r] when produced as a [w] increases by more than 1,000 Hz to nearly 3,000 Hz.
Figures 2-4. Spectrograms illustrating three common errors of word-initial AE [r] as recorded by a native speaker of Japanese. All three spectrograms are of the word read.
The results of student production of word-initial AE [r] are shown in Table 1. F3 levels were used as a measurement to indicate either success or failure in pronouncing AE [r]. Four separate "Levels" are listed at the top left of the table. Each level is considered an accurate production of [r], with accuracy decreasing from Level 1 to Level 4. If a student's production fell within the range of Level 1, the F3 value of [r] was between 1400 and 1700 Hz and the attempt was considered near native-like. Each succeeding Level contains utterances whose F3 frequency lies a further 100 Hz (plus or minus) outside the Level 1 target and whose [r] production is correspondingly less native-like. Hence, Level 2 lists the number of [r] productions with an F3 value between 1300 and 1399 Hz or 1701 and 1800 Hz; Level 3 is between 1200 and 1299 Hz or 1801 and 1900 Hz. Level 4 includes utterances with an F3 value above 1900 Hz that were acoustically closer in resemblance to [r] than to any other sound.
The number of student errors is listed at the bottom of Table 1. An utterance was considered an error if its F3 value was higher than 2100 Hz. The three most common errors of word-initial [r] were the flap, [l], and [w].
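The scoring bands described above can be written as a small lookup, sketched here under stated assumptions: the measured quantity is taken to be the third formant (F3), which the reported values suggest; Level 4 is read as running from just above 1900 Hz up to the 2100 Hz error threshold; and values below 1200 Hz, which the description does not cover, are returned as `None`. The function name is hypothetical, and the text's additional requirement that a Level 4 utterance sound closer to [r] than to any other sound is a perceptual judgment that a frequency threshold alone cannot capture.

```python
def classify_f3(f3_hz):
    """Map a measured F3 (Hz) of word-initial AE [r] to a scoring label."""
    if 1400 <= f3_hz <= 1700:
        return "Level 1"  # near native-like
    if 1300 <= f3_hz < 1400 or 1700 < f3_hz <= 1800:
        return "Level 2"
    if 1200 <= f3_hz < 1300 or 1800 < f3_hz <= 1900:
        return "Level 3"
    if 1900 < f3_hz <= 2100:
        return "Level 4"  # upper bound of 2100 Hz is an assumed reading
    if f3_hz > 2100:
        return "error"
    return None  # below 1200 Hz: not covered by the description


# F3 values quoted elsewhere in the paper, used here only as sample inputs.
for label, f3 in [("Figure 1 [r]", 1550), ("Japanese mean (Flege et al.)", 2261)]:
    print(label, "->", classify_f3(f3))
```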
Notice that only one native-like [r] was produced in the pretest. After EVF training, however, the number of successful attempts in Level 1 increased to 11. The number of utterances in Level 2 increased from 2 to 18 in the post-test. For the 15 utterances in Level 4, it can be assumed that the subjects neither rounded their lips nor made a narrow groove, but that they at least succeeded in not touching the alveolar ridge or prepalatal region with their tongue.
The number of flap substitutions for AE [r] decreased from 38 to 3 in the post-test. On the other hand, many students still substituted [l] for AE [r], as evidenced by the smaller decrease in [l] substitutions, from 25 to 15. Overall, the total number of successful productions of word-initial [r] within all four levels increased from 14 to 57, and the total number of errors decreased from 65 to 22.
Table 1. Student Production of Initial AE [r]


Table 1. A comparison of results of students' production of initial AE [r] from the pretest and post-test.
Figures 5 and 6. Spectrograms showing improvement in student production of [r] within the word read. The spectrogram in Fig. 5 is from the pretest and the spectrogram in Fig. 6 is from the post-test.
[r] cluster is one of the more difficult contexts in which Japanese learners must produce the [r] sound, because the Japanese sound system contains no word-initial consonant clusters. Another reason is that the duration of AE [r] is much shorter when it occurs in a cluster, making it harder for Japanese learners to perceive and produce accurately.
[r] cluster is more difficult to measure acoustically than word-initial [r]. Its acoustic properties are similar to those of [r] in other environments (a low F3 emphasis with a sudden rise in the F2 and F3 formants just before the vowel transition). But [r] cluster differs in that it has a shorter steady state, and the F2 and F3 values rise rapidly, making it more difficult to detect on a spectrogram (Sheldon and Strange, 1982).
Figures 7 and 8. Spectrograms of the word programming as recorded by both a native English speaker and Japanese speaker, respectively.
The evaluation of [r] cluster during both the pretest and post-test was based on the presence or absence of the [r] pattern, rather than on a measurement of the formant structure as in the other contexts of AE [r]. Figure 7 shows [r] cluster within the word programming produced by a native speaker of English. [r] occurs within the first two syllables of the word, after the aspiration of [p] and after [g]. Figure 8 is a spectrogram of the word programming, which is pronounced as [pu ogu amingu] by a Japanese speaker.
The pretest and post-test results of [r] cluster are listed in Table 2. The number of successful productions is shown at the top and the number of errors is listed at the bottom. The most common error is the flap sound, followed by [l]. Notice that the total number of successful productions of [r] cluster more than doubled, from 52 to 105, in the post-test. A total of 77 flap sounds were substituted for [r] cluster in the pretest, but only 26 in the post-test. The 29 [l] substitutions produced in the pretest did not decrease in the post-test, however.
Table 2. Student Production of AE [r] cluster


Before analyzing student production of vowel + [r],
we will briefly examine the phonological and acoustic
structure of
Japanese vowels. The Japanese sound system
contains the five phonemically distinct vowels
[a,i,u,e,o]. Japanese learners often substitute a long vowel sound for AE
[r] when it occurs after a vowel, including the vowels
[a] and [o].
Japanese vowels are typically short, but they can
also be produced as long vowels within certain contexts.
In order to highlight
the spectral movements of the first three formants,
particularly the
frequency of the vowels [u] and [o],
Figure 9 shows [a,i,u,e,o] as produced
continuously by a native speaker of Japanese.
Figure 9. A spectrogram showing the Japanese vowels [a, i, u, e, o] produced continuously by a native speaker of Japanese to highlight the movements of the first three formants.
Figures 10 through 14 show spectrograms of the five Japanese vowels. In Figure 10 the first three formants of [a] are: (F1 911; F2 1367; F3 2924). The tongue is in a low position in a vocal tract configuration of [a], resulting in an F3 value of approximately 3,000 Hz and an F2 value of close to 1400 Hz. When Japanese [a] is substituted for [r] + vowel, the acoustic differences between the two sounds are easy to visualize on a spectrographic display.
The high vowel [i] is produced with a tongue position close to English [i], which results in a high F2 and F3 frequency. Figure 11 lists the following formants for [i]: (F1 341; F2 2840; F3 3721).
The high-back vowel [u] is articulated with the highest point of the tongue assuming a more forward position than a typical English [u], and also involves slightly less lip-rounding. Figure 12 lists the following formants for [u]: (F1 341; F2 1101; F3 2810).
The mid-front vowel [e] is pronounced similarly to English [ɛ], as in the word pet. Japanese [e] has an F3 value very similar to that of Japanese [a] (around 3,000 Hz) and an F2 about 1,000 Hz higher than that of [a]. Figure 13 lists the formants of [e]: (F1 645; F2 2354; F3 3113).
Japanese [o] can vary in articulation, but it is usually produced between an English [ou] and [o]. [o] has a high F3 value like other Japanese vowels but is distinguished by its low F1 and F2 values. Figure 14 lists the following formants for [o]: (F1 569; F2 987; F3 2924).
Figures 10-14. Spectrograms of Japanese [a, i, u, e, o] as produced by a native speaker of Japanese: Figure 10 [a]; Figure 11 [i]; Figure 12 [u]; Figure 13 [e]; Figure 14 [o].
The formant values of [r] are generally higher in post-vocalic
than in word-initial positions, particularly the
frequency.
Post-vocalic [r] often influences
the preceding vowel by changing its formant structure, which results in an
[r]-coloring of the vowel.
The F3 value of post-vocalic [r] drops rapidly at the boundary, more so than do the other two formant values. Olive (1994) describes this sudden drop in frequency as an indication of anticipatory retroflexion.
Figure 15 is a spectrogram of the word hardware
as produced by a native speaker of Japanese. The
frequency of
the target sound decreases rapidly from around 2800 Hz to 1700
Hz pushing the
level down, while a
slight rise occurs in the
frequency.
AE [r] is difficult for Japanese to pronounce within a post-vocalic position, partly because the Japanese language does not contain consonants in syllable-final positions (except in the case of nasal [n]). Japanese tend to completely drop [r] and replace it with a vowel sound. In Figure 15 observe how the long vowel sound [aa] is substituted for [ar] in the beginning of the word hardware, for example.
Notice that the F3 value of [aa] is similar to that of Japanese [a] (F3 2734) and the first two formants also occur parallel to one another (F1 987, F2 1594).
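One hedged way to make this comparison concrete is to ask which of the five Japanese vowels lies closest to a measured formant triple. The sketch below uses the formant values listed for Figures 10 through 14, read in the order F1, F2, F3, and a plain Euclidean distance in Hz. That distance measure and the function name are assumptions made for illustration; they are not part of the study's method, which relied on visual inspection of spectrograms.

```python
import math

# (F1, F2, F3) in Hz as listed for Figures 10-14.
JAPANESE_VOWELS = {
    "a": (911, 1367, 2924),
    "i": (341, 2840, 3721),
    "u": (341, 1101, 2810),
    "e": (645, 2354, 3113),
    "o": (569, 987, 2924),
}


def nearest_vowel(formants, reference=JAPANESE_VOWELS):
    """Return (vowel, distance_hz) for the closest reference vowel.

    Euclidean distance in Hz is an assumed, simplistic similarity measure.
    """
    vowel = min(reference, key=lambda v: math.dist(formants, reference[v]))
    return vowel, math.dist(formants, reference[vowel])


# The [aa] substituted for [ar] in hardware, and the native [r] of Figure 1.
print(nearest_vowel((987, 1594, 2734)))
print(nearest_vowel((341, 1101, 1550)))
```

With these inputs the substituted [aa] falls nearest to Japanese [a], while the native-speaker [r] from Figure 1 sits more than 1,000 Hz away from every Japanese vowel, almost entirely because of its low F3.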
In Figure 16, the F3 of [ar] in hardware neither drops nor moves closer to the second formant as it does in the [ar] produced by the native speaker of English. Japanese speakers also sometimes substitute a long vowel [o] (produced as [oo]) for [or], as in the word report.
Figures 15 and 16. Spectrograms of the word hardware as recorded by a native Japanese speaker and English speaker, respectively.
The results of vowel + [r] are listed in Table 3. "Levels" and errors are presented in the same way as in Table 1. In general, production of post-vocalic [r] improved significantly from the pretest to the post-test. Although only three productions of the target sound were classified as native-like (Level 1) in the pretest, there was a significant increase in successful productions of vowel + [r] in the post-test. A similar increase occurred in Level 2 (2 to 18) in the post-test. Overall, the total number of successful productions of vowel + [r] from all four Levels increased from 22 to 67 in the post-test.
Though the total number of errors decreased from 134 to 92 in the post-test, the results reveal that many students still struggled to improve their production of the target sounds, even after exposure to EVF instruction. The three most common errors in the pretest were [aa] (65), [ea] (34), and [oa] (25). In Figure 16, notice in the spectral pattern of the word hardware how [ar] is replaced by [aa] and [ea] in the first and second syllables, respectively.
Table 3. Student Production of vowel + [r]


A Japanese speaker often pronounces password as [passuwaad] (see Figure 17). The high F3 value of [r] in password (2800) is more indicative of the F3 value of Japanese [a]. The F3 value of [or] in password is much lower (1784) when produced by a native speaker of English. Japanese speakers can also pronounce unstressed [o] as a schwa [ə].
Figures 17 and 18. Spectrograms of the word password as recorded by a native Japanese speaker and English speaker, respectively.
The results of stressed and unstressed [3] and [o] shown in Table 4 are presented in the same manner as in Table 1. The results show that students improved their pronunciation of the target sounds. The number of successful productions increased from 37 to 99 in the post-test; the number of attempts within Level 1, in particular, increased from 9 to 29.
Though the total number of errors decreased by more than half (122 to 60) during the post-test, it is evident that some students still struggled in pronouncing stressed [3] and unstressed [o]. The most common error was [aa], of which there were 115 instances.
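To put the totals reported for Tables 1, 3, and 4 on a common footing, the short sketch below converts each pair of success and error counts into a success rate. The counts are those quoted in the text; the dictionary layout and function name are hypothetical, and the rate (successes divided by successes plus errors) is a descriptive summary added here, not a statistic reported by the study.

```python
# (successes, errors) as reported in the text for pretest and post-test.
RESULTS = {
    "Table 1: word-initial [r]": {"pre": (14, 65), "post": (57, 22)},
    "Table 3: vowel + [r]": {"pre": (22, 134), "post": (67, 92)},
    "Table 4: stressed [3] and [o]": {"pre": (37, 122), "post": (99, 60)},
}


def success_rate(successes, errors):
    """Share of all scored utterances that counted as successful."""
    return successes / (successes + errors)


for context, tests in RESULTS.items():
    pre = success_rate(*tests["pre"])
    post = success_rate(*tests["post"])
    print(f"{context}: {pre:.1%} -> {post:.1%}")
```

Table 2 is left out because the text gives its success totals (52 and 105) but not its total error counts.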
Table 4. Student Production of Stressed [3] and [o]


Some questions may arise with regard to the type of pretest and post-test that was administered to assess the effectiveness of EVF. One might argue that the computer science words used in the evaluation, which were unfamiliar to students at the time of the pretest, had become familiar to them by the end of the training period, and that this would naturally result in more accurate production of [r] during the post-test. The authors acknowledge that this factor may have contributed to improvement in some students' pronunciation during the post-test, but do not feel it was a significant reason for their overall improvement. Another concern is the use of speech read from a text during the testing, as opposed to spontaneous speech, because Japanese speakers may produce speech from a text with a strategy different from the one they use without a text. This discrepancy warrants further investigation.
Although student production of AE [r] significantly improved in each of the contexts, one must be realistic about how much improvement of AE [r] can be expected from Japanese learners within a classroom environment. Some researchers (Lively et al., 1992; Flege et al., 1995) have shown that Japanese learners will begin to develop accurate production and perception of AE [r] only after a significant amount of time spent overseas in an English-speaking environment with a great deal of exposure to a variety of native speakers. In light of the evidence presented here, however, we propose that though it is indeed an extremely difficult and challenging task to train Japanese learners to produce an acceptable AE [r] within a laboratory setting, it is not an entirely impossible one.
Best, C. & Strange, W. (1992) Effects of phonological and phonetic factors on cross-language perception of approximants. Journal of Phonetics 20, 305-330.
Chomsky, N. & Halle, M. (1968). The sound pattern of English. Cambridge, Massachusetts: The MIT Press.
de Bot, C. (1981). Visual feedback of intonation, an experimental approach. In B. Sigurd & J. Svartvik (eds), Proceedings of the Institute of Phonetics (Catholic University of Nijmegen), 3, 24-39.
Flege, J., Takagi, N. & Mann, V. (1995) Japanese adults can learn to produce English /r/ and /l/ accurately. Language and Speech 38 (1), 25-55.
Jones, D. (1967) The phoneme: Its nature and use. Cambridge: Cambridge University Press.
Kent, Ray D. & Read, Charles (1994). The Acoustic Analysis of Speech. San Diego: Singular Publishing Group, Inc.
Ladefoged, Peter (1993). A course in phonetics. New York: Harcourt Brace Jovanovich College Publishers.
Lively, S., Pisoni, D. & Logan, J. (1992) Some effects of training Japanese listeners to identify English /r/ and /l/. In Tohkura, Y., Vatikiotis-Bateson, E. & Sagisaka, Y. (eds), Speech Perception, Production and Linguistic Structure. Tokyo: Ohmsha; Amsterdam: IOS Press, 175-196.
Lively, S., Logan, J. & Pisoni, D. (1993) Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environments and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America 94 (3), 1242-1255.
Lively, S., Pisoni, D., Yamada, R., Tohkura, Y. & Yamada, T. (1994) Training Japanese listeners to identify English /r/ and /l/. III: Long-term retention of new phonetic categories. The Journal of the Acoustical Society of America 96 (4), 2076-2087.
Molholt, Garry (1988) Computer-assisted instruction in pronunciation for Chinese speakers of American English. TESOL Quarterly, 22 (1), 91-111.
Murakawa, Hisako (1992) The pronunciation of English [r]. The Budo University Journal, pp. 95-106.
Murakawa, Hisako (1995) The pronunciation courseware project. In Izzo, J. (Ed.), Center for Language Research 1994 Annual Review, pp. 1-6. Aizu-Wakamatsu, Fukushima: The University of Aizu.
Olive, J., Greenwood, A. & Coleman, J. (1993) Acoustics of American English speech. New York: Springer-Verlag.
Sheldon, A. & Strange, W. (1982) The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3, 243-261.
Vance, Timothy J. (1987). An Introduction to Japanese Phonology. New York: State University of New York Press.
Yamada, R. & Tohkura, Y. (1992) Perception of American English /r/ and /l/ by native speakers of Japanese. In Tohkura, Y., Vatikiotis-Bateson E. & Sagisaka, Y. editors, Speech Perception, Production and Linguistic Structure. Tokyo: Ohmsha; Amsterdam: IOS Press, 155-174.
2. I am a computer hardware student.
3. The printer is off line.
4. Please read unit three.
5. I bought a word processor.
6. I finished my report.
7. My terminal is not working.
8. Programming is a keyword.
9. The library uses a bar code system.
10. This room is very warm.