Google Research Blog
The latest news from Research at Google
Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)
Tuesday, September 08, 2015
Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages
Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.
Project Unison is an on-going Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.
One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often mistake voice talent work to be similar to a news reader, but it is highly specialized and the work can be very difficult.
Such investments in time and money may yield great audio, but the catch is that even if you’ve created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We’ve read the
articles
about people who have fallen for their GPS voice to find that they are real people with real names.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.
Crowd-sourcing projects for automatic speech recognition
(ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where one unique speaker, with the standard accent and in a soundproof studio is the basic criteria.
Many years ago,
Yannis Agiomyrgiannakis
, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio.
Knot Pipatsrisawat
, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, playback the recording and even assess the noise level of the room.
From other
past research
in ASR, we knew that the right tool could solve the crowd sourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!
Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.
With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.
In May of this year, we auditioned 15 Bangaldeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice.
Zakaria Haque
, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria’s), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using
PRAAT
proved to be unnecessary with our limited pool of candidates.
All 5 software engineers –
Ahmed Chowdury
,
Mohammad Hossain
,
Syeed Faiz
,
Md. Arifuzzaman Arif
,
Sabbir Yousuf Sanny
– plus
Zakaria Haque
recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan.
HyunJeong Choe
, who had helped with the Korean TTS recordings, directed our volunteers.
Left:
TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers.
Right:
Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice
ChitChat allowed us to troubleshoot on the fly as recordings could be monitored from another room using the admin panel. In total, we recorded 2000 Bangla and English phrases mined from Wikipedia. In 30-60 minute intervals, the participants recorded over 250 sentences each.
In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.
As illustrated in the third image, speaker3 has a drop in energy above 13kHz which is visible in the graph and may be present at speech, distorting the speaker’s voice to sound as if he were speaking through a tube.
Another challenge was that we didn’t have a pronunciation lexicon for Bangla as spoken in Bangladesh. We worked initially with
the publicly available TTS data from the Indian Institute of Information Technology
, but this represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.
Deciding to proceed anyway,
Alexander Gutkin
, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by
Richard Sproat
, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings is undetectable in the parametric voice.
When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile,
Martin Jansche
, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation and lexicon and model that better matches the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.
NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)
An Experiment in Music and Crowd-Sourcing
Friday, May 04, 2012
Posted by Maggie Johnson, Director of Education and University Relations
The
Bodleian Library
is the main research library at the
University of Oxford
. It is also one of the oldest libraries in the world, dating back to the 14th century. But the staff of the Bodleian operates very much in the 21st century, using the latest technology to solve their unique problems.
A few years ago, the library acquired a set of 4,000 popular piano pieces from the mid-Victorian period. There’s very little information available on these pieces, so Bodleian staff decided to use crowdsourcing to collect information on this corpus of music. Through a Google Focused Award, they have digitized and made the entire set of music available online.
By visiting the
What’s-the-Score
website, which opened yesterday, ‘citizen librarians’ can help by describing the scores and contributing to the creation of an online catalogue. They can also include links to audio or video recordings. This is the first time the Bodleian has used this approach to collect catalog information. Typically, a large group of researchers are required to find this information.
Martin Holmes, Alfred Brendel Curator of Music at the Bodleian Libraries, commented: ‘In making the scores available online, they will not only be accessible for academic study and research but will also be there to enjoy for anyone who is interested in various aspects of Victorian music, culture and society.’
The Bodleian Library is one of dozens of recipients of
Google Focused Awards
. These awards are for research in areas of study that are of key interest to Google as well as the research community. These unrestricted gifts are for two to three years, and the recipients have the advantage of access to Google tools, technologies, and expertise. The Bodleian’s experiment in crowdsourcing to build up data on a specialized collection is timely and interesting, given the number of such collections becoming available on the web.
Labels
accessibility
ACL
ACM
Acoustic Modeling
Adaptive Data Analysis
ads
adsense
adwords
Africa
Android
API
App Engine
App Inventor
April Fools
Audio
Australia
Automatic Speech Recognition
Awards
Cantonese
China
Chrome
Cloud Computing
Collaboration
Computational Photography
Computer Science
Computer Vision
conference
conferences
Conservation
correlate
Course Builder
crowd-sourcing
CVPR
Data Center
data science
datasets
Deep Learning
distributed systems
Diversity
Earth Engine
economics
Education
Electronic Commerce and Algorithms
EMEA
EMNLP
Encryption
entities
Entity Salience
Environment
Exacycle
Faculty Institute
Faculty Summit
Flu Trends
Fusion Tables
gamification
Genomics
Gmail
Google Books
Google Drive
Google Science Fair
Google Sheets
Google Translate
Google Voice Search
Google+
Government
grants
HCI
Health
High Dynamic Range Imaging
ICML
ICSE
Image Annotation
Image Classification
Image Processing
Inbox
Information Retrieval
internationalization
Internet of Things
Interspeech
IPython
Journalism
jsm
jsm2011
K-12
KDD
Klingon
Korean
Labs
Linear Optimization
localization
Machine Hearing
Machine Intelligence
Machine Learning
Machine Translation
MapReduce
market algorithms
Market Research
ML
MOOC
NAACL
Natural Language Processing
Natural Language Understanding
Network Management
Networks
Neural Networks
Ngram
NIPS
NLP
open source
operating systems
Optical Character Recognition
optimization
osdi
osdi10
patents
ph.d. fellowship
PiLab
Policy
Professional Development
Proposals
Public Data Explorer
publication
Publications
Quantum Computing
renewable energy
Research
Research Awards
resource optimization
schema.org
Search
search ads
Security and Privacy
SIGCOMM
SIGMOD
Site Reliability Engineering
Software
Speech
Speech Recognition
statistics
Structured Data
Systems
TensorFlow
Translate
trends
TTS
TV
UI
University Relations
UNIX
User Experience
video
Vision Research
Visiting Faculty
Visualization
VLDB
Voice Search
Wiki
wikipedia
WWW
YouTube
Archive
2016
Feb
Jan
2015
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2014
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2013
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2012
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Nov
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Dec
Nov
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Jul
May
Apr
Mar
Feb
2007
Oct
Sep
Aug
Jul
Jun
Feb
2006
Dec
Nov
Sep
Aug
Jul
Jun
Apr
Mar
Feb
Feed
Follow @googleresearch
Give us feedback in our
Product Forums
.