
LTI Projects


There are numerous projects at the LTI in the fields of machine translation; speech; information retrieval; knowledge representation, reasoning, and acquisition; language technologies for education; dialogue; computational biology; and natural language processing / computational linguistics, as well as other interdisciplinary projects.

Information on select older projects is also available.

Machine Translation


The Avenue project has both social and scientific goals in Machine Translation.

Contact: Jaime Carbonell


- Two-way speech-to-speech translation on a handheld computer

This project applies our speech-to-speech translation technology to limited consumer hardware. The Speechalator demonstrator offers two-way speech-to-speech translation from English to Arabic and Arabic to English in the domain of medical interviews, running on a standard consumer iPAQ PDA. This project investigates techniques for rapid development of speech and translation support in new languages, as well as ensuring the results can be used on a truly portable device.

Contacts: Alan W Black, Tanja Schultz and Alex Waibel


- Example-Based Machine Translation

The EBMT project works to extend and improve the example-based paradigm of data-driven MT. Its activities include research on subsentential alignment (in order to more accurately identify the translation of a matched phrase in training data), generalization of examples (to find more and lengthier matching phrases), and context-sensitivity (to find the most appropriate matches in training data). Recent work in this field resulted in a method for modeling selectable subsets of matching examples for an input phrase and a constraint-based optimization technique for setting the thresholds used to select such subsets; together, these two research outcomes permit the exploration and automatic tuning of quality/quantity tradeoffs during translation. As is the case with many of the LTI's MT projects, techniques explored by EBMT researchers in the past have been adopted by colleagues around the world.

Contact: Ralf Brown


- Global Autonomous Language Exploitation

The goal of the GALE program is to develop and apply computer software technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. The LTI's machine translation efforts within this program are aimed at generating the highest-quality translation by improving statistical, example-based, and transfer-based machine translation systems and producing a multi-engine combination of their outputs which outperforms any single translation system used in the combination.

Main: Jaime Carbonell
Architecture: Eric Nyberg
Distillation: Yiming Yang
EBMT: Ralf Brown
Statistical MT: Stephan Vogel


- Knowledge-based Machine Translation

The KANT project was founded in 1989 for the research and development of large-scale, practical translation systems for technical documentation. KANT uses a controlled vocabulary and grammar for each source language, and explicit yet focused semantic models for each technical domain to achieve very high accuracy in translation. Designed for multilingual document production, KANT has been applied to the domains of electric power utility management and heavy equipment technical documentation.

Contacts: Eric Nyberg and Teruko Mitamura

Speech


- Dialogue Management System

The CAMMIA project (A Conversational Agent for Multilingual Mobile Information Access) is focused on research and development of a multi-tasking dialog management system that can be used with automatic speech recognition and VoiceXML to provide mobile information access.

Contacts: Teruko Mitamura and Eric Nyberg

FestVox: Building Synthetic Voices

This project is designed to provide the tools, scripts and documentation to allow people to build synthetic voices for use with general speech applications. Support for English and other languages is provided. Voices produced by these methods run within Edinburgh University's Festival Speech Synthesis System. We are also developing a small, fast synthesis engine suitable for these voices called Flite. This project involves a number of aspects of speech synthesis research, including prosodic modelling, unit selection synthesis, diphone synthesis, text analysis, lexicon representation, and limited domain synthesis. It also provides a forum for research and development of automatic labelling tools and synthesis evaluation tools. Voices built from these methods have been used in other CMU and external projects, such as the CMU DARPA Communicator spoken dialog system and a Croatian synthesizer for the DIPLOMAT/Tongue project.

Contact: Alan W Black


- A Spoken Dialog System For the General Public

The Let's Go! project is building a spoken dialog system that can be used by the general public. While there has been success in building spoken dialog systems that are able to interact well with people (for example, the CMU Communicator system), these systems often work only for a limited group of people. The system we are developing for Let's Go! is designed to work with a much wider population, including groups which typically have trouble interacting with dialog systems, such as non-native English speakers and the elderly.

The Let's Go! project works in the domain of bus information for Pittsburgh's Port Authority Transit bus system. The system provides a telephone-based interface to bus schedules and route information.

Contacts: Maxine Eskenazi and Alan W Black


The PhonBank project seeks to develop a shared web-based database for the analysis of phonological development. There are 50 members of the PhonBank consortium group who are contributing their data to the project. The Phon program facilitates a number of tasks required for the analysis of phonological development. Phon supports multimedia data linkage, unit segmentation, multiple-blind transcription, automatic labeling of data, and systematic comparisons between target (model) and actual (produced) phonological forms. All of these functions are accessible through a user-friendly graphical interface. Databases managed within Phon can also be queried using a powerful search interface. This software program works on both Mac OS X and Windows platforms, is fully compliant with the CHILDES format, and supports Unicode font encoding. Phon is being made freely available to the community as open-source software. It meets specific needs related to the study of first language phonological development (including babbling), second language acquisition, and speech disorders. Phon will facilitate data exchange among researchers and the construction of a shared PhonBank database, another new initiative within CHILDES to support methodological and empirical needs of research in all areas of phonological development.

Contact: Brian MacWhinney


Ravenclaw is an advanced architecture for dialogue management based on a dynamic representation that captures knowledge about the task that humans perform in given domains. It is the base for a number of dialogue system projects in LTI and elsewhere at Carnegie Mellon. Current dialogue management research centers on two topics: 1) developing techniques for self-awareness that allow systems to adaptively detect and recover from misunderstandings, and 2) automating the configuration of dialogue systems by inferring task and dialogue structure from human-human interactions in limited domains.

Contact: Alex Rudnicky

SLT4ID: Speech and Language Technologies for International Development

In underserved communities around the world, spoken language systems are potentially more natural, cheaper to deploy/maintain/upgrade, and place fewer requirements on the user (such as literacy) than traditional PC/GUI-based systems, while still offering valuable services such as information access and education. Our first project in this domain, in collaboration with Aga Khan University (Karachi, Pakistan), focuses on creating a speech user interface for accessing health information resources by community health care workers in Pakistan. We are investigating both audio-only and multi-modal interfaces, and aim to continuously ground the research empirically through user studies with the target population. Through this research, we hope to better understand the levels of literacy (if any) at which speech-based information access interfaces are compelling.

Contact: Roni Rosenfeld

Speech Graffiti

- Universal Speech Interfaces - USI

Human-machine speech-based communication, especially for mobile speech applications and internet speech portals, is fast becoming a reality. Communication with such machines and information servers does not require the full strength of natural language, nor should it have to cope with its ambiguities. What then is the ideal form of human-machine speech communication? Will there develop a particular style for talking to machines? If so, can we help this process along by developing principles for it? In the Universal Speech Interface (USI) project, we develop and test such principles. In essence, we are trying to do for speech communication what Graffiti(tm) has done for mobile text entry (see also The USI Manifesto).

Contact: Roni Rosenfeld


The Sphinx project is an umbrella for research in basic speech technologies. Current activities include real-time adaptive speech recognition in meetings, the exploration of ensemble techniques (creating and using multiple decoders to improve recognition accuracy) and techniques for real-time recognition. Sphinx recognition also supports research in meeting understanding and related activities. The Sphinx recognition code-base is open-source and is used by a number of projects in LTI, elsewhere in the University as well as by a large number of other sites.

Contact: Alex Rudnicky


Speech Processing Interactive Creation and Evaluation Toolkit

Speech technology potentially allows everyone to participate in today's information revolution and can bridge the language barrier. Unfortunately, construction of speech processing systems requires significant resources. With some 4500-6000 languages in the world, speech processing has traditionally been prohibitively expensive for all but the most economically viable languages. In spite of recent improvements in speech processing, supporting new languages is a skilled job requiring significant effort from trained individuals. This project aims to overcome both limitations by providing innovative methods and tools for naive users to develop speech processing models, collect appropriate data to build these models, and evaluate the results, allowing iterative improvements. By integrating speech recognition and synthesis technologies into an interactive language creation and evaluation toolkit usable by unskilled users, speech system generation will be revolutionized. Data and components for new languages will become available to everybody, improving the mutual understanding and the educational and cultural exchange between the U.S. and other countries.

Contacts: Tanja Schultz and Alan W Black


- Speech and Language Based Information Management Environment

Project SUBLIME develops and tests speech- and language-based interfaces for information management. We seek to develop cognitively palatable methods for people to access, add to, and modify both personal and public information spaces. Whereas most existing speech interfaces provide wide functionality in a given narrow domain, a SUBLIME interface seeks to provide a relatively narrow functionality (information management) in unrestricted domains.

Contact: Roni Rosenfeld


Dialogue systems mostly involve one human talking to one machine. What about interaction between the members of a human-robot team? The project focuses on two specific research issues: The management of multi-participant dialogues (touching on issues such as turn-taking) and the development of grounding strategies that allow humans and robots to agree on mutually-understandable descriptions of objects and actions in the context of a treasure hunt.

Contact: Alex Rudnicky


- Spoken Language Communication and Translation System for Tactical Use

In TransTac, we develop algorithms that enable robust, two-way, tactical and spontaneous speech communication between American service personnel and native speakers abroad. In this context, we investigate the rapid deployment of new language pairs (recognition, translation, and synthesis), particularly focusing on low-resource languages and colloquial dialects. Currently, we are working on symmetric two-way translation and Iraqi Arabic, as well as an improved user interface. TransTac builds on our extensive history and experience with speech-to-speech translation systems and is currently implemented on small, ruggedized laptops, but it will run on even smaller devices in the future. TransTac allows us to work on nearly all problems encountered in speech processing and promises to allow for interpersonal communication on topics such as directions, medical aid, and trust building in situations where no communication was possible before.

Contacts: Alan W Black, Florian Metze, Tanja Schultz and Alex Waibel


- Flexible voice synthesis through articulatory voice transformation

We have always wanted our machines to talk to us, but most people have strong preferences for particular voices. Current techniques in speech synthesis can build voices that sound very close to the original speaker, capturing the style, manner and articulation of the source voice. However such systems require many hours of carefully recorded speech and expert tuning to reach an acceptable level of quality.
An exciting new alternative method for building synthetic voices is voice transformation. Here we use an existing recorded database and convert it to a target voice using as few as 10-20 sentences. These techniques offer the potential to make speech synthesizers talk in whatever voice we desire, with significantly less effort required than previous techniques.

This project offers a new direction in voice transformation. Current transformation techniques concentrate on a spectral mapping of the voice, i.e. converting the properties of the speech signal. Instead we can use the underlying positions of the vocal tract articulators (i.e. the position of the teeth, tongue, lips, velum) which give rise to the spectral output of the voice.

Using new statistical modeling techniques, we can successfully predict the positions of a speaker's articulators from the speech signal. Then, in the virtual vocal tract domain, we map between speakers and regenerate the speech for the target voice.

This work enables the easy construction of new synthetic voices, allowing personalization of speech output. It increases our knowledge of the speech generation process and characterizes what makes a voice personal.

Contact: Alan W Black

Also see: Fluency

Information Retrieval

Adaptive Information Filtering

This project automatically monitors a stream of documents (e.g., news stories, newsgroups, etc.) to find just those stories that are interesting to you, learning from examples what kinds of documents you find interesting.

Contacts: Jamie Callan and Yiming Yang
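
As a rough illustration of the adaptive filtering idea described above, the following sketch keeps a bag-of-words profile of a user's interests, updates it from "liked" and "disliked" examples, and delivers a document only when its cosine similarity to the profile clears a threshold. The class name `AdaptiveFilter`, the fixed threshold of 0.2, and the additive update rule are assumptions made for this example; they are not taken from the project's actual system.

```python
from collections import Counter
import math


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class AdaptiveFilter:
    """Hypothetical profile-based filter that learns from user feedback."""

    def __init__(self, threshold=0.2):
        self.profile = Counter()    # running term weights for the user's interests
        self.threshold = threshold  # assumed fixed delivery threshold

    def is_interesting(self, text):
        """Deliver a document only if it is close enough to the profile."""
        return cosine(self.profile, vectorize(text)) >= self.threshold

    def feedback(self, text, liked):
        """Add liked documents to the profile; subtract disliked ones."""
        sign = 1.0 if liked else -1.0
        for term, count in vectorize(text).items():
            self.profile[term] += sign * count


news_filter = AdaptiveFilter()
news_filter.feedback("speech recognition improves for meeting transcription", liked=True)
news_filter.feedback("local football team wins championship", liked=False)

print(news_filter.is_interesting("new speech recognition benchmark released"))  # True
print(news_filter.is_interesting("football championship parade downtown"))      # False
```

A real filtering system would also adapt the threshold over time and use better term weighting; this sketch only shows the learn-from-examples loop.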

Briefing Assistant

The BA project addresses the problem of creating customized summaries based on the preferences and information demands of human report preparers. The BA is learning-based and models both the information selection and feature detection behavior of human summarizers. Current work centers on temporal summarization, creating narrative accounts of events that unfold over time.

Contact: Alex Rudnicky

Distributed Information Retrieval / Federated Search

Hundreds of thousands of specialized search engines are available on the Internet, but the contents of many are hidden from general purpose search engines such as Google. This "hidden" Web is estimated to be at least as large as the more traditional "visible" Web. Distributed Information Retrieval (now often called Federated Search) systems provide a single point of access for documents that are in different formats, in different languages, in different types of search engines, and controlled by other people. This research area also covers large, peer-to-peer networks of heterogeneous digital libraries.

Contact: Jamie Callan

Email Classification and Prioritization

Automatically assigning messages to user-defined folders (classes) based on content, importance, communication threads and users' organization strategies is a new challenge for machine learning. Statistical modeling of multi-type interconnected objects (users, messages, folders, keywords, etc.) is an important step towards the development of a truly useful email classification system.

Contact: Yiming Yang


- Text Mining Techniques for Large Public Comment Databases

Citizens and government administrators need a variety of navigation aids and text analysis tools to help them understand the contents of large public comment databases. These aids and tools include full-text search, automatic construction of browsing hierarchies, frequency analysis of discussion topics, and summarization of similar comments, as well as more complex analysis tools that identify stakeholder communities represented in a set of comments. The underlying technologies are primarily Information Retrieval, Text Datamining, and simple forms of Natural Language Processing.

Contact: Jamie Callan


- Open-Domain Question Answering

Typical IR systems return a set of documents, or perhaps a set of queries. LTI Question Answering software extracts information from documents in large, open-domain corpora to answer questions in subject areas that are not known in advance.

Contacts: Eric Nyberg and Teruko Mitamura


The Lemur Project is a collaborative effort between researchers at the LTI and the University of Massachusetts aimed at creating research infrastructure for a broad, international community. The project provides state-of-the-art baseline algorithms and open-source software that can be extended to support a variety of goals. The Lemur Toolkit’s Indri search engine provides a powerful query language; several state-of-the-art retrieval models; indexing support for metadata, text annotations, and multiple text representations; and an index capable of storing more than a billion documents. Research on personalized search is supported by the Lemur Toolbar, which monitors a person’s search-related activities and performs privacy protection and anonymization before search logs are shared. Each year, dozens of papers at the leading IR conferences report on research that was conducted using tools and data created by the Lemur Project.

Contact: Jamie Callan


- Summarization Integrated Development Environment

In this project, we are developing a configurable summarization environment that uses multi-level analyses of discourse to support a new generation of summarization technology addressing a variety of information management tasks.

SIDE is an infrastructure that facilitates construction of summaries tailored to the needs of the user. It aims to address the issue that there is no such thing as the perfect summary for all purposes. Rather, the quality of a summary is subjective, task dependent, and possibly specific to a user. The SIDE framework allows users flexibility in determining what they find more useful in a summary, both in terms of structure and content. In recent work we have begun to explore statistical approaches to text compression that utilize syntactic dependency features and discourse level features to achieve a higher level of fluency at more severe levels of compression. Our near future plans include exploring the idea of text simplification. An important application area is rapid prototyping of reporting interfaces for on-line discussion facilitators.

Contact: Carolyn Rose

Utility-based Information Distillation

We study supervised, unsupervised and semi-supervised learning techniques for automatically detecting novel events and tracking new trends for relevant events in temporally-ordered documents, for dynamically updating user profiles under context, and for optimizing the utility of passage selection and summarization based on relevance, novelty, readability and user cost (e.g., time). Collaborative and adaptive information filtering among multiple users is also a part of the open challenge.

Contacts: Yiming Yang and Jaime Carbonell
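
To make the relevance/novelty tradeoff mentioned above concrete, here is a minimal sketch of greedy passage selection in the style of maximal marginal relevance. It is a stand-in technique chosen for illustration, not the project's own utility model: it ignores readability and user cost, uses word-overlap (Jaccard) similarity, and the function names and the `trade_off` parameter are assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_passages(query, passages, k=2, trade_off=0.5):
    """Greedily pick k passages, rewarding relevance and penalizing redundancy."""
    selected = []
    candidates = list(passages)
    while candidates and len(selected) < k:

        def utility(passage):
            relevance = jaccard(query, passage)
            # Redundancy is the closest match among passages already chosen.
            redundancy = max((jaccard(passage, s) for s in selected), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(candidates, key=utility)
        selected.append(best)
        candidates.remove(best)
    return selected


passages = [
    "earthquake damage report released today",
    "earthquake damage report released today officially",
    "earthquake relief funds announced",
]
chosen = select_passages("earthquake damage report", passages)
print(chosen)
```

In this example the second passage is nearly a duplicate of the first, so the less relevant but novel third passage is selected instead.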

Also see: TagHelper 2.0

Knowledge Representation, Reasoning, and Acquisition

Dark Matter

- Knowledge Acquisition from Text

LTI is participating in Project Halo, a research effort to design and implement a "Digital Aristotle". Our focus is on the definition of KAL (Knowledge Acquisition Language), a form of controlled language that can be used to acquire domain knowledge from subject matter experts in domains such as Chemistry, Physics and Biology.

Contacts: Eric Nyberg and Teruko Mitamura


- Interlingual Annotation of Multilingual Text Corpora

IAMTC is a multi-site NSF ITR project focusing on the annotation of six sizable bilingual parallel corpora for interlingual content with the goal of providing a significant data set for improving knowledge-based approaches to machine translation (MT) and a range of other Natural Language Processing (NLP) applications. The central goals of the project are: (1) to produce a practical, commonly-shared system for representing the information conveyed by a text, or interlingua (IL), (2) to develop a methodology for accurately and consistently assigning such representations to texts across languages and across annotators, and (3) to annotate a sizable multilingual parallel corpus of source language texts and translations for IL content.

Contacts: Lori Levin and Teruko Mitamura


- Symbolic Knowledge Base

Scone is a high-performance, open-source knowledge-base (KB) system intended for use as a component in many software applications. Scone was specifically designed to support natural language understanding and generation, so our emphasis has been on efficiency, scalability (up to millions of entities and statements), and ease of adding new knowledge, not on theorem-proving or solving logic puzzles. At the LTI, Scone has improved the performance of search engines and document classifiers through the use of background knowledge (which disambiguates references in text and provides synonyms and related words to help the engines and classifiers). The system has also been used to extract events and time relations from free-text recipes, to model the belief states and motivations of characters in children's stories, and to extract meaning from very informal, ungrammatical text and speech. Our long-term goal is to use Scone as the foundation for a true natural-language understanding system and also to develop a very flexible system for planning and reasoning about actions, with Scone as the representation engine.

Contact: Scott Fahlman

Language Technologies for Education

Cycle Talk

- Dialogue technology for supporting simulation based learning

In the CycleTalk project, we are developing a new collaborative learning support approach that makes use of tutorial dialogue technology in the context of a collaborative simulation based learning environment for college level thermodynamics instruction.

In order to encourage productive patterns of collaborative discourse, we are using language technologies to develop an infrastructure for scaffolding the interactions between students in computer supported collaborative learning environments, to help coordinate their communication, and to encourage deep thinking and reflection. Students who work with a partner using this support learn 1.25 standard deviations more than their counterparts working individually in the same environment without the collaboration support, where 1 standard deviation translates into 1 full letter grade. An important part of this work is dialogue technology capable of interacting with groups of humans that is designed to draw out reflection and engage students in directed lines of reasoning. This work builds on previous results demonstrating the effectiveness of these tutorial dialogue agents for supporting learning of individuals working alone in this domain.

Contact: Carolyn Rose

Digital Bridges

- Online education fostering partnerships for research and teaching

The key idea of the Digital Bridges project is to extend existing resources for technology-based education to create a vibrant environment for globalized instructional support and collaborative professional development.

Previous pilot efforts towards encouraging global teaching partnerships have been very high-effort, niche partnerships. This project is unique in that it brings together expertise in technology-related fields such as Artificial Intelligence, Machine Learning, Language Technologies, Robotics, and Computer Supported Collaborative Learning, with expertise in international development to propose a much more scalable, organic option that would complement such efforts. Existing resources developed in the team’s prior research such as state-of-the-art technology for supporting highly effective group learning provide both the infrastructure for the global professional development effort as well as one of the resources participating instructors can use. The technology for computer supported collaborative learning that we begin with has already been tested and proven successful on a small scale with multiple age groups (middle school, high school, and college aged students), multiple domains (psychology, earth sciences, mechanical engineering, and math), and multiple cultures (students in the U.S. and students in Taiwan). Thus, it has proven itself ready for testing in this more challenging, diverse, global on-line environment. The learning sciences principles that have provided the theoretical framework for its development have largely come from research conducted in the U.S. and in Europe. Thus, the proposed research provides the opportunity for testing the generality of findings from educational research primarily conducted in the U.S. and in Europe in the developing world, beginning with a pilot effort in collaboration with IIT Guwahati.

Contact: Carolyn Rose


- Foreign language accent correction

Fluency uses speech recognition (SPHINX II) to help users perfect their accents in a foreign language. The system detects pronunciation errors, such as duration mistakes and incorrect phones, and offers visual and aural suggestions as to how to correct them. The user can also listen to himself and to a native speaker.

Contact: Maxine Eskenazi

The Intelligent Writing Tutor (IWT)

The Intelligent Writing Tutor (IWT) project for ESL learners explores the issue of transfer and long-term retention of acquired knowledge, as part of the PSLC's underlying goal of developing a theory of robust learning. Through a series of learning experiments, we will look at both positive and negative transfer from a student's native language (L1) to English, the effects of an informed knowledge tracer on learning, and the role of level-appropriate feedback in achieving competency.

Contact: Teruko Mitamura

PSLC Fluency Studies

With support from the Pittsburgh Science of Learning Center (PSLC-NSF), we have constructed online tutors for the consolidation of basic skills in second language learners of Chinese and French. These tutors assist with learning Chinese pinyin and correct detection of Chinese segments and tones, acquisition of vocabulary in various languages, practice with the assignment of nominal gender in French, and dictation of French from spoken input. Recent work looks at methods for consolidating fluency in sentence repetition and ways of achieving greater robustness in learning.

Contact: Brian MacWhinney

The REAP Project

- Reader-Specific Lexical Practice for Improved Reading Comprehension

The core ideas of the project are i) a search engine that finds text passages satisfying very specific lexical constraints, ii) selecting materials from an open-corpus (the Web), thus satisfying a wide range of student interests and classroom needs, and iii) the ability to model an individual's degree of acquisition and fluency for each word in a constantly-expanding lexicon so as to provide student-specific practice and remediation. This combination enables research on a wide range of reading comprehension topics that were formerly difficult to investigate.

Contact: Maxine Eskenazi
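
The idea of finding passages that satisfy reader-specific lexical constraints can be sketched as a simple filter: accept a passage only if it contains at least one word the student should practice and if few of its remaining words are unfamiliar. The function name, the tokenization, and the 20% cap on unknown words are assumptions made for this example and do not describe the REAP system's actual criteria.

```python
def satisfies_constraints(passage, target_words, known_words, max_unknown_ratio=0.2):
    """Check a passage against hypothetical reader-specific lexical constraints."""
    # Crude tokenization: split on whitespace and strip surrounding punctuation.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in passage.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    # The passage must exercise at least one word the student is practicing.
    if not any(t in target_words for t in tokens):
        return False
    # Apart from target words, most of the passage should already be familiar.
    unknown = [t for t in tokens if t not in known_words and t not in target_words]
    return len(unknown) / len(tokens) <= max_unknown_ratio


known = {"the", "cat", "sat", "on", "a", "mat", "was", "very"}
targets = {"ubiquitous"}

print(satisfies_constraints("The cat was ubiquitous.", targets, known))
print(satisfies_constraints("The cat sat on a mat.", targets, known))
print(satisfies_constraints(
    "Ubiquitous quantum entanglement perplexes physicists.", targets, known))
```

Only the first passage passes: the second lacks a target word, and the third contains too many unfamiliar words.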


- Supporting virtual math teams with language technologies

In collaboration with Gerry Stahl and the Math Forum at Drexel University, this project seeks to develop a technological augmentation to available human support in a lightly staffed Virtual Math Teams (VMT) environment and to deploy conversational agents that are triggered by automatically detected conversational events and that have the ability to elicit valuable collaborative behavior such as reflection, help seeking, and help provision.

Free on-line learning promises to transform the educational landscape of the United States through a significant broadening of supplemental educational opportunities for low income and minority students who do not have access to high quality private tutoring to supplement their in school education. This research attempts to understand how to structure interactions among peer learners in online education environments using language technologies. It seeks to enhance effective participation and learning in the Virtual Math Teams (VMT) online math service, housed in the Math Forum, a major NSF-funded initiative that specifically targets inner-city, low-income minority students, and reaches over a million kids per month with its various services. This will be accomplished by designing, developing, testing, refining and deploying automated interventions to support significantly less expensive but nevertheless highly effective group facilitation. The key research goal is to experimentally learn broadly applicable principles for supporting effective collaborative problem solving by eliciting behavior that is productive for student learning in diverse groups. These principles will be used to optimize the pedagogical effectiveness of the existing VMT-Basilica environment as one example of their concrete realization. The proposed research will yield new knowledge about how characteristics of the on-line VMT environment necessitate adaptation of approaches that have proven successful in lab and classroom studies in order to achieve comparable success in this challenging environment.

Contact: Carolyn Rose

Also see: Project Listen

Dialogue


- Assessing design engineering project classes with multi-disciplinary teams

This project brings together an interdisciplinary team with expertise in computer supported collaborative learning, language and information technologies, engineering education, and a variety of specific engineering fields in order to develop an infrastructure for supporting effective group functioning in engineering design project based courses.

The increasing emphasis in engineering education on maintaining the competitive advantage of U.S. engineers requires an understanding of how students learn the higher-order engineering skills of problem-solving and design, and the increasingly rapid technological change requires students to develop sophisticated information management skills so that they can build on and repurpose innovations and discoveries made in previous projects. The key to addressing these knowledge building problems is to develop DesignWebs, an infrastructure that supports effective storage and retrieval of documents as they evolve as part of a collaborative design process, and GRASP, an automatic assessment technology that monitors how well design teams are functioning through automatic speech processing. The GRASP unobtrusive assessment technology is designed to facilitate the supporting role the instructor can play in the development of team participation, in that it promotes transparency of group work so that instructors are able to identify groups that need more of their attention and involvement.

Contact: Carolyn Rose


- Reconfigurable multi-party dialogue environment

Based on our experiences with designing and engineering multi-party conversational environments such as collaborative learning systems that involve integrating the state of the art in text classification and conversational agent technology, we are developing a framework that facilitates such integration.

The goal of the instructional approach underlying the design of the VMT-Basilica framework is to maximize the benefit students receive from the interactions they have with one another by providing support for learning and effective collaboration in a way that is responsive to what is happening in the interaction in real time. Previous discourse analyses of collaborative conversations reveal that the majority of those interactions between students do not display the “higher order thinking” that collaborative learning is meant to elicit, and we have found this as well in our own observations in lab and classroom studies, both at the college level and at the middle school level. The literature on support for collaborative learning, and for learning more generally, tells us that scaffolding should be faded over time, that over-scripting is detrimental to collaboration, and that unnecessary support is demotivating. Thus, a major goal of our research is to address these issues with a framework that allows us to track what is happening in the interaction so that the automatically triggered support interventions can respond to it appropriately. Desiderata of the framework include reusability of component technologies, compatibility with other platforms, and the flexibility for system designers to select from a wide range of existing components and then to synchronize, prioritize, and coordinate them as desired in a convenient way.

Contact: Carolyn Rose


- Child Language Data Exchange System

The CHILDES Project has focused on the construction of a computerized database for the study of child language acquisition. There are currently 230 corpora in the database from 30 different languages. These corpora are composed of transcripts of spontaneous verbal interactions between young children and their parents, playmates, and teachers. Some of the corpora represent detailed longitudinal studies of single children or small groups of children collected across several years. Others represent cross-sectional studies of larger groups of children recorded less frequently. The total size of the transcript database is 2.0 gigabytes. Many of the transcripts are linked to additional audio and video media files that allow researchers to immediately play back the interactions on the level of individual sentences at any point in the transcript. The project maintains a list of 4000 child language researchers and students who have used the database and has records of over 3000 published articles based on the use of these materials. The project has also constructed a set of computer programs that are useful for conducting research into the various levels of language usage, including lexicon, syntax, morphology, phonology, discourse, and narrative.

Contact: Brian MacWhinney

Dynamic Support for Computer Mediated Intercultural Communication

- New Integration of theories and methods from the fields of CSCW and CSCL

Today, people connect with others from around the world in chatrooms, discussion lists, blogs, virtual game communities and other Internet locales. In the work domain, firms are increasingly taking advantage of computer-mediated communication (CMC) tools to establish global teams with members from a diverse set of nations. In education, schools are implementing virtual campuses and immersing students in other cultures. Bridging nations via technology does not, however, guarantee that the cultures of the nations involved are similarly bridged. Mismatches in social conventions, work styles, power relationships and conversational norms can lead to misunderstandings that negatively affect the interaction, relationships among team members, and ultimately the quality of group work. This project seeks to offer a novel, dynamic approach to promoting intercultural communication, adapted from the field of Computer-Supported Collaborative Learning (CSCL), that relies on context-sensitive interventions triggered on an as-needed basis. Specifically, the proposed work focuses on communication problems related to what has been called transactivity, or the extent to which messages in a conversation build on one another in appropriate ways. In the CSCL literature, this communication-oriented approach has been used to tailor interventions for online collaborative learning dialogues. The proposed work extends this approach to the problem of intercultural communication by (a) identifying and categorizing the types of problems that arise in intercultural dialogues and delineating how these problems impact subjective and objective group outcomes; (b) applying machine learning techniques to coded dialogues with the aim of automatically recognizing when problems arise (or are likely to arise) in an intercultural conversation; and (c) developing and testing interventions to improve intercultural communication that can be triggered by this automatic analysis.
These goals are addressed by a combination of laboratory studies of intercultural CMC and machine learning research.

Contact: Carolyn Rose

TagHelper Tools

- Tools for machine learning with text

This project provides a basic resource for researchers who use text processing technology in their work or want to learn about text mining at a basic level. It has been used by a wide range of researchers in fields as diverse as Law, Medicine, Social Sciences, Education, Architecture, and Civil Engineering. It has also been used as a teaching tool in a variety of courses both at Carnegie Mellon University and at other universities. A specific goal of our research is to develop text classification technology to address concerns specific to classifying sentences using coding schemes developed for behavioral research, especially in the area of computer-supported collaborative learning. A particular focus of our work is developing text classification technology that performs well on highly skewed data sets, which is an active area of machine learning research. Another important problem is avoiding overfitting to idiosyncratic features on non-IID datasets. TagHelper Tools has been downloaded over a thousand times in the past 18 months.

Contact: Carolyn Rose

C o m p u t a t i o n a l B i o l o g y

Biological Language Modeling Project

Pattern recognition from protein sequences and automated mapping between sequences, folding structures, and biological functions form a new line of research in which we actively collaborate with biologists.

Contacts: Judith Klein-Seetharaman, Jaime Carbonell, Roni Rosenfeld, Yiming Yang and Raj Reddy

Statistical-Computational Models of Molecular Evolution

Molecular evolution is a stochastic computational process that has been running on massively parallel hardware for some 10^17 seconds now, and which has resulted in many amazing local maxima along the way. The rapidly growing DNA and protein databases present a historic opportunity to model evolution at an unprecedented quantitative level, with enormous impact on medicine as well as on our fundamental understanding of life. In this project we combine statistical and computational methods to derive biological explanations and pharmacological predictions.

Contact: Roni Rosenfeld

Viruses, Vaccines, and Digital Life

Viruses are the simplest known self-replicating computational systems. They also happen to be the leading emerging threat to humanity in the 21st century. Fortunately, the new understanding of life in general and viruses in particular as digital programs opens the door to computational methods of defending against these threats. This is a new project launched in collaboration with leading virologists at the University of Pittsburgh whose aim is to combine biological analysis with statistical learning methods to better understand viral evolution and accelerate vaccine development.

Contact: Roni Rosenfeld

N a t u r a l L a n g u a g e P r o c e s s i n g /

C o m p u t a t i o n a l L i n g u i s t i c s


This project focuses on the construction of grammatical relations taggers for the English, Spanish, Japanese, and Hebrew data in the CHILDES database.

Contact: Brian MacWhinney


- Recombination, aggregation, and visualization of information in newsworthy expressions

The goal of RAVINE is to automatically produce metadata annotating freely available news articles to show statements attributed to various individuals over time, by different reporters. To do this, we will perform semantic processing on these attributions to assist in tracking multiple potentially conflicting positions over time. We are developing an interface to permit querying and browsing large collections of news stories based on this metadata, aiding humans who wish to understand not only what has been reported in the news, but also variation in news accounts around the world.

Contacts: Alan Black and Noah Smith

O t h e r, I n t e r d i s c i p l i n a r y

P r o j e c t s


The AphasiaBank Project focuses on the construction of a computerized database for the study of language processing in aphasia. A consortium of 60 researchers has developed a shared methodological and conceptual framework for the processes of recording, transcription, coding, analysis, and commentary. These methods are based on the TalkBank XML schema and related computational tools for corpus analysis, parsing, and phonological analysis. Our nine specific aims include protocol standardization, database development, analysis customization, measure development, syndrome classification, qualitative analysis, profiles of recovery processes, and the evaluation of treatment effects.

Contact: Brian MacWhinney


The Informedia project aims to understand video and to enable search, visualization, and summarization in both contemporaneous and archival content collections. The core technology combines speech, image, and natural language understanding to automatically transcribe, segment, and index linear video for intelligent search and image retrieval.

Contacts: Howard Wactlar and Alex Hauptmann

Project Listen

- A reading tutor that listens

Project LISTEN's Reading Tutor listens to children read aloud, and then helps them learn to read. This project offers exciting opportunities for interdisciplinary research in speech technologies, cognitive and motivational psychology, human-computer interaction, computational linguistics, artificial intelligence, machine learning, graphic design, and of course reading. Project LISTEN is currently extending its automated Reading Tutor to accelerate children’s progress in fluency, vocabulary, and comprehension.

Contact: Jack Mostow


- The World Wide Knowledge Base Project

The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of this research project is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. If successful, this would lead to much more effective retrieval of information from the web and to the use of this information to support new knowledge-based problem solvers. Our approach is to use machine learning algorithms to train the system to extract information of the desired types. Our web page describes the overall approach, plus several new algorithms we have developed that successfully extract information from the web.

Contact: Tom Mitchell

O l d e r P r o j e c t s


- Real-time analysis of massive structured data

We developed techniques for the indexing of massive incomplete and under-specified data, fast identification of exact and approximate matches among the available data, processing of massive information streams, and real-time identification of both known and surprising patterns.

Contacts: Jaime Carbonell, Eugene Fink, Robert Frederking


In this project, we explored the impact of tutor strategy and example selection on student explanation behavior. The purpose was to identify strategies that make the most productive use of the time students spend with a tutorial dialogue system. We collected a corpus of tutoring dialogues in the calculus domain, in which students discussed worked-out examples (which may or may not have contained an error) with a human tutor. The student reasoned through the worked examples and identified, explained, and corrected errors. As part of this project, we experimented with automatic approaches to corpus analysis, applying and extending approaches used previously for text classification, dialogue act tagging, and automatic essay grading.

Contact: Carolyn Rose


Machine learning has been developed to the point where it can perform some truly useful tasks. However, much of the learning technology that's currently available requires extensive 'tuning' in order to work for any particular user, in the context of any particular task.

The focus of the RADAR project was to build a cognitive assistant embodying machine learning technology able to function "in the wild". By this, we mean that the technology need not be tuned by experts, and that the person using the system need not be trained in any special way. Using the RADAR system itself, in the task for which it was designed, should be enough to allow RADAR to learn to improve performance.

RADAR was a joint project between SRI International and Carnegie Mellon University and was funded by DARPA.

Contacts: Scott Fahlman and Jaime Carbonell

LTI-related RADAR Components:

Space-Time Planner: Contact: Eugene Fink
Knowledge Representation: Contact: Scott Fahlman
Briefing Assistant: Contact: Alex Rudnicky
NLP/email: Contact: Eric Nyberg
Summarization: Contact: Alex Rudnicky

RADAR/Space-Time (Subset of RADAR project)

- Resource management under uncertainty

We built a system for the automated and semi-automated management of office resources, such as office space and equipment. The related research challenges included the representation of uncertain knowledge about available resources, optimization based on uncertain knowledge, elicitation of more accurate data and user preferences, negotiations for office space, learning of user behavior and planning strategies, and collaboration with human administrators.

Contacts: Eugene Fink and Jaime Carbonell


- General-purpose tools for reasoning under uncertainty

We developed general techniques for the representation and analysis of uncertainties in available data, identification of critical uncertainties and missing data, evaluation of their impact on specific conclusions and reasoning tasks, and planning of proactive information gathering.

Contacts: Jaime Carbonell, Eugene Fink, and Anatole Gershman


- Infrastructure for authoring and experimenting with natural language dialogue in tutoring systems and learning research

The focus of this work was to provide an infrastructure that would allow learning researchers to study dialogue in new ways and educational technology researchers to quickly build dialogue-based help systems for their tutoring systems. At the time of this research, we were entering a new phase in which we as a research community had to continue to improve the effectiveness of basic tutorial dialogue technology while also finding ways to accelerate both the process of investigating the effective use of dialogue as a learning intervention and the development of usable tutorial dialogue systems. We developed a community resource to address all three of these problems on a grand scale, building upon prior work developing both basic dialogue technology and tools for rapid development of running dialogue systems.

Contact: Carolyn Rose

Towards Communicating with Dolphins

This project applied aspects of speech technology and machine learning to aid communication with dolphins. Working with Prof. Denise Herzing of the Wild Dolphin Project (http://www.wilddolphinproject.com), who has studied, recorded, and documented dolphin populations over the last 20 years, we looked at automatically identifying dolphins by their signature whistles and classifying other signals, as well as developing more general techniques to aid communication.

Contacts: Robert Frederking, Tanja Schultz, and Alan W Black


LTI is part of the School of Computer Science at Carnegie Mellon University.
This page is maintained by The LTI Webmaster.