Google Research Blog
The latest news from Research at Google
Groundbreaking simulations by Google Exacycle Visiting Faculty
Monday, December 16, 2013
Posted by David Konerding, Staff Software Engineer
In April 2011, we
announced
the
Google Exacycle for Visiting Faculty
, a new academic research awards program donating one billion core-hours of computational capacity to researchers. The Exacycle project
enables massive parallelism for doing science in the cloud
, and inspired multiple
proposals
aiming to take advantage of cloud scale. Today, we would like to share some exciting results from a project built on Google’s infrastructure.
Google Research Scientist
Kai Kohlhoff
, in collaboration with Stanford University and Google engineers, investigated how an important
signalling protein
in the membrane of human cells can switch off and on by changing its three-dimensional structure following a sequence of local
conformational changes
. This research can help to better understand the effects of certain chemical compounds on the human body and assist future development of more potent drug molecules with fewer side effects.
The protein, known as the
beta-2 adrenergic receptor
, is a G protein-coupled receptor (
GPCR
), a primary drug target that plays a role in several debilitating health conditions. These include asthma, type-2 diabetes, obesity, and hypertension. The receptor and its close GPCR relatives bind to many familiar molecules, such as epinephrine, beta-blockers, and caffeine. Understanding their structure, function, and the underlying dynamics during binding and activation increases our chances to decode the causes and mechanisms of diseases.
To gain insights into the receptor’s dynamics, Kai performed detailed molecular simulations using hundreds of millions of core hours on Google’s infrastructure, generating hundreds of terabytes of valuable molecular dynamics data. The Exacycle program enabled the realization of simulations with longer sampling and higher accuracy than previous experiments, exposing the complex processes taking place on the nanoscale during activation of this biological switch.
The paper summarizing the results of Kai’s and his collaborators’ work is featured on the January cover of
Nature Chemistry
, with artwork by Google R&D UX Creative Lead Thor Lewis, to be published on December 17, 2013. The online version of his paper was published on their
website
today.
We are extremely pleased with the results of this program. We look forward to seeing this research continue to develop.
Googler Moti Yung elected as 2013 ACM Fellow
Wednesday, December 11, 2013
Posted by Alfred Spector, VP of Engineering
Yesterday, the Association for Computing Machinery (ACM)
released
the list of those who have been elected ACM Fellows in 2013. I am excited to announce that Google
Research Scientist Moti Yung
is among the distinguished individuals receiving this honor.
Moti was chosen for his contributions to computer science and cryptography that have provided fundamental knowledge to the field of computing security. We are proud of the breadth and depth of his contributions, and believe they serve as motivation for computer scientists worldwide.
On behalf of Google, I congratulate our colleague, who joins the 17 ACM Fellows and other professional society awardees at Google, in exemplifying our extraordinarily talented people. You can read a more detailed summary of Moti’s accomplishments below, including the official citations from ACM.
Dr. Moti Yung: Research Scientist
For contributions to cryptography and its use in security and privacy of systems
Moti has made key contributions to several areas of cryptography including (but not limited to!) secure group communication, digital signatures,
traitor tracing
,
threshold cryptosystems
and
zero knowledge proofs.
Moti's work often seeds a new area in theoretical cryptography and also finds broad applications. For example, in 1992, Moti co-developed a protocol by which users can jointly compute a group key using their own private information that is secure against coalitions of rogue users. This work led to the growth of the broadcast encryption research area and has applications to pay-TV, network communication and sensor networks.
Moti is also a long-time leader of the security and privacy research communities, having mentored many of the leading researchers in the field, and serving on numerous program committees. A prolific author, Moti routinely publishes 10+ papers a year, and has been a key contributor to principled and consistent anonymization practices and data protection at Google.
Free Language Lessons for Computers
Tuesday, December 03, 2013
Posted by Dave Orr, Google Research Product Manager
Not everything that can be counted counts.
Not everything that counts can be counted.
-
William Bruce Cameron
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
But data by itself doesn’t mean much. Data is only valuable in the right context, and only if it leads to increased knowledge. Labeled data is critical to train and evaluate machine-learned systems in many arenas, improving systems that can increase our ability to understand the world. Advances in natural language understanding, information retrieval, information extraction, computer vision, etc. can help us
tell stories
, mine for valuable insights, or
visualize information
in beautiful and compelling ways.
That’s why we are pleased to be able to release sets of labeled data from various domains and with various annotations, some automatic and some manual. Our hope is that the research community will use these datasets in ways both straightforward and surprising, to improve systems for annotation or understanding, and perhaps launch new efforts we haven’t thought of.
Here’s a listing of the major datasets we’ve released in the last year, or you can subscribe to our
mailing list
. Please tell us what you’ve managed to accomplish, or send us pointers to papers that use this data. We want to see what the research world can do with what we’ve created.
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it
: A human-judged dataset of two relations involving public figures on
Wikipedia
: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
Where can I find it
:
https://code.google.com/p/relation-extraction-corpus/
I want to know more
: Here’s a
handy blog post
with a broader explanation, descriptions and examples of the data, and plenty of links to learn more.
11 Billion Clues in 800 Million Documents
What is it
: We took the ClueWeb corpora and automatically labeled concepts and entities with
Freebase concept IDs
, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
Where can I find it
: We released two corpora:
ClueWeb09 FACC
and
ClueWeb12 FACC
.
I want to know more
: We described the process and results in a recent blog post.
Features Extracted From YouTube Videos for Multiview Learning
What is it
: Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Where can I find it
: The data and more information can be obtained from the
UCI machine learning repository (multiview video dataset)
, or from
Google’s repository
.
I want to know more
: Read more about the data and uses for it
here
.
40 Million Entities in Context
What is it
: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions, but unlike the ClueWeb example above, the links are inserted by the web page authors and can therefore be considered human annotation.
Where can I find it
: Here’s the
WikiLinks corpus
, and tools can be found to help use this data on our partner’s page:
UMass Wiki-links
.
I want to know more
: Other disambiguation sets, data formats, ideas for uses of this data, and more can be found at our
blog post announcing the release
.
Distributing the Edit History of Wikipedia Infoboxes
What is it
: The edit history of 1.8 million infoboxes in Wikipedia pages in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Where can I find it
:
Download from Google
or from
Wikimedia Deutschland
.
I want to know more
: We
posted
a detailed look at the data, the process for gathering it, and where to find it. You can also read a
paper
we published on the release.
Note the change in the capital of Palau.
Syntactic Ngrams over Time
What is it
: We automatically syntactically analyzed 350 billion words from the 3.5 million English language books in
Google Books
, and collated and released a set of fragments -- billions of unique tree fragments with counts sorted into types. The underlying corpus is the same one that underlies the recently updated
Google Ngram Viewer
.
Where can I find it
:
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
I want to know more
: We discussed the nature of dependency parses and described the data and release in a
blog post
. We also published a
paper about the release
.
Dictionaries for linking Text, Entities, and Ideas
What is it
: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.
Where can I find it
:
http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more
: A description of the data, several examples, and ideas for uses for it can be found in a
blog post
or in the
associated paper
.
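To make the shape of such a dictionary concrete, here is a minimal sketch of how string-to-concept counts can be turned into a ranked list of candidate concepts. The rows, counts, and function names below are invented for illustration and are not taken from the released data or its file format.

```python
from collections import defaultdict

# Hypothetical toy rows: (anchor string, concept, count).
# The real dataset pairs anchor-text strings with Wikipedia articles;
# these particular rows and counts are made up for this example.
TOY_ROWS = [
    ("jaguar", "Jaguar_(animal)", 60),
    ("jaguar", "Jaguar_Cars", 30),
    ("jaguar", "Jacksonville_Jaguars", 10),
    ("big apple", "New_York_City", 95),
    ("big apple", "Big_Apple_(film)", 5),
]


def build_dictionary(rows):
    """Group counts by string: {string: {concept: count}}."""
    table = defaultdict(dict)
    for string, concept, count in rows:
        table[string][concept] = table[string].get(concept, 0) + count
    return table


def concept_probabilities(table, string):
    """Return (concept, relative frequency) pairs, most frequent first."""
    counts = table.get(string, {})
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(concept, count / total) for concept, count in ranked]


dictionary = build_dictionary(TOY_ROWS)
best_concept, best_prob = concept_probabilities(dictionary, "jaguar")[0]
print(best_concept, best_prob)
```

Relative frequency is only one possible scoring choice; a real entity linker would typically combine it with context.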
Other datasets
Not every release had its own blog post describing it. Here are some other releases:
Automatic
Freebase annotations
of TREC’s Million Query and Web track queries.
A
set of Freebase triples
that have been deleted from Freebase over time -- 63 million of them.
Released Data Set: Features Extracted From YouTube Videos for Multiview Learning
Tuesday, November 26, 2013
Posted by Omid Madani, Senior Software Engineer
“If it looks like a duck, swims like a duck, and quacks like a duck, then it
probably
is a duck.”
-
The “duck test”
Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or
multiple views
of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.
We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled by one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows a gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.
We hope that this dataset will be valuable for research on a variety of multiview related machine learning topics, including multiview clustering, co-training, active learning, classifier fusion and ensembles.
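As an illustration of one of the topics listed above, classifier fusion, the sketch below trains a simple nearest-centroid model per feature family and sums their scores. The feature values, view names, and class labels are invented toy data, and nearest-centroid scoring is just a convenient stand-in, not the method used to build or evaluate the released dataset.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train_view(examples):
    """examples: list of (feature_vector, label). Returns {label: centroid}."""
    by_label = {}
    for vec, label in examples:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def view_scores(model, vec):
    """Higher is better: negative Euclidean distance to each class centroid."""
    return {label: -math.dist(vec, c) for label, c in model.items()}


def fuse_and_predict(models, instance):
    """Late fusion: sum the per-view scores and pick the best label."""
    totals = {}
    for view, model in models.items():
        for label, score in view_scores(model, instance[view]).items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get)


# Hypothetical toy instances with three views each.
train = [
    ({"visual": [0.9, 0.1], "audio": [0.2], "text": [1.0, 0.0]}, "game_a"),
    ({"visual": [0.8, 0.2], "audio": [0.3], "text": [0.9, 0.1]}, "game_a"),
    ({"visual": [0.1, 0.9], "audio": [0.8], "text": [0.0, 1.0]}, "game_b"),
    ({"visual": [0.2, 0.8], "audio": [0.7], "text": [0.1, 0.9]}, "game_b"),
]
views = ("visual", "audio", "text")
models = {v: train_view([(x[v], y) for x, y in train]) for v in views}

# The visual view is ambiguous here; audio and text settle the decision.
query = {"visual": [0.5, 0.5], "audio": [0.75], "text": [0.2, 0.8]}
prediction = fuse_and_predict(models, query)
print(prediction)
```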
The data and more information can be obtained from the
UCI machine learning repository (multiview video dataset)
, or from
here
.
The MiniZinc Challenge
Monday, November 25, 2013
Posted by Jon Orwant, Engineering Manager
Constraint Programming
is a style of problem solving where the properties of a solution are first identified, and a large space of solutions is searched through to find the best. Good constraint programming depends on modeling the problem well, and on searching effectively. Poor representations or slow search techniques can make the difference between finding a good solution and finding no solution at all.
One example of constraint programming is
scheduling
: for instance, determining a schedule for a conference where there are 30 talks (that’s one constraint), only eight rooms to hold them in (that’s another constraint), and some talks can’t overlap (more constraints).
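The conference example can be sketched as a tiny backtracking search. This is an illustrative toy in plain Python, not how or-tools or any MiniZinc solver works internally; the number of time slots (4) and the specific conflicting talk pairs are assumptions added for the example.

```python
def schedule(num_talks, num_rooms, num_slots, conflicts):
    """Assign each talk a (slot, room) by backtracking search.

    conflicts: set of frozenset({a, b}) pairs of talks that must not
    share a time slot. Returns {talk: (slot, room)} or None.
    """
    slot_of = {}               # talk -> slot chosen so far
    load = [0] * num_slots     # number of talks placed in each slot

    def ok(talk, slot):
        # Room constraint: at most one talk per room per slot.
        if load[slot] >= num_rooms:
            return False
        # Overlap constraint: conflicting talks need different slots.
        return all(frozenset((talk, other)) not in conflicts
                   for other, s in slot_of.items() if s == slot)

    def place(talk):
        if talk == num_talks:
            return True
        for slot in range(num_slots):
            if ok(talk, slot):
                slot_of[talk] = slot
                load[slot] += 1
                if place(talk + 1):
                    return True
                # Undo the choice and try the next slot.
                del slot_of[talk]
                load[slot] -= 1
        return False

    if not place(0):
        return None

    # Hand out room numbers within each slot in talk order.
    rooms_used = [0] * num_slots
    result = {}
    for talk in range(num_talks):
        slot = slot_of[talk]
        result[talk] = (slot, rooms_used[slot])
        rooms_used[slot] += 1
    return result


# Hypothetical conflicts: talks 0, 1, 2 share a speaker; so do 3 and 4.
conflicts = {frozenset((0, 1)), frozenset((0, 2)),
             frozenset((1, 2)), frozenset((3, 4))}
plan = schedule(num_talks=30, num_rooms=8, num_slots=4, conflicts=conflicts)
print(plan[0], plan[1], plan[2])
```

Real solvers add propagation, smarter variable ordering, and an objective to optimize; this sketch only finds one feasible assignment.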
Every year, some of the world’s top constraint programming researchers compete for medals in the MiniZinc challenge. Problems range from scheduling to vehicle routing to program verification and frequency allocation.
Google’s open source solver,
or-tools
, took two gold medals and two silver medals. The gold medals were in parallel and portfolio search, and the silver medals were in fixed and free search. Google’s success was due in part to integrating a
SAT
solver to handle boolean constraints, and a new presolve phase inherited from
integer programming
.
Laurent Perron, a member of Google’s Optimization team and a lead contributor to or-tools, noted that every year brings fresh techniques to the competition: “One of the big surprises this year was the success of lazy-clause generation, which combines techniques from the SAT and constraint programming communities.”
If you’re interested in learning more about constraint programming, you can start at the
Wikipedia page
, or have a look at
or-tools
.
The full list of winners is available
here
.
New Research Challenges in Language Understanding
Friday, November 22, 2013
Posted by Maggie Johnson, Director of Education and University Relations
We held the first global Language Understanding and Knowledge Discovery Focused Faculty Workshop in Nanjing, China, on November 14-15, 2013. Thirty-four faculty members joined the workshop, arriving from 10 countries and regions across APAC, EMEA and the US. Googlers from Research, Engineering and University Relations/University Programs also attended the event.
The 2-day workshop included keynote talks, panel discussions and break-out sessions [
agenda
]. It was an engaging and productive workshop, and we saw lots of positive interactions among the attendees. The workshop encouraged communication between Google and faculty around the world working in these areas.
Research in text mining continues to explore open questions relating to entity annotation, relation extraction, and more. The workshop’s goal was to brainstorm and discuss relevant topics to further investigate these areas. Ultimately, this research should help provide users search results that are much more relevant to them.
At the end of the workshop, participants identified four topics representing challenges and opportunities for further exploration in Language Understanding and Knowledge Discovery:
Knowledge representation, integration, and maintenance
Efficient and scalable infrastructure and algorithms for inferencing
Presentation and explanation of knowledge
Multilingual computation
Going forward, Google will be collaborating with academic researchers on a position paper related to these topics. We also welcome faculty interested in contributing to further research in this area to submit a proposal to the
Faculty Research Awards program
. Faculty Research Awards are one-year grants to researchers working in areas of mutual interest.
The faculty attendees responded positively to the focused workshop format, as it allowed time to go in depth into important and timely research questions. Encouraged by their feedback, we are considering similar workshops on other topics in the future.
Unique Strategies for Scaling Teacher Professional Development
Tuesday, November 19, 2013
Posted by Candice Reimers, Senior Program Manager
Research shows
that professional development for educators has a direct, positive impact on students, so it’s no wonder that institutions are eager to explore creative ways to enhance professional development for K-12 teachers. Open source MOOC platforms, such as
Course Builder
, offer the flexibility to extend the reach of standard curriculum; recently, several courses have launched that demonstrate new and creative applications of MOOCs. With their wide reach, participant engagement, and rich content, MOOCs that offer professional development opportunities for teachers bring flexibility and accessibility to an important area.
This summer, the ScratchEd team out of Harvard University launched the
Creative Computing
MOOC, a 6-week, self-paced workshop focused on building computational thinking skills in the classroom. As a MOOC, the course had 2600 participants, who created more than 4700 Scratch projects and engaged in 3500 forum discussions, compared to the “in-person” class held last year, which reached only 50 educators.
Other creative uses of Course Builder for educator professional development come from
National Geographic
and
Annenberg Learner
who joined forces to develop
Water: The Essential Resource
, a course developed around California’s Education and Environment Initiative.
The Friday Institute
’s MOOC,
Digital Learning Transitions
, focused on the benefits of utilizing educational technology and reached educators across 50 states and 68 countries worldwide. The course design included embedded peer support, project-based learning, and case studies; a
post-course survey
showed an overwhelming majority of responders “were able to personalize their own learning experiences” in an “engaging, easy to navigate” curriculum and greatly appreciated the 24/7 access to materials.
In addition to participant surveys, course authors using the Course Builder platform are able to conduct deeper analysis via web analytics and
course data
to assess course effectiveness and make improvements for future courses.
New opportunities to experience professional development MOOCs are rapidly emerging; the University of Adelaide recently announced their
Digital Technology course
to provide professional development for primary school teachers on the
new Australian curriculum
, the Google in Education team just launched
a suite of courses
for teachers using Google technologies, and the Friday Institute
course
that aligns with the U.S. based
Common Core State Standards
is now available.
We’re excited about the innovative approaches underway and the positive impact they can have for students and teachers around the world. We also look forward to seeing new, creative applications of MOOC platforms in new, uncharted territory.
Moore’s Law Part 4: Moore's Law in other domains
Friday, November 15, 2013
This is the last entry of a series focused on Moore’s Law and its implications moving forward, edited from a white paper on Moore’s Law, written by Google University Relations Manager Michel Benard. This series quotes major sources about Moore’s Law and explores how they believe Moore’s Law will likely continue over the course of the next several years. We will also explore if there are fields other than digital electronics that either have an emerging Moore's Law situation, or promises for such a Law that would drive their future performance.
--
The quest
for Moore’s Law
and its potential impact in other disciplines is a journey the technology industry is starting, by crossing the Rubicon from the semiconductor industry to other less explored fields, but with the particular mindset created by Moore’s Law. Our goal is to explore if there are Moore’s Law opportunities emerging in other disciplines, as well as its potential impact. As such, we have interviewed several professors and researchers and asked them if they could see emerging ‘Moore’s Laws’ in their discipline. Listed below are some highlights of those discussions, ranging from CS+ to potentials in the Energy Sector:
Sensors and Data Acquisition
Ed Parsons
, Google Geospatial Technologist
The More than Moore discussion can be extended beyond the main chip, to the board that carries it or to the device that a user is carrying. Greater sensor capabilities (for the measurement of pressure, electromagnetic fields and other local conditions) allow sensors to be included in smartphones, glasses, or other devices, where they perform local data acquisition. This trend is strong, and should allow future devices benefiting from Moore’s Law to receive enough data to run more complex applications.
Metcalfe’s Law
states that the value of a telecommunication network is proportional to the square of the number of connected nodes of the system. This law can be used in parallel to Moore’s Law to evaluate the value of the
Internet of Things
. The network itself can be seen as composed of layers: at the user’s local level (to capture data related to the body of the user, or to immediately accessible objects), locally around the user (such as to get data within the same street as the user), and finally globally (to get data from the global internet). The extrapolation made earlier in this blog (several TB available in flash memory) will lead to the ability to construct, exchange and download/upload entire contexts for a given situation or a given application and use these contexts without intense network activity, or even with very little or no network activity.
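Metcalfe’s Law as stated above is easy to check numerically. In this small sketch the proportionality constant k is arbitrary (an assumption for illustration), and the pair count is included because the number of possible connections between nodes is what grows roughly like the square of the node count.

```python
def metcalfe_value(nodes, k=1.0):
    """Network value under Metcalfe's Law: k * n**2 (k is arbitrary)."""
    return k * nodes ** 2


def pairwise_links(nodes):
    """Number of distinct node pairs, n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2


for n in (10, 20, 40):
    print(n, metcalfe_value(n), pairwise_links(n))

# Doubling the number of connected nodes quadruples the modeled value.
ratio = metcalfe_value(20) / metcalfe_value(10)
print(ratio)
```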
Future of Moore’s Law and its impact on Physics
Sverre Jarp
, CERN
CERN
, and its experiments with the Large Electron-Positron Collider (
LEP
) and Large Hadron Collider (LHC) generate data on the order of a petabyte per year; this data has to be filtered, processed and analyzed in order to find meaningful physics events leading to new discoveries. In this context Moore’s Law has been particularly helpful to allow computing power, storage and networking capabilities at CERN and at other High Energy Physics (
HEP
) centers to scale up regularly. Several generations of hardware and software have been exhausted during the journey from mainframes to today’s clusters.
CERN has a long tradition of collaboration with chip manufacturers, hardware and software vendors to understand and predict next trends in the computing evolution curve. Recent analysis indicates that Moore’s Law will likely continue over the next decade. The statement of ‘several TB of flash memory availability by 2025’ may even be a little conservative according to most recent analysis.
Big Data Visualizations
Katy Börner
, Indiana University
Thanks to Moore’s Law, the amount of data available for any given phenomenon, whether sensed or simulated, has been growing by several orders of magnitude over the past decades. Intelligent sampling can be used to filter out the most relevant bits of information and is practiced in Physics, Astronomy, Medicine and other sciences. Subsequently, data needs to be analyzed and visualized to identify meaningful trends and phenomena, and to communicate them to others.
While most people learn in school how to read charts and maps, many never learn how to read a network layout—data literacy remains a challenge. The
Information Visualization Massive Open Online Course (MOOC)
at Indiana University teaches students from more than 100 countries how to read but also how to design meaningful network, topical, geospatial, and temporal visualizations. Using the tools introduced in this free course anyone can analyze, visualize, and navigate complex data sets to understand patterns and trends.
Candidate for Moore’s Law in Energy
Professor Francesco Stellacci
, EPFL
It is currently hard to see a “Moore’s Law” applying to candidates in energy technology. Nuclear fusion could hold some positive surprises, if several significant breakthroughs are found in the process of creating usable energy with this technique. For any other technology the technological growth will be slower. The best solar cells of today have a 30% efficiency, which could scale higher of course (obviously not much more than a factor of 3). Also, cost could be driven down by an order of magnitude. Best estimates show, however, a combined performance improvement by a factor of 30 over many years.
Further Discussion of Moore’s Law in Energy
Ross Koningstein
, Google Director Emeritus
As of today there is no obvious Moore’s Law in the Energy sector that could decrease some major costs by 50% every 18 months. However, material properties at the nanoscale and chemical processes such as
catalysis
are being investigated and could lead to promising results. Applications targeted are
hydrocarbon
creation at scale and improvement of
oil refinery processes
, where breakthroughs in micro/nano property catalysts are pursued. Hydrocarbons are much more compatible at scale with the existing automotive/aviation and natural gas distribution systems. Here in California,
Google Ventures
has invested in
Cool Planet Energy Systems
, a company with neat technology that can convert biomass to gasoline/jet fuel/diesel with impressive efficiency.
One of the challenges is the ability to run many experiments at low cost per experiment, instead of only a few expensive experiments per year. Discoveries are likely to happen faster if more experiments are conducted. This requires heavier investments, which are difficult to make within slim-margin businesses. Therefore the nurturing of disruptive businesses is likely to come from new players, alongside those existing players that decide to fund significant new investments.
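The figures quoted in the two energy discussions above can be put side by side with a little arithmetic. The sketch below multiplies the quoted bounds (about a factor of 3 in efficiency and an order of magnitude in cost) and, purely as a hypothetical comparison, computes how long a factor of 30 would take at a pace of 50% every 18 months. The function name and the framing of the comparison are assumptions of this example, not part of the interviews.

```python
import math

efficiency_gain = 3   # upper bound quoted for solar cell efficiency
cost_gain = 10        # "an order of magnitude" cost reduction
combined_gain = efficiency_gain * cost_gain
print(combined_gain)  # 30, matching the quoted combined factor


def years_to_reach(factor, halving_period_years=1.5):
    """Years needed to improve by `factor` if cost halves every period."""
    return math.log2(factor) * halving_period_years


# At a 50%-every-18-months pace, a factor of 30 takes a bit over 7 years.
moore_pace_years = years_to_reach(combined_gain)
print(round(moore_pace_years, 1))
```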
Of course, these discussions could be opened for many other sectors. The opportunities for more discourse on the impact and future of Moore’s Law on CS and other disciplines are abundant, and can be continued with your comments on the
Research at Google Google+ page
. Please join, and share your thoughts.
The first detailed maps of global forest change
Thursday, November 14, 2013
Posted by Matt Hansen and Peter Potapov, University of Maryland; Rebecca Moore and Matt Hancher, Google
Most people are familiar with exploring images of the Earth’s surface in Google Maps and Earth, but of course there’s more to satellite data than just pretty pictures. By applying algorithms to time-series data it is possible to quantify global land dynamics, such as forest extent and change. Mapping global forests over time not only enables many science applications, such as climate change and biodiversity modeling efforts, but also informs policy initiatives by providing objective data on forests that are ready for use by governments, civil society and private industry in improving forest management.
In a collaboration led by researchers at the University of Maryland, we built a new map product that quantifies global forest extent and change from 2000 to 2012. This product is the first of its kind, a global 30 meter resolution thematic map of the Earth’s land surface that offers a consistent characterization of forest change at a resolution that is high enough to be locally relevant as well. It captures myriad forest dynamics, including fires, tornadoes, disease and logging.
Global 30 meter resolution thematic maps of the Earth’s land surface: Landsat composite reference image (2000), summary map of forest loss, extent and gain (2000-2012), individual maps of forest extent, gain, loss, and loss color-coded by year.
The satellite data came from the Enhanced Thematic Mapper Plus (ETM+) sensor onboard the NASA/USGS
Landsat 7
satellite. The expertise of NASA and USGS, from satellite design to operations to data management and delivery, is critical to any earth system study using Landsat data. For this analysis, we processed over 650,000 ETM+ images in order to characterize global forest change.
Key to the study’s success was the collaboration between remote sensing scientists at the University of Maryland, who developed and tested models for processing and characterizing the Landsat data, and computer scientists at Google, who oversaw the implementation of the final models using Google’s Earth Engine computation platform.
Google Earth Engine
is a massively parallel technology for high-performance processing of geospatial data, and houses a copy of the entire Landsat image catalog. For this study, a total of 20 terapixels of Landsat data were processed using one million CPU-core hours on 10,000 computers in parallel, in order to characterize year 2000 percent tree cover and subsequent tree cover loss and gain through 2012. What would have taken a single computer 15 years to perform was completed in a matter of days using Google Earth Engine computing.
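The compute figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes, for illustration only, that the work was spread evenly and that each of the 10,000 computers contributed a single core running continuously; the post does not state the actual core counts or scheduling.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760; leap years ignored

core_hours = 1_000_000      # from the post
computers = 10_000          # from the post
single_machine_years = 15   # from the post

# Even split of the work across machines.
core_hours_per_computer = core_hours / computers
print(core_hours_per_computer)             # 100.0

# If each machine ran one core continuously: roughly 4.2 days.
days_if_one_core_each = core_hours_per_computer / 24
print(round(days_if_one_core_each, 1))

# Core count implied for the "single computer, 15 years" comparison:
# roughly 7.6 cores running non-stop.
implied_cores = core_hours / (single_machine_years * HOURS_PER_YEAR)
print(round(implied_cores, 1))
```

Under these assumptions the numbers are consistent with "a matter of days" on the cluster versus 15 years on one multi-core machine.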
Global forest loss totaled 2.3 million square kilometers and gain 0.8 million square kilometers from 2000 to 2012. Among the many results is the finding that tropical forest loss is increasing with an average of 2,101 additional square kilometers of forest loss per year over the study period. Despite the reduction in Brazilian deforestation over the study period, increasing rates of forest loss in countries such as Indonesia, Malaysia, Tanzania, Angola, Peru and Paraguay resulted in a statistically significant trend in increasing tropical forest loss. The maps and statistics from this study fill an information void for many parts of the world. The results can be used as an initial reference for countries lacking such information, as a spur to capacity building in such countries, and as a basis of comparison in evolving national forest monitoring methods. Additionally, we hope it will enable further science investigations ranging from the evaluation of the integrity of protected areas to the economic drivers of deforestation to carbon cycle modeling.
The Chaco woodlands of Bolivia, Paraguay and Argentina are under intensive pressure from agroindustrial development. Paraguay’s Chaco woodlands within the western half of the country are experiencing rapid deforestation in the development of cattle ranches. The result is the highest rate of deforestation in the world.
Global map of forest change:
http://earthenginepartners.appspot.com/science-2013-global-forest
If you are curious to learn more, tune in next Monday, November 18 to a live-streamed, online presentation and demonstration by Matt Hansen and colleagues from UMD, Google, USGS, NASA and the Moore Foundation:
Live-stream Presentation: Mapping Global Forest Change
Live online presentation and demonstration, followed by Q&A
Monday, November 18, 2013 at 1pm EST, 10am PST
Link to live-streamed event:
http://goo.gl/JbWWTk
Please submit questions here:
http://goo.gl/rhxK5X
For further results and details of this study, see
High-Resolution Global Maps of 21st-Century Forest Cover Change
in the November 15th issue of the journal Science.
Moore’s Law, Part 3: Possible extrapolations over the next 15 years and impact
Wednesday, November 13, 2013
This is the third entry of a series focused on Moore’s Law and its implications moving forward, edited from a white paper on Moore’s Law, written by Google University Relations Manager Michel Benard. This series quotes major sources about Moore’s Law and explores how they believe Moore’s Law will likely continue over the course of the next several years. We will also explore if there are fields other than digital electronics that either have an emerging Moore's Law situation, or promises for such a Law that would drive their future performance.
--
More Moore
We examine data from the ITRS 2012
Overall Roadmap Technology Characteristics
(ORTC 2012), and select notable interpolations. The chart below shows chip size trends up to the year 2026 along with the “Average Moore’s Law” line. Additionally, in the
ORTC 2011 tables
we find data on 3D chip layer increases (up to 128 layers), including costs. Finally, the ORTC 2011 index sheet estimates that the
DRAM
cost per bit at production will be ~0.002 microcents per bit by ~2025. From these sources we draw three More Moore (MM) extrapolations, that by the year 2025:
4Tb Flash
multi-level cell
(MLC) memory will be in production
There will be ~100 billion transistors per microprocessing unit (MPU)
1TB RAM Memory will cost less than $100
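The DRAM extrapolation can be sanity-checked with a few lines of arithmetic. The sketch below is ours, not from the ITRS tables: it assumes a decimal terabyte (1 TB = 8 × 10^12 bits) and treats the roadmap's rounded ~0.002 microcents per bit as if it were exact.

```python
# Back-of-envelope check of the DRAM cost extrapolation.
# Assumptions (not from the ITRS tables): decimal terabyte, exact per-bit cost.
MICROCENTS_PER_DOLLAR = 100 * 1_000_000   # 100 cents/dollar, 1e6 microcents/cent
BITS_PER_TB = 8 * 10**12                  # 1 TB = 10^12 bytes = 8 * 10^12 bits


def cost_per_tb_dollars(microcents_per_bit: float) -> float:
    """Cost in dollars of one terabyte at a given per-bit cost."""
    return microcents_per_bit * BITS_PER_TB / MICROCENTS_PER_DOLLAR


def microcents_per_bit_for(budget_dollars: float) -> float:
    """Per-bit cost (in microcents) at which one terabyte costs budget_dollars."""
    return budget_dollars * MICROCENTS_PER_DOLLAR / BITS_PER_TB


if __name__ == "__main__":
    print(cost_per_tb_dollars(0.002))     # cost of 1 TB at the rounded roadmap value
    print(microcents_per_bit_for(100.0))  # per-bit cost implied by a $100 terabyte
```

Under these assumptions the rounded roadmap value works out to roughly $160 per terabyte, the same order of magnitude as the $100 figure; a price below $100 corresponds to a per-bit cost under about 0.00125 microcents.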
More than Moore
It should be emphasized that “More than Moore” (MtM) technologies do not constitute an alternative or even a competitor to the digital trend as described by Moore’s Law. In fact, it is the heterogeneous integration of digital and non-digital functionalities into compact systems that will be the key driver for a wide variety of application fields. Whereas MM may be viewed as the brain of an intelligent compact system, MtM refers to its capabilities to interact with the outside world and the users.
As such, functional diversification may be regarded as a complement of digital signal and data processing in a product. This includes the interaction with the outside world through sensors and actuators and the subsystem for powering the product, implying analog and mixed signal processing, the incorporation of passive and/or high-voltage components, micro-mechanical devices enabling biological functionalities, and more. While MtM looks very promising for a variety of diversification topics, the ITRS study does not give figures from which “solid” extrapolations can be made. However, we can make safe/not so safe bets going towards 2025, and examine what these extrapolations mean in terms of the user.
Today we have 1TB hard disk drives (HDD) for $100, but the access speed to data on the disk does not allow users to take full advantage of this data in a fully interactive, or even practical, way. More importantly, the size and construction of HDDs do not allow for their incorporation into mobile devices. Solid state drives (SSD), in comparison, have similar data transfer rates (~1Gb/s), latencies typically 100 times lower than those of HDDs, and a significantly smaller form factor with no moving parts. The promise of offering several TB of flash memory, cost effectively by 2025, in a device carried along during the day (e.g. smartphone, watch, clothing, etc.) represents a paradigm shift relative to today’s situation; it will empower users by moving them from an environment where local data needs to be refreshed frequently (as with augmented reality applications) to a new environment where full contextual data will be available locally and refreshed only when critically needed.
If data on the order of terabytes is pre-loaded, one will be able to get a complete contextual data set loaded before an action or a movement, and the device will dispatch its local intelligence to the user during the progress of the action, regardless of network availability or performance. This opens up the possibility of combining local 3D models and remote inputs, allowing applications like 3D conferencing to become available. The development and use of 3D avatars could even facilitate many social interaction models. To benefit from such applications the use of personal devices such as Google Glass may become pervasive, allowing users to navigate 3D scenes and environments naturally, as well as facilitating 3D conferencing and their “social” interactions.
The opportunities for more discourse on the impact and future of Moore’s Law on CS and other disciplines are abundant, and can be continued with your comments on the
Research at Google Google+ page
. Please join, and share your thoughts.
Moore’s Law, Part 2: More Moore and More than Moore
Tuesday, November 12, 2013
This is the second entry of a series focused on Moore’s Law and its implications moving forward, edited from a white paper on Moore’s Law, written by Google University Relations Manager Michel Benard. This series quotes major sources about Moore’s Law and explores how they believe Moore’s Law will likely continue over the course of the next several years. We will also explore whether there are fields other than digital electronics that either have an emerging Moore's Law situation, or promises for such a Law that would drive their future performance.
--
One of the fundamental lessons derived from the past successes of the semiconductor industry comes from the observation that most of the innovations of the past ten years, those that indeed have revolutionized the way CMOS transistors are manufactured nowadays, were initiated 10–15 years before they were incorporated into the CMOS process. Strained silicon research began in the early 90s, high-κ/metal-gate initiated in the mid-90s and multiple-gate transistors were pioneered in the late 90s. This fundamental observation generates a simple but fundamental question: “What should the ITRS do to identify now what the extended semiconductor industry will need 10–15 years from now?”
-
International Technology Roadmap for Semiconductors 2012
More Moore
As we look at the years 2020–2025, we can see that the physical dimensions of
CMOS
manufacture are expected to be crossing below the 10 nanometer threshold. It is expected that as dimensions approach the 5–7 nanometer range it will be difficult to operate any transistor structure that is utilizing the metal-oxide semiconductor (MOS) physics as the basic principle of operation. Of course, we expect that new devices, like the
very promising tunnel transistors
, will allow a smooth transition from traditional CMOS to this new class of devices to reach these new levels of miniaturization. However, it is becoming clear that fundamental geometrical limits will be reached in the above timeframe. By fully utilizing the vertical dimension, it will be possible to
stack layers of transistors
on top of each other, and this 3D approach will continue to increase the number of components per square millimeter even when horizontal physical dimensions will no longer be amenable to any further reduction. It seems important, then, that we ask ourselves a fundamental question: “How will we be able to increase the computation and memory capacity when the device physical limits will be reached?” It becomes necessary to re-examine how we can get more information in a finite amount of space.
The semiconductor industry has thrived on
Boolean logic
; after all, for most applications the CMOS devices have been used as nothing more than an “on-off” switch. Consequently, it becomes of paramount importance to develop new techniques that allow the use of multiple (i.e., more than 2) logic states in any given and finite location, which evokes the magic of “
quantum computing
” looming in the distance. However, short of reaching this goal, a field of active research involves
increasing the number of states
available, e.g. 4–10 states, and increasing the number of “virtual transistors” by 2 every 2 years.
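To make the benefit of multiple logic states concrete: a storage location with n distinguishable states carries log2(n) bits, so fewer physical locations are needed for the same amount of information. The following is a minimal sketch of that relationship; the function names are illustrative and do not come from the ITRS.

```python
import math


def bits_per_cell(states: int) -> float:
    """Information capacity, in bits, of one location with `states` levels."""
    if states < 2:
        raise ValueError("a cell needs at least two distinguishable states")
    return math.log2(states)


def cells_needed(total_bits: int, states: int) -> int:
    """Smallest number of cells whose combined states cover 2**total_bits values."""
    if states < 2:
        raise ValueError("a cell needs at least two distinguishable states")
    target = 2 ** total_bits
    cells, capacity = 0, 1
    while capacity < target:      # exact integer arithmetic, no rounding error
        capacity *= states
        cells += 1
    return cells


if __name__ == "__main__":
    for n in (2, 4, 10):
        print(n, round(bits_per_cell(n), 2), cells_needed(64, n))
```

For a 64-bit word, binary cells need 64 locations, 4-state cells need 32, and 10-state cells need 20, which is the sense in which extra states act like additional "virtual transistors" in the same area.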
More than Moore
During the blazing progress propelled by Moore’s Law of semiconductor logic and memory products, many “complementary” technologies have progressed as well, although not necessarily scaling to Moore’s Law. Heterogeneous integration of multiple technologies has generated “added value” to devices with multiple applications, beyond the traditional semiconductor logic and memory products that had led the semiconductor industry from the mid 60s to the 90s. A variety of wireless devices contain typical examples of this confluence of technologies, e.g. logic and memory devices, display technology, microelectromechanical systems (
MEMS
), RF and Analog/Mixed-signal technologies (
RF/AMS
), etc.
The ITRS has incorporated More than Moore and RF/AMS chapters in the main body of the ITRS, but is uncertain whether this is sufficient to encompass the plethora of associated technologies now entangled into modern products, or the multi-faceted public consumer who has become an influential driver of the semiconductor industry, demanding custom functionality in commercial electronic products. In the next blog of this series, we will examine select data from the
ITRS Overall Roadmap Technology Characteristics (ORTC) 2012
and attempt to extrapolate the progress in the next 15 years, and its potential impact.
The opportunities for more discourse on the impact and future of Moore’s Law on CS and other disciplines are abundant, and can be continued with your comments on the
Research at Google Google+ page
. Please join, and share your thoughts.
Moore’s Law, Part 1: Brief history of Moore's Law and current state
Monday, November 11, 2013
This is the first entry of a series focused on Moore’s Law and its implications moving forward, edited from a white paper on Moore’s Law, written by Google University Relations Manager Michel Benard. This series quotes major sources about Moore’s Law and explores how they believe Moore’s Law will likely continue over the course of the next several years. We will also explore whether there are fields other than digital electronics that either have an emerging Moore's Law situation, or promises for such a Law that would drive their future performance.
---
Moore's Law is the observation that over the
history of computing hardware
, the number of transistors on integrated circuits doubles approximately every two years. The period often quoted as "18 months" is due to Intel executive David House, who predicted that period for a doubling in chip performance (being a combination of the effect of more transistors and their being faster).
-
Wikipedia
Moore’s Law is named after Intel co-founder
Gordon E. Moore
, who described the trend in his
1965 paper
. In it, Moore noted that the number of components in integrated circuits had doubled every year from the invention of the integrated circuit in 1958 until 1965 and predicted that the trend would continue "for at least ten years". Moore’s prediction has proven to be uncannily accurate, in part because the law is now used in the semiconductor industry to guide long-term planning and to set targets for research and development.
The capabilities of many digital electronic devices are strongly linked to Moore's law: processing speed, memory capacity, sensors and even the number and size of
pixels in digital cameras
. All of these are improving at (roughly) exponential rates as well (see
Other formulations and similar laws
). This exponential improvement has dramatically enhanced the impact of digital electronics in nearly every segment of the
world economy
, and is a driving force of technological and social change in the late 20th and early 21st centuries.
Most improvement trends have resulted principally from the industry’s ability to exponentially decrease the minimum feature sizes used to fabricate integrated circuits. Of course, the most frequently cited trend is in integration level, which is usually expressed as Moore’s Law (that is, the number of components per chip doubles roughly every 24 months). The most significant trend is the decreasing cost-per-function, which has led to significant improvements in economic productivity and overall quality of life through proliferation of computers, communication, and other industrial and consumer electronics.
Transistor counts for integrated circuits plotted against their dates of introduction. The curve shows Moore's law - the doubling of transistor counts every two years. The y-axis is logarithmic, so the line corresponds to exponential growth
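The doubling periods mentioned in this post (the often-quoted 18 months, the usual 24 months, and the three-year period projected by the 2010 ITRS update) lead to very different growth over a decade. The sketch below simply evaluates 2^(t/T) for elapsed time t and doubling period T; the starting count in the usage lines is an arbitrary illustrative value.

```python
import math


def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if the count doubles every period."""
    return 2 ** (years / doubling_period_years)


def log2_count(initial_count: float, years: float,
               doubling_period_years: float) -> float:
    """log2 of the projected count; linear in `years`, hence a straight
    line on a logarithmic y-axis."""
    return math.log2(initial_count) + years / doubling_period_years


if __name__ == "__main__":
    for period in (1.5, 2.0, 3.0):
        print(period, round(growth_factor(10, period), 1))
    # Equal time steps add equal amounts on the log scale.
    print(log2_count(1000, 4, 2) - log2_count(1000, 2, 2))
```

Over ten years this gives roughly a 100-fold increase at 18 months, 32-fold at two years, and about 10-fold at three years, which is why the choice of doubling period matters so much for long-range extrapolation.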
All of these improvement trends, sometimes called “scaling” trends, have been enabled by large R&D investments. In the last three decades, the growing size of the required investments has motivated industry collaboration and spawned many R&D partnerships, consortia, and other cooperative ventures. To help guide these R&D programs, the Semiconductor Industry Association (SIA) initiated the National Technology Roadmap for Semiconductors (
NTRS
) in 1992. Since its inception, a basic premise of the NTRS has been that continued scaling of electronics would further reduce the cost per function and promote market growth for integrated circuits. Thus, the Roadmap has been put together in the spirit of a challenge—essentially, “What technical capabilities need to be developed for the industry to stay on Moore’s Law and the other trends?”
In 1998, the SIA was joined by corresponding industry associations in Europe, Japan, Korea, and Taiwan to participate in a 1998 update of the Roadmap and to begin work toward the first International Technology Roadmap for Semiconductors (
ITRS
), published in 1999. The overall objective of the ITRS is to present industry-wide consensus on the “best current estimate” of the industry’s research and development needs out to a 15-year horizon. As such, it provides a guide to the efforts of companies, universities, governments, and other research providers or funders. The ITRS has improved the quality of R&D investment decisions made at all levels and has helped channel research efforts to areas that most need research breakthroughs.
For more than half a century these scaling trends continued, and
sources in 2005
expected them to continue until at least 2015 or 2020. However, the
2010 update to the ITRS
has growth slowing at the end of 2013, after which time transistor counts and densities are to double only every three years. Accordingly, since 2007 the ITRS has addressed the concept of functional diversification under the title “
More than Moore
” (MtM). This concept addresses an emerging category of devices that incorporate functionalities that do not necessarily scale according to “Moore's Law,” but provide additional value to the end customer in different ways.
The MtM approach typically allows for the non-digital functionalities (e.g., RF communication, power control, passive components, sensors, actuators) to migrate from the system board-level into a particular package-level (
SiP
) or chip-level (
SoC
) system solution. It is also hoped that by the end of this decade, it will be possible to augment the technology of constructing integrated circuits (
CMOS
) by introducing new devices that will realize some “beyond CMOS” capabilities. However, since these new devices may not totally replace CMOS functionality, it is anticipated that either chip-level or package level integration with CMOS may be implemented.
The ITRS provides a very comprehensive analysis of the perspective for Moore’s Law when looking towards 2020 and beyond. The analysis can be roughly segmented into two trends: More Moore (MM) and More than Moore (MtM). In the next blog in this series, we will look into the recent conclusions mentioned in the ITRS 2012 report on both trends.
The opportunities for more discourse on the impact and future of Moore’s Law on CS and other disciplines are abundant, and can be continued with your comments on the
Research at Google Google+ page
. Please join, and share your thoughts.
Enhancing Linguistic Search with the Google Books Ngram Viewer
Thursday, October 17, 2013
Posted by Slav Petrov and Dipanjan Das, Research Scientists
Our book scanning effort, now in its eighth year, has put tens of millions of books online. Beyond the obvious benefits of being able to discover books and search through them, the project lets us take a step back and learn what the entire collection tells us about culture and language.
Launched in 2010 by Jon Orwant and Will Brockman, the Google Books Ngram Viewer lets you search for words and phrases over the centuries, in English, Chinese, Russian, French, German, Italian, Hebrew, and Spanish. It’s become popular for both casual explorations into language usage and serious linguistic research, and this summer we decided to provide some new ways to search with it.
With our interns Jason Mann, Lu Yang, and David Zhang, we’ve added three new features. The first is wildcards: by putting an asterisk as a placeholder in your query, you can retrieve the ten most popular replacements. For instance,
what noun most often follows “Queen” in English fiction
? The answer is “Elizabeth”:
This graph also reveals that the frequency of mentions of the most popular queens has been decreasing steadily over time. (Language expert Ben Zimmer shows some other interesting examples in
his Atlantic article
.) Right-clicking collapses all of the series into a sum, allowing you to see the overall change.
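Conceptually, a wildcard query ranks every word that can fill the placeholder and keeps the most frequent ones. The sketch below illustrates that idea on a toy table of bigram counts; it is not the Ngram Viewer's implementation, and the counts are made-up placeholder values.

```python
from collections import Counter


def top_replacements(bigram_counts, first_word, n=10):
    """Return the n most frequent words following `first_word`,
    i.e. the expansions of the query '<first_word> *'."""
    matches = Counter({
        second: count
        for (first, second), count in bigram_counts.items()
        if first == first_word
    })
    return matches.most_common(n)


def collapsed_total(series):
    """Collapse all expansions into one sum, like the right-click action."""
    return sum(count for _, count in series)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    toy_counts = {
        ("Queen", "Elizabeth"): 120,
        ("Queen", "Victoria"): 95,
        ("Queen", "Mary"): 60,
        ("King", "George"): 80,
    }
    top = top_replacements(toy_counts, "Queen", n=2)
    print(top, collapsed_total(top))
```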
Another feature we’ve added is the ability to search for inflections: different grammatical forms of the same word. (Inflections of the verb “eat” include “ate”, “eating”, “eats”, and “eaten”.) Here, we can see that
the phrase “changing roles” has recently surged in popularity in English fiction
, besting “change roles”, which earlier dethroned “changed roles”:
Curiously, this switching doesn’t happen
when we add non-fiction into the mix
: “changing roles” is persistently on top, with an odd dip in the late 1980s. As with wildcards, right-clicking collapses and expands the data:
Finally, we’ve implemented the most common feature request from our users: the ability to search for multiple capitalization styles simultaneously. Until now, searching for
common capitalizations of “Mother Earth”
required using a plus sign to combine ngrams (e.g., “Mother Earth + mother Earth + mother earth”), but now the case-insensitive checkbox makes it easier:
As with our other two features, right-clicking toggles whether the variants are shown.
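The case-insensitive option is equivalent to summing the series of every capitalization variant, which is what the plus-sign query did by hand. A minimal sketch of that aggregation follows, assuming per-year counts stored in a plain dictionary; the data layout and numbers are hypothetical, not the Ngram Viewer's internal format.

```python
from collections import defaultdict


def combine_case_variants(yearly_counts, phrase):
    """Sum per-year counts over all ngrams equal to `phrase` ignoring case.

    yearly_counts maps ngram -> {year: count}.
    """
    wanted = phrase.lower()
    combined = defaultdict(int)
    for ngram, series in yearly_counts.items():
        if ngram.lower() == wanted:
            for year, count in series.items():
                combined[year] += count
    return dict(combined)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    toy = {
        "Mother Earth": {1990: 10, 2000: 14},
        "mother Earth": {1990: 2},
        "mother earth": {1990: 5, 2000: 7},
        "Mother Nature": {1990: 30},
    }
    print(combine_case_variants(toy, "mother earth"))
```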
We hope these features help you discover and share interesting trends in language use!
Opening up Course Builder data
Wednesday, October 09, 2013
Posted by John Cox and Pavel Simakov, Course Builder Team, Google Research
Course Builder
is an experimental, open source platform for delivering massive online open courses. When you run Course Builder, you own everything from the production instance to the student data that builds up while your course is running.
Part of being open is making it easy for you to access and work with your data. Earlier this year we shipped a tool called ETL (short for extract-transform-load) that you can use to pull your data out of Course Builder, run arbitrary computations on it, and load it back. We
wrote a post
that goes into detail on how you can use ETL to get copies of your data in an open, easy-to-read format, as well as write custom jobs for processing that data offline.
Now we’ve taken the next step and added richer data processing tools to ETL. With them, you can
build data processing pipelines
that analyze large datasets with MapReduce. Inside Google we’ve used these tools to
learn from the courses we’ve run
. We provide example pipelines ranging from the simple to the complex, along with formatters to convert your data into open formats (CSV, JSON, plain text, and XML) that play nice with third-party data analysis tools.
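For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a map/reduce-style job followed by export to CSV and JSON. It uses only the Python standard library and does not use Course Builder's ETL or MapReduce APIs; the record fields (`student_id`, `unit`, `score`) are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict


def map_record(record):
    """Map step: emit (unit, score) for one hypothetical assessment record."""
    yield record["unit"], record["score"]


def reduce_scores(unit, scores):
    """Reduce step: summarize all scores emitted for one unit."""
    return {"unit": unit, "students": len(scores),
            "mean_score": sum(scores) / len(scores)}


def run_pipeline(records):
    """Group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return [reduce_scores(key, grouped[key]) for key in sorted(grouped)]


def to_csv(rows):
    """Format result rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["unit", "students", "mean_score"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Format result rows as a JSON array."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    sample = [
        {"student_id": "s1", "unit": 1, "score": 80},
        {"student_id": "s2", "unit": 1, "score": 90},
        {"student_id": "s1", "unit": 2, "score": 70},
    ]
    results = run_pipeline(sample)
    print(to_csv(results))
    print(to_json(results))
```

In a real MapReduce job the grouping step is performed by the framework across many workers; the in-memory dictionary here only stands in for that shuffle phase.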
We hope that adding robust data processing features to Course Builder will not only provide direct utility to organizations that need to process data to meet their internal business goals, but also make it easier for educators and researchers to gauge the efficacy of the massive online open courses run on the Course Builder platform.
Projecting without a projector: sharing your smartphone content onto an arbitrary display
Thursday, September 26, 2013
Posted by Yang Li, Research Scientist, Google Research
Previously, we presented
Deep Shot
, a system that allows a user to “capture” an application (such as Google Maps) running on a remote computer monitor via a smartphone camera and bring the application on the go. Today, we’d like to discuss how we support the opposite process, i.e., transferring mobile content to a remote display, again using the smartphone camera.
Although the computing power of today’s mobile devices grows at an accelerated rate, the form factor of these devices remains small, which constrains both the input and output bandwidth for mobile interaction. To address this issue, we investigated how to enable users to leverage nearby IO resources to operate their mobile devices. As part of the effort, we developed
Open Project
, an end-to-end framework that allows a user to “project” a native mobile application onto an arbitrary display using a smartphone camera, leveraging interaction spaces and input modality of the display. The display can range from a PC or laptop monitor, to a home Internet TV and to a public wall-sized display. Via an intuitive, projection-based metaphor, a user can easily share a mobile application by projecting it onto a target display.
Open Project is an open, scalable, web-based framework for enabling mobile sharing and collaboration. It can make any computer display projectable instantaneously and without deployment. Developers can add support for Open Project in native mobile apps by simply linking a library, requiring no additional hardware or sensors. Our user participants responded highly positively to Open Project-enabled applications for mobile sharing and collaboration.
Broadening Google Patents
Tuesday, September 17, 2013
Posted by Jon Orwant, Engineering Manager
Cross-posted with the
US Public Policy Blog
, the
European Public Policy Blog
, and
Inside Search Blog
.
Last year, we launched two improvements to
Google Patents
: the
Prior Art Finder
and European Patent Office (EPO) patents. Today we’re happy to announce the addition of documents from four new patent agencies: China, Germany, Canada, and the World Intellectual Property Organization (WIPO). Many of these documents may provide prior art for future patent applications, and we hope their increased discoverability will improve the quality of patents in the U.S. and worldwide.
So if you want to learn about a
Chinese dual-drive bicycle
, a
German valve for inflating bicycle tires
, attach a
Canadian trailer to your bike
, or read the
WIPO application for pedalling with one leg
, those and millions of other inventions are now available on Google Patents.
Thanks to
Google Translate
, all patents are available in both their original languages and in English, and you can search across the world’s patents using terms in any of those languages. When there are multiple submission languages, you can move between them with a single click on the tabs at the top of the page, as shown in the screenshot below:
Happy patent searching!
We are joining the Open edX platform
Tuesday, September 10, 2013
Posted by Dan Clancy, Director of Research
A year ago, we released
Course Builder
, an experimental platform for online education at scale. Since then, individuals have created courses on everything from game theory to philanthropy, offered to curious people around the world. Universities and non-profit organizations have used the platform to experiment with MOOCs, while maintaining direct relationships with their participants. Google has published a number of courses including
Introduction to Web Accessibility
which opens for registration today. This platform is helping to deliver on our goal of making education more accessible through technology, and enabling educators to easily teach at scale on top of cloud platform services.
Today, Google will begin working with
edX
as a contributor to the open source platform, Open edX. We are taking our learnings from Course Builder and applying them to Open edX to further innovate on an open source MOOC platform. We look forward to contributing to edX’s new site, MOOC.org, a new service for online learning which will allow any academic institution, business and individual to create and host online courses.
Google and edX have a shared mission to broaden access to education, and by working together, we can advance towards our goals much faster. In addition, Google, with its breadth of applicable infrastructure and research capabilities, will continue to make contributions to the online education space,
the findings of which
will be shared directly with the online education community and the Open edX platform.
We support the development of a diverse education ecosystem, as learning expands in the online world. Part of that means that educational institutions should easily be able to bring their content online and manage their relationships with their students. Our industry is in the early stages of MOOCs, and lots of experimentation is still needed to find the best way to meet the educational needs of the world. An open ecosystem with multiple players encourages rapid experimentation and innovation, and we applaud the work going on in this space today.
We appreciate the community that has grown around the Course Builder open source project. We will continue to maintain Course Builder, but are focusing our development efforts on Open edX, and look forward to seeing edX’s MOOC.org platform develop. In the future, we will provide an upgrade path to Open edX and MOOC.org from Course Builder. We hope that our continued contributions to open source education projects will enable anyone who builds online education products to benefit from our technology, services and scale. For learners, we believe that a more open online education ecosystem will make it easier for anyone to pick up new skills and concepts at any time, anywhere.
Make Your Websites More Accessible to More Users with Introduction to Web Accessibility
Tuesday, September 10, 2013
Posted by Eve Andersson, Manager, Accessibility Engineering
Cross-posted with
Google Developer's Blog
You work hard to build clean, intuitive websites. Traffic is high and still climbing, and your website provides a great user experience for all your users, right? Now close your eyes. Is your website easily navigable? According to the World Health Organization, 285 million people are visually impaired. That’s more than the populations of
England
,
Germany
, and
Japan
combined!
As the web has continued to evolve, websites have become more interactive and complex, and this has led to a reduction in accessibility for some users. Fortunately, there are some simple techniques you can employ to make your websites more accessible to blind and low-vision users and increase your potential audience.
Introduction to Web Accessibility
is Google’s online course that helps you do just that.
You’ll learn to make easy accessibility updates, starting with your HTML structure, without breaking code or sacrificing a beautiful user experience. You’ll also learn tips and tricks to inspect the accessibility of your websites using Google Chrome extensions. Introduction to Web Accessibility runs with support from Google content experts from September 17th - 30th, and is recommended for developers with basic familiarity with HTML, JavaScript, and CSS.
There’s a lot to learn in the realm of web accessibility, and a lot of work to be done to ensure users aren’t excluded from being able to easily navigate the web. By introducing fundamental tips to improve web usage for users with visual impairments, Introduction to Web Accessibility is a starting point to learn how to build accessibility features into your code.
Registration
is now open, so sign up today and help push the web toward becoming truly universally accessible.
A Comparison of Five Google Online Courses
Thursday, September 05, 2013
Posted by Julia Wilkowski, Senior Instructional Designer
Google has taught five open online courses in the past year, reaching nearly 400,000 interested students. In this post I will share observations from experiments with a year’s worth of these courses. We were particularly surprised by how the size of our courses evolved during the year; how students responded to a non-linear, problem-based MOOC; and the value that many students got out of the courses, even after the courses ended.
Observation #1: Course size
We have seen varying numbers of registered students in the courses. Our first two courses (Power Searching versions one and two) garnered significant interest with over 100,000 students registering for each course. Our more recent courses have attracted closer to 40,000 students each. It’s likely that this is a result of initial interest in MOOCs starting to decline as well as students realizing that online courses require significant commitment of time and effort. We’d like other MOOC content aggregators to share their results so that we can identify overall MOOC patterns.
*based on surveys sent only to course completers. Other satisfaction scores represent aggregate survey results sent to all registrants.
Observation #2: Completion rates
Comparing these five two-week courses, we notice that most of them show a completion rate (measured by the number of students who meet the course criteria for completion divided by the total number of registrants) of 11–16%. Advanced Power Searching was an outlier at only 4%. Why? A possible answer can be found by comparing the culminating projects for each course: Power Searching consisted of students completing a multiple choice test; Advanced Power Searching students completed case studies of applying skills to research problems. After grading their work, students also had to solve a final search challenge.
Advanced Power Searching also differed from all of the other courses in the way it presented content and activities. Power Searching offered videos and activities in a highly structured, linear path; Advanced Power Searching presented students with a selection of challenges followed by supporting lessons. We observed a decreasing number of views on each challenge page similar to the pattern in the linear course (see figure 1).
Figure 1. Unique page views for Power Searching and Advanced Power Searching
Students who did complete Advanced Power Searching expressed satisfaction with the course (95% of course-completing students would recommend the course to others, compared with 94% of survey respondents from Power Searching). We surmise that the lower completion rate for Advanced Power Searching compared to Power Searching could be a result of the relative difficulty of this course (it assumed significantly more foundational knowledge than Power Searching), the unstructured nature of the course, or a combination of these and other factors.
Even though completion rates seem low when compared with traditional courses, we are excited about the sheer number of students we’ve reached through our courses (over 51,000 earning certificates of completion). If we offered the same content to classrooms of 30 students, it would take over four and a half years of daily classes to teach the same information!
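The figures above can be reproduced with simple arithmetic. The sketch below uses the rounded numbers quoted in this post ("nearly 400,000" registrants, "over 51,000" certificates), so the results are approximate; the assumption of one class per day, 365 days a year, is our reading of "daily classes".

```python
def completion_rate(completers: int, registrants: int) -> float:
    """Completers divided by total registrants, as defined in the post."""
    return completers / registrants


def classroom_years(students: int, class_size: int = 30,
                    classes_per_year: int = 365) -> float:
    """Years needed to teach `students` one class at a time."""
    classes = students / class_size
    return classes / classes_per_year


if __name__ == "__main__":
    # Rounded figures from the post; actual counts differ slightly.
    print(f"{completion_rate(51_000, 400_000):.1%}")
    print(f"{classroom_years(51_000):.2f} years")
```

With these inputs the overall completion rate comes to about 13%, inside the 11–16% range reported for most courses, and 1,700 classes of 30 students at one per day take a little over four and a half years.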
Observation #3: Students have varied goals
We would also like to move the discussion beyond completion rates. We’ve noticed that students register for online courses for many different reasons. In Mapping with Google, we asked students to select a goal during registration. We discovered that
52% of registrants intended to complete the course
48% merely wanted to learn a few new things about Google’s mapping tools
Post-course surveys revealed that
78% of students achieved the goal they defined at registration
89% of students learned new features of Google Maps
76% reported learning new features of Google Earth
Though a much smaller percentage of students completed course requirements, these statistics show that many of the students attained their learning goals.
Observation #4: Continued interest in post-course access
After each course ended, we kept many of the course materials (videos, activities) available. Though we removed access to the forums, final projects/assessments, and teaching assistants, we have seen significant interest in the content as measured by Google and YouTube Analytics. The Power Searching course pages have generated nearly three million page views after the courses finished; viewers have watched over 160,000 hours (18 years!) of course videos. In the two months since Mapping with Google finished, we have seen over 70,000 unique visitors to the course pages.
In all of our courses, we saw a high number of students interested in learning online: 96% of Power Searching participants agreed or strongly agreed that they would take a course in a similar format. We have succeeded in teaching tens of thousands of students to be more savvy users of Google tools. Future posts will take an in-depth look at our experiments with self-graded assessments, community elements that enhance learning, and design elements that influence student success.
Google Research Awards: Summer 2013
Monday, August 12, 2013
Posted by Maggie Johnson, Director of Education & University Relations
Another round of the
Google Research Awards
is complete. This is our biannual open call for proposals on computer science-related topics including machine learning and structured data, policy, human-computer interaction, and geo/maps. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google scientists and engineers.
This round, we received 550 proposals from 50 countries. After expert reviews and committee discussions, we decided to fund 105 projects. The subject areas that received the highest level of support were human-computer interaction, systems and machine learning. In addition, 19% of the funding was awarded to universities outside the U.S.
We noticed some new areas emerging in this round of proposals. In particular, we saw increased interest in neural networks and accessibility-related projects, along with some innovative ideas in robotics. One project features the use of
Android-based
multi-robot systems, which are significantly more complex than single-robot systems. Faculty researchers are also looking to explore novel uses of
Google Glass
such as an indoor navigation system for blind users, and to study how Glass can facilitate social interactions.
Congratulations to the well-deserving
recipients of this round’s awards
. If you are interested in applying for the next round (deadline is October 15), please visit
our website
for more information.
Computer Science Teaching Fellows Starting Up in Charleston, SC
Wednesday, August 07, 2013
Posted by Cameron Fadjo, Program Lead, Computer Science Teaching Fellows
Google recently started up an exciting new program to ignite interest in computer science (CS) for K12 kids. Located in our
South Carolina data center
, the Computer Science Teaching Fellows is a two-year postgraduate fellowship for new STEM teachers and CS graduates. The goal is to bring computer science and computational thinking to
all
children, especially underrepresented minorities and girls, and close the gap between the ever-increasing demand in CS and the inadequate supply. We hope to learn what really works and scale those best practices regionally and then nationally.
The supply of CS majors in the pipeline has been a concern for many years. In 2007, the Computer Science education community was alarmed by the lack of CS majors and enrollments in US colleges and universities.
Source: 2009-2010 CRA Taulbee Survey (
http://www.cra.org/resources/
)
This prompted the development of several programs and activities to start raising awareness about the demand and opportunities for computer scientists, and to spark the interest of K12 students in CS. For example, the
NSF
funded curriculum and professional development around the new
CS Principles
Advanced Placement course. The
CSTA
published
standards
for K12 CS and a
report
on the limited extent to which schools, districts and states provide CS instruction to their students. The CS advocacy groups
Computing in the Core
and
Code.org
have played an instrumental role in adding provisions to the reauthorization of the
Elementary and Secondary Education Act
to
support CS education
. More generally, we have seen innovations in online learning with
MOOCs
,
machine learning
to provide personalized learning experiences, and platforms like
Khan Academy
that allow flipped classrooms.
All of these activities represent a convergence in the CS education space, where existing programs are ready for scale, and technological advancements can support that scale in innovative ways. Our Teaching Fellows will be testing after-school programs, classroom curriculum and online CS programs to determine what works and why. They’ll start in the local Charleston area and then spread the best programs and curriculum to South Carolina, Georgia, and North Carolina (where we also have large data centers). They are currently preparing programs for the fall semester.
We are very excited about the convergence we are seeing in CS education and the potential to bring many more kids into a field that offers not only great career opportunities but also a shot at really making a difference in the world. We’ll keep you posted on the progress of our Teaching Fellows.
Under the hood of Croatian, Filipino, Ukrainian, and Vietnamese in Google Voice Search
Thursday, July 25, 2013
Posted by Eugene Weinstein and Pedro Moreno, Google Speech Team
Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions - Croatian, Filipino, Ukrainian, and Vietnamese - required creative solutions to reflect how each language is used across devices and in everyday conversations.
For example, since Vietnamese is a
tonal language
, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (
tonemes
) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory. As a result, we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. To address this, we created a special diacritic restoration algorithm which enables us to present properly formatted text to our users in the majority of cases.
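To make the idea of diacritic restoration concrete, here is a minimal sketch in Python. It is not the algorithm used in Voice Search, whose details are not described here; it is a simple stand-in that maps each diacritic-free word to the most frequent fully marked form seen in a reference word list. The function names and the word list are hypothetical, chosen only for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove tone and vowel marks from Vietnamese text.

    The letter đ/Đ has no Unicode decomposition, so it is mapped by hand.
    """
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restorer(corpus_words):
    """Map each stripped word to its most frequent diacritized form.

    Assumption: corpus_words is a list of correctly marked words, for
    example taken from edited text. This is a context-free baseline.
    """
    counts = defaultdict(Counter)
    for word in corpus_words:
        word = unicodedata.normalize("NFC", word)
        counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(query, restorer):
    """Restore diacritics word by word; unknown words pass through unchanged."""
    return " ".join(restorer.get(word, word) for word in query.split())
```

With a restorer built from a word list in which "việt" is more frequent than "viết", calling `restore("viet nam", restorer)` would pick "việt" for the first word. A practical system would also have to use the surrounding words, since many stripped forms are ambiguous.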
Filipino also presented interesting challenges. Much like people in other multilingual societies such as Hong Kong, India, and South Africa, Filipinos often mix several languages in their daily life. This is called
code switching
. Code switching complicates the design of pronunciation, language, and acoustic models. Speech scientists are effectively faced with a dilemma: should we build one system per language, or should we combine all languages into one?
In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best to model this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.
The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale
neural network
learning infrastructure (yes, the same one that spontaneously
discovered cats
on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.
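The partitioning idea can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the actual training infrastructure: it splits the weight matrix of a single linear layer into row shards, evaluates each shard in a separate worker thread (standing in for a separate computation server), and concatenates the partial outputs. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(weights, num_parts):
    """Split the rows of a weight matrix into num_parts contiguous shards."""
    size, extra = divmod(len(weights), num_parts)
    shards, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        shards.append(weights[start:end])
        start = end
    return shards


def shard_forward(shard, inputs):
    """Compute the output units owned by one shard (one 'server')."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in shard]


def layer_forward(weights, inputs, num_parts):
    """Evaluate a linear layer with its parameters partitioned across workers."""
    shards = partition_rows(weights, num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partial_outputs = list(
            pool.map(shard_forward, shards, [inputs] * num_parts)
        )
    # Shards are contiguous and ordered, so concatenation restores the
    # original order of the output units.
    return [y for part in partial_outputs for y in part]
```

Each worker only ever touches its own slice of the parameters, which is what lets a model too large for one machine be spread over many. Training additionally requires exchanging gradients and activations between partitions, which this sketch leaves out.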
The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts
Wednesday, July 17, 2013
Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research
“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato
When you type in a search query -- perhaps
Plato
-- are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The
Knowledge Graph
and
Freebase
are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released
data to help with disambiguation
and recently awarded
$1.2M in research grants
to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of
ClueWeb09 FACC
and
ClueWeb12 FACC
. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (
Freebase MIDs
). For example:
Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer even higher precision, we include confidence levels so that you can filter out the lower-confidence annotations that we did include.
Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
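As an illustration of filtering on the confidence levels mentioned above, here is a minimal Python sketch. The six-column tab-separated layout, the field names, and the sample rows are assumptions made for the example, not the documented format of the released files; consult the dataset pages for the real layout before adapting it.

```python
import csv
from typing import Iterable, Iterator, List, NamedTuple


class Annotation(NamedTuple):
    doc_id: str        # hypothetical document identifier
    mention: str       # phrase found in the document
    start: int         # start offset of the phrase
    end: int           # end offset of the phrase
    confidence: float  # annotator confidence in [0, 1]
    mid: str           # Freebase MID assigned to the phrase


def read_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Parse tab-separated annotation rows (assumed six-column layout)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for doc_id, mention, start, end, confidence, mid in reader:
        yield Annotation(doc_id, mention, int(start), int(end),
                         float(confidence), mid)


def filter_by_confidence(annotations: Iterable[Annotation],
                         threshold: float) -> List[Annotation]:
    """Keep only annotations at or above the confidence threshold."""
    return [a for a in annotations if a.confidence >= threshold]
```

Raising the threshold trades recall for precision: fewer annotations survive, but those that remain are the ones the annotator was most sure about.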
The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several
TREC query sets
, including those from the
Million Query Track
and
Web Track
.
If you would prefer a human-annotated set, you might want to look at the
Wikilinks Corpus
we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.
You can find more detail and download the data on the pages for the two sets:
ClueWeb09 FACC
and
ClueWeb12 FACC
. You can also subscribe to our
data release mailing list
to learn about releases as they happen.
Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.