Full Transcript: 8 September
Monday 8 September 2003
Produced by Sue Clark
Richard Aedy: Hello, I'm Richard Aedy, welcome to The Buzz. This week: every book ever written online.
Raj Reddy: If you scanned a page every second, it would take a hundred years to get a million books and it would take ten thousand years to get a hundred million books.
Richard Aedy: Yes, it's quite a task and we'll hear more about it later in the show. But we begin this week with a tragic anniversary.
ABC RADIO NEWS THEME
President Bush: Freedom itself was attacked this morning by a faceless coward and freedom will be defended. Make no mistake, the United States will hunt down and punish those responsible.
Newsreader: US President George W Bush speaking after a wave of attacks struck at the heart of America's military, economic and political establishments. Good morning, John Logan with ABC News. The once towering World Trade Centre in New York...
Richard Aedy: 2792 people died in the attack on the World Trade Centre and the job of identifying them went to the Office of the Chief Medical Examiner of New York City. The sheer size of the task prompted the doctors to ring Howard Cash, founder of Gene Codes, a bioinformatics company with a well-regarded DNA sequencing program.
That phone call changed Howard Cash's life and led Gene Codes to develop a new kind of identification software called the Mass Fatality Identification System, or M-FISys. Here's Howard Cash.
Howard Cash: Actually when I got that phone call I thought that the medical examiner was going to ask us to donate some of our software that we've been selling to laboratories for the previous, you know, dozen years. And of course I was completely prepared to donate copies but they were very insistent that we should come to New York and I had no idea what the level of the problem was until I got there.
Richard Aedy: Because the situation is actually more complicated than just possibly 20,000 people dead, isn't it, it was more than enormous numbers?
Howard Cash: Yes, there were an enormous number of fragments, to date coincidentally you picked that number, but 20,000 remains just about have been recovered for those 2700 people. But the other thing to consider is that not only were they fragmented but they were out in the elements, they were on a pile of rubble with heavy equipment going across it on a pile of burning jet fuel, soaked in water while people tried to put out those jet fuel fires. The amount of compromise, the amount of damage done to the human tissue that was left was enormous.
Richard Aedy: And some people, some individuals, were in as many as 200 pieces.
Howard Cash: That's true, there were something over 200 intact bodies more or less, maybe with injuries, and people in many pieces and I think in the most extreme case, I can't say the worst case, they were all terrible, those were in about 200 pieces.
Richard Aedy: I also remember reading shortly afterwards that you had examples of bone from one person being found inside tissue from another person, just to really complicate things.
Howard Cash: That's true and you remember there are two 110-storey skyscrapers that came down. One of the striking things about Ground Zero for those who worked there, not even as much what we found but what wasn't found, is 20% of all the office space in Manhattan on the ground. And there was no office furniture, no computers and desks, no doors, so you can imagine what it did to the people. And not only were they torn to pieces by falling steel and reinforced concrete but they were co-mingled. A lot of effort was taken to try and make sure, as remains came to the morgue, the anthropologist who was the first to review them would try and ensure that nothing went into a body bag that could possibly be more than one person but it really wasn't always possible to tell.
Richard Aedy: All of this massive destruction meant that the usual methods of identifying people, I'm thinking of fingerprints and dental records in particular, only took you so far.
Howard Cash: That's true, there were quite a few people who were identified initially, or maybe parts of their remains were identified initially, by fingerprints or dental records. But when I say something specific I hope your listeners won't think I'm being disrespectful to the memories of the people killed, but there's no other way to communicate the information. If you find somebody's leg or an individual vertebra there are no fingerprints, there are no teeth to work with.
Richard Aedy: Now there was DNA analysis done and there was already some large DNA analysis programs kind of in operation fairly quickly afterwards.
Howard Cash: Well the first attempts were using something called CODIS, it was developed by the FBI here in the States and it's widely used elsewhere in the country for matching DNA fingerprints, as they're commonly called, for crime scenes.
Richard Aedy: DNA fingerprints, these are brief stretches out of the three billion bases that make up our DNA that repeat over and over, and it's matching these at different locations on the DNA, isn't it?
Howard Cash: Right, so in a particular one of these locations each of us will have two copies of the stretch of DNA – one from our mother, one from our father – and depending on these short tandem repeats, you'll remember from high school that DNA is made up of kind of four building blocks – A, C, G and T. And you can spell anything, make any gene from those four letters like you can make any English book by combining the 26 letters of our alphabet. But these short repeats might say ACTT, ACTT, ACTT, ACTT and when I measure how long that stretch of DNA is, I might find out that I got 12 from one of my parents and 15 from another parent. And at that same location you got 9 from one parent and 20 from your other parent. And then we check 13 of those locations, so it's like a combination lock that instead of turning three times right to left, you turn 13 times. The chances of any two people having a whole combination identical, other than identical twins, is about one in 200-500 trillion.
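Editor's sketch: the combination-lock comparison described above can be illustrated roughly as follows. This is a minimal Python sketch, not the M-FISys implementation; the locus names (L01 to L13), the `Profile` type and the `profiles_match` function are hypothetical, and the repeat counts simply reuse the numbers from the interview.

```python
from typing import Dict, FrozenSet

# A profile maps a locus name to the repeat counts measured there:
# one or two values, one inherited from each parent.
Profile = Dict[str, FrozenSet[int]]


def profiles_match(a: Profile, b: Profile, required_loci: int = 13) -> bool:
    """Return True only if at least `required_loci` loci are shared
    and the repeat counts agree at every shared locus."""
    shared = set(a) & set(b)
    if len(shared) < required_loci:
        return False
    return all(a[locus] == b[locus] for locus in shared)


# Hypothetical locus names; real STR kits use named genetic markers.
loci = [f"L{i:02d}" for i in range(1, 14)]

# One person with 12 and 15 repeats at every locus (illustrative only).
person_a = {locus: frozenset({12, 15}) for locus in loci}

# A second person who differs at a single locus (9 and 20 repeats).
person_b = dict(person_a)
person_b["L01"] = frozenset({9, 20})

print(profiles_match(person_a, dict(person_a)))  # True
print(profiles_match(person_a, person_b))        # False
```

A single disagreeing locus is enough to reject the match, which is what makes the full 13-locus combination so specific.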
Richard Aedy: That would seem to be fairly comprehensive, what was the limitation of using DNA fingerprinting when it came to identifying victims of September 11th?
Howard Cash: Well the good thing about this kind of DNA that comes in the nucleus of every cell, is that it's very, very specific, just like I described. The downside is that it's rather fragile and it'll start to break down when you die, it'll certainly start to break down when human remains are exposed to temperatures in the area of 2000 degrees Fahrenheit where jet fuel is burning in an enclosed area. So it began to degrade, we might only get a few of the numbers, maybe at the 13 locations we try, we might only get two or three or five. And that's not enough specificity to say in a population of almost 3000 missing people that we know for sure who this person is. So we had to go to extra efforts to get a little more information to give us enough certainty to return remains to a family so they could have a proper burial.
Richard Aedy: Your approach was to build an analysis program that could handle more than one type of information about DNA.
Howard Cash: Right. When we started out we had several goals, two immediate problems were one, how do we combine different types of tests? And how do we reduce the size of the haystack we were looking for by finding – to use your example – someone in 200 pieces and not to have to look at that 200 times. So the first step was to say when remains, even if we don't know who they are, are clearly one and the same person, to kind of collapse that into a type of representative record. And for that record we wanted to add several different kinds of tests, the STRs, the short tandem repeats that are commonly called DNA fingerprints, was the one we tried first. There's another kind of DNA in every cell and it's more plentiful, called mitochondrial DNA. The mitochondria is the little organ inside every cell and actually there might be 500 of these little organs in every cell, they are the power plant of the cell.
Richard Aedy: Yeah, they generate the energy that all cells run on, don't they?
Howard Cash: Absolutely and they have a little bit of DNA of their own but instead of 3 billion characters it's about 16,500 characters in a little circle. It's circular, it's smaller, it's tougher, it's hardy material and you can usually get mitochondrial DNA out of a mummy – it's very likely to last. So sometimes when the nuclear DNA that would be used for a fingerprint had degraded beyond where it could be recovered in the lab, the mitochondrial DNA still survived.
Richard Aedy: But the thing about mitochondrial DNA though is we only inherit it from our mothers.
Howard Cash: That's true, you don't get it from both parents and the other thing is that it's just not as specific.
Richard Aedy: Yeah, because it's so much shorter, you're dealing with so much less of it.
Howard Cash: Yep, it doesn't have those short tandem repeats, we look at individual character differences that have just diverged over the course, a fairly short course, of human evolution. And you only get it from one parent so it doesn't continually get recombined as you said. The result is the most common pattern we found in that mitochondrial circle of DNA is found in about 7% of the Caucasian population and that's not enough to make an ID. It might be enough to exclude somebody. Based on a partial STR, a partial fingerprint, we could say 'well we know it has to be one of these three people who share those locations.' And if the mitochondrial can exclude two of them because two of them don't match and one of them does, that might be enough to return somebody to their family.
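Editor's sketch: the two-stage narrowing described above (partial STR profile first, mitochondrial type as a tie-breaker) might be expressed like this. It is a simplified Python illustration under stated assumptions: the `Reference` record, the locus names, the placeholder people A to D and the mitochondrial labels M1 to M3 are all hypothetical, and real comparisons involve statistics rather than exact equality.

```python
from typing import Dict, FrozenSet, List, NamedTuple

Alleles = FrozenSet[int]


class Reference(NamedTuple):
    name: str                        # placeholder identifier, not real data
    str_profile: Dict[str, Alleles]  # locus -> repeat counts
    mito_type: str                   # hypothetical label for a mitochondrial pattern


def consistent_with_partial(partial: Dict[str, Alleles], ref: Reference) -> bool:
    """A reference stays in play if it agrees at every locus that survived."""
    return all(ref.str_profile.get(locus) == alleles
               for locus, alleles in partial.items())


def remaining_candidates(partial: Dict[str, Alleles], mito_type: str,
                         references: List[Reference]) -> List[str]:
    """Narrow by the partial STR profile, then exclude by mitochondrial type."""
    by_str = [ref for ref in references if consistent_with_partial(partial, ref)]
    return [ref.name for ref in by_str if ref.mito_type == mito_type]


# Only two loci were recovered from the degraded sample (illustrative values).
partial_profile = {"L01": frozenset({12, 15}), "L02": frozenset({9, 20})}

references = [
    Reference("A", dict(partial_profile), "M2"),
    Reference("B", dict(partial_profile), "M3"),
    Reference("C", dict(partial_profile), "M1"),
    Reference("D", {"L01": frozenset({10, 11}), "L02": frozenset({9, 20})}, "M1"),
]

# A, B and C share the recovered loci; only C also matches mitochondrial type M1.
print(remaining_candidates(partial_profile, "M1", references))  # ['C']
```

Neither test identifies anyone on its own here: the partial STR profile leaves three people, and the mitochondrial type alone would leave two.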
Richard Aedy: So your program in effect tries to build up a virtual individual for each person you're trying to identify. And you put in the DNA profiling information and then if you can't exclude someone you add the mitochondrial analysis.
Howard Cash: Uh-huh.
Richard Aedy: But that isn't going to get you everyone is it, that is still going to leave you with a large number of people whom you're just not sure about?
Howard Cash: We're at that point now. Of the 2792 people who died, as of 6.40 this morning only 1524 of those people have been identified, so only a little more than half. Some will never be found, we'll try to find everybody but some were effectively cremated and there may not be anything left to find. But for those who are, we want to try everything available in the lab, and the third technology that's being applied is the technology that came out of the Human Genome Project called SNPs – it stands for single nucleotide polymorphism. Instead of looking at a length of DNA you're interrogating a single character and saying 'are you a G, an A, a T or a C?' And you may only need about 50 or 60 characters in a row that are intact to do that test. So even if the DNA's badly broken up we might still be able to get some information that way.
Richard Aedy: Right, because we know that there is a sequence over this 50 or 60 characters and at some point though you get a replacement of one letter by another letter.
Howard Cash: Yeah, the ones that we've studied are ones that are known to vary, these are specific spots that through all the study by all the biologists around the world, at that particular position one of two characters is found. In our case it's either a C or it's a T, As and Gs are never found. So when we look at it we'll find that in every case at that position the person has a C or a T, or a C and a T, one from each parent, that's what we're looking for and we're looking for someone who has that same pattern. And you can look at inheritance from the parents with this the same way you do with normal nuclear DNA.
Richard Aedy: What are the chances of identifying someone incorrectly when you have all of this together and with your analysis program you're able to kind of lay out one of these, each of these, on top of the other?
Howard Cash: Well that question is asked a lot. There are some things that could lead to a wrong answer that involve some incoming information that's wrong but how sure we have to be actually is a policy decision. There's a statistical chance of inheriting certain characteristics from each parent and the decision that was made at the beginning of this project by the Chief Medical Examiner in the City of New York was about the certainty of a direct match, matching the DNA from remains that were found at the site to pre-existing patterns, like we might have gotten DNA from somebody's toothbrush – how sure do we have to be about the match? Well it had to be 10^10 or better. Less than one chance in 10^10 that this would have occurred anywhere else in the population, anywhere in the world based on the populations that we know. And there are two ways we can identify someone. One is by comparing their DNA to a known sample of their DNA collected before they died, that's getting DNA from their toothbrush or from an old surgical sample. The other thing is to compare the DNA to the DNA of their parents or their children or their siblings. Kinship comparisons are inherently less specific and they therefore require more points of reference, we like to have both parents. For kinship, the threshold set by the chief medical examiner is we have to be 99.9% sure and with only one potential match in all the known profiles. Remember in a population of 3000 victims if you were 99% sure that means it could still be any of 30 people, narrowing it down to at least no more than 1 in 3 and it can't match anybody else, it can't be a possibility to match anyone else. And because that's still a little fuzzy we require that we identify somebody by at least two different modes of identification. Maybe by a direct match and kinship or by a partial dental record and a DNA match.
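Editor's sketch: the arithmetic behind the kinship threshold can be checked in a few lines of Python. The function name `expected_chance_matches` is hypothetical, and reading "X% sure" as the fraction of the victim population that cannot be ruled out is a simplifying assumption for illustration, not the medical examiner's statistical model.

```python
from fractions import Fraction


def expected_chance_matches(population: int, certainty: str) -> Fraction:
    """People in the population who could still fit by chance,
    treating (1 - certainty) as the fraction not ruled out."""
    return population * (1 - Fraction(certainty))


# Figures from the interview: roughly 3000 victims.
print(expected_chance_matches(3000, "0.99"))   # 30
print(expected_chance_matches(3000, "0.999"))  # 3
```

On this reading, 99% certainty leaves about 30 possible people, while the 99.9% threshold leaves about 3, which is why a kinship match also has to be the only potential match among the known profiles.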
Richard Aedy: Because a false positive – when you ID someone and tell their family that you have and that later turns out to be wrong – that would be a disaster, wouldn't it?
Howard Cash: That is my greatest nightmare. The worst case is not that somebody won't be identified although that is terrible for that family. The worst case is that someone is identified, a funeral is held and a mistake was made.
Richard Aedy: Now all software that I've ever encountered in my life gets upgraded, have you been able to tweak this program over the last two years?
Howard Cash: We released the first version on December 13th 2001, we were hired effectively on October 8th, so in a couple of months we put the first version together. Between September 11th and December 13th 105 identifications had been made by DNA. The day we turned the program on for the first time we found 55 new matches that had not been found before just because it was too big a mountain of information to look through. They turned out to be positive IDs. Since then, almost every single week we've released an upgrade. That requires an enormous level of testing, releasing live production code every week is impossible but this whole project was impossible.
Richard Aedy: So you're up to sort of the 60th or 70th or beyond iteration now?
Howard Cash: I think we're in our 80s at this point. We have some disciplines though. All of our engineers have to write – before they start working on a new feature that's been requested by the medical examiner – they have to write the test that that feature will pass so we can run it through an automated test to make sure that it works. And they have to be approved by our QA department. Then when they make those changes, before they can put it into the final program it has to pass that test and the whole program has to pass every other test that's been written since the beginning of time, which for us is about the middle of October, 2001. And so there are about 2000, more than 2000 automated tests that get run 12 or 16 times every single day, it's very tough to make a fix in one part of the program and break another part of the program without it becoming very obvious to everybody.
Richard Aedy: Now Howard, this is not your core business, this has grown like Topsy to take over your company but you're not really making money out of this, are you?
Howard Cash: Our main business was a product called Sequencher, it's just a commercial software program used by laboratory biologists. We were involved with the Human Genome Project, and every major pharmaceutical company in the world uses that software. We took all of our engineers and our entire QA department except for one tester, one QA person and one engineer, off of Sequencher and put everything we had plus everybody we could hire onto nothing but the World Trade Centre. And originally we thought we'd be on this for 10 to 12 months and now it looks like we'll be on it for maybe a full three years. When we started the project 100% of our staff and all 11 of our shareholders agreed that we would not try and profiteer on this project and we said we'd do it at cost or as close to it as we could. In the end, now two years in and looking at three years, we're actually making more than we're spending so there is a kind of a line profit on it. The real cost to us is that our main business, the thing that will still be paying our payrolls when the World Trade Centre effort is finished, has not been updated; we haven't done an update in two years. And you know it's axiomatic in the software business, if you will, you don't make money selling software, you make money selling upgrades. Well we're not doing any upgrades so when that tap gets turned off from the city of New York we're going to have to scramble a little bit but every single person on the staff here would do it again in a heartbeat.
Richard Aedy: The software you came up with though, it occurs to me it can be used in other mass disasters too. There are earthquakes and hurricanes every single year.
Howard Cash: There are earthquakes and hurricanes, the Australian Federal Police laboratory was very involved in the recovery from the Bali bombing in November 2002. Jim Robertson and Linzi Wilson-Wilde at the AFP in Canberra, I visited them in December of last year. And we kind of compared our experiences and we hope to build something better out of both of our experiences so that the next people donít have to start from scratch.
Richard Aedy: Well Howard, thanks very much for joining us on The Buzz – we really appreciate it.
Howard Cash: Glad to talk with you.
Richard Aedy: Howard Cash is President of Gene Codes Forensics in Ann Arbor, Michigan.
This is The Buzz – our weekly technology show on Radio National and Radio Australia – I'm Richard Aedy.
Imagine for a moment if you could have instant access to all human knowledge anywhere in the world. Well it's going to be possible according to Raj Reddy, Professor of Computer Science and Robotics at Carnegie Mellon University. And the first step along the way is the Million Book Digital Library Project, a modest proposal to put a million books online.
Raj Reddy: Now we're beginning to scan books in multiple languages and so by the end of this year because of this project we will have 100,000 books online and in the next year or two we should get to a million book point.
Richard Aedy: Let's work out the scale of the enterprise, how many books are there in existence?
Raj Reddy: They may not be in existence but we estimate from the time the Gutenberg Press was invented there are probably about 100 million books that have been printed and published in all the languages, of which there is only a record of 42 million books in the OCLC catalogue.
Richard Aedy: OCLC?
Raj Reddy: Is the catalogue – a union of all the books that are in all of the libraries in the United States and other countries, some of the other participating countries. So we think there are 100 million books and if you read a book every day from the time you're born to the time you die, you can only read about 40,000 books, there's no way you would be able to read all the books, even a million books, in a lifetime.
Richard Aedy: If we're going to have 100,000 available by the end of the year and the million is I guess the short-term aim, what about being ambitious and going for the 100 million in the long-term?
Raj Reddy: That's exactly where we need to go but it turns out if you scan a page every second it would take 100 years to get to a million books and it'll take 10,000 years to get to 100 million books so that's the magnitude of the problem. So instead of scanning one page every second, you have to scan 100 pages every second so that you can get the one million books in one year instead of 100 years. Fortunately this task is infinitely parallelisable, if every student in the world will scan one book the whole thing would be done in a week, because there are about 100 million students that are attending schools and if each one would do one book we are done. But the issue is logistics and co-ordination and finding and getting the book into the scanning centres and so on, which is time consuming.
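Editor's sketch: the scaling in these figures can be reproduced with a short Python calculation. It takes the stated baseline (one page per second gives a million books in 100 years) as given rather than deriving it from page counts; the constant and function names are hypothetical.

```python
# Baseline quoted in the interview: at 1 page per second,
# 1,000,000 books take 100 years.
BASELINE_BOOKS = 1_000_000
BASELINE_YEARS = 100


def years_to_scan(books: int, pages_per_second: float = 1.0) -> float:
    """Scale the quoted baseline: time grows with the number of books
    and shrinks in proportion to the combined scanning rate."""
    return BASELINE_YEARS * (books / BASELINE_BOOKS) / pages_per_second


print(years_to_scan(1_000_000))        # 100.0
print(years_to_scan(100_000_000))      # 10000.0
print(years_to_scan(1_000_000, 100))   # 1.0
```

Because the work divides cleanly across scanners, doubling the number of scanning stations halves the time, which is the sense in which the task is parallelisable.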
Richard Aedy: So what are the technical challenges that have to be solved to make this a realistic prospect?
Raj Reddy: The technical challenges are twofold. One is when you take a book, unless you cut it up, there's a curvature so when you image a book with curved pages characters change shape and so that's not very convenient and if you try to now recognise those characters with OCR systems they'll make a lot of errors. But there's image processing technology which would automatically convert a curved page into a flat page and so there are other things like when you scan a page it's like you tilt it, or you know your thumb prints are there, or some other noise is there. Removing the noise and cleaning it up takes substantial effort.
Once you've done all of that, now you have to do optical character recognition (OCR) so that those words are recognisable and searchable by computers rather than simply looking at a photograph or an image of the page.
And so that's where some of the technical challenges are right now but there are research challenges. Let's say you have those books already online, it turns out once you get all the books of the world online, a billion people might access it every day. The scalability of the enterprise, that has nothing to do with the content. Once it is there, connectivity, global connectivity at high bandwidths and being able to ask a multi-lingual information retrieval where you ask a question in one language, it retrieves answers in all languages and then converts it back to that language so that the person that asked it is able to read it. So those are currently unsolved problems and they will be solved over the next 10, 20 years because they are interesting research problems.
Similarly there's the problem of what we call summarisation. It turns out if I give you a book and say 'read this book and write a ten page summary' you'll do a good job but we don't have a computer that can do the same thing yet. And this is what we call an 'AI-complete problem', we don't know how to solve it, we don't have computers that can truly understand written English, or written French, or written whatever well enough to translate or summarise or you know and give you the information like a human being seems to be able to do. But there are systems that'll do an 80% good job or a 50% good job, and that may be enough if all that you're trying to do is find something that is related to your interest and to retrieve it and then you can spend more time, if you want, manually translating it.
Richard Aedy: Now that's some of the technical challenges, certainly not all of them that would be faced, but it occurs to me there's going to be other sorts of challenges too. I'm thinking of something like copyright, there's going to be issues of copyright and I guess other policy concerns that will have to be solved?
Raj Reddy: We have technical challenges, we have research challenges, then we have policy challenges. Of the policy challenges, the main one is the copyright problem. It turns out of the 100 million books about 92 million are out of print but still in copyright. That means nobody can access them so the policy question is are there new policies governments of the world can promulgate which would have the effect of compensating the authors and the creators of these works and at the same time make it available for the public good? Right now it's out of print, nobody's making any money on it, but it's still in copyright so nobody can access it, so it's a lose-lose proposition. So we need to come up with a win-win proposition where the authors get a tax break and if all of a sudden their book becomes very popular maybe they should also get paid on the basis of usage. If your book is checked out 10 thousand times maybe you should get a cheque for $10 or $100 or something from the government. And since we already spend billions of dollars on physical libraries setting aside some of that money for compensating the authors seems to make eminent sense. And finally making it possible for individuals to automatically contribute their new books and be able to get slightly higher payment every time it's downloaded. And it's not limited to books only, newspapers are already online and you want to be able to get at them, radio – NPR is already putting all of their broadcasts online so I can listen to it later except it's not always transcribed so I can't search for it.
Richard Aedy: You'll be pleased to know that The Buzz is always transcribed, you can search for this interview you're doing right now.
Raj Reddy: OK.
Richard Aedy: Where are we with the project, I mean in terms of how much it'll cost and who's going to need to contribute and participate, what's happening now?
Raj Reddy: So the costs are, if we try to do it in the US or Australia, to do a million books, it'll cost $100 to $200 million. But what we have done is made it a joint project between US, India and China – and India and China are providing the manpower at their cost and the US is providing the computers. And I'm sure Australia and Europe will join in at some point.
Richard Aedy: I saw you talking to the minister Richard Alston, have you been twisting his arm on this?
Raj Reddy: Yes, I said 'Australia has a unique opportunity to take a leadership role and say if you're an author of a book that's still in copyright but out of print, if you submitted it to this non-profit activity, you can take a tax deduction and we'll also give you an annual payment based on the amount of usage.'
Richard Aedy: How long is it going to take, you said perhaps five to ten years to get a million books, but 100 million books if we go back to Gutenberg – give us some dates.
Raj Reddy: The current plan is if we can get about 10 million books in the next 10 to 20 years, then that'll become such an important resource for the world that everybody will start you know using it and then it'll build on itself. More importantly all the new books – which are born digital already – can go right in and we won't have to do anything and so the access costs are much, much smaller.
Richard Aedy: And 100 million in the long-term – are you prepared to put a date on it?
Raj Reddy: Yeah, 100 million by 2200, how's that? In 100 years I think we'll be there.
Richard Aedy: It boggles the mind. Raj Reddy from the Million Book Digital Library project.
And that's it for this week. Remember our website is at abc.net.au/rn then click on The Buzz. Our email address is email@example.com.
The show's producer is Sue Clark, technical production by John Diamond. I'm Richard Aedy, thanks for listening.