Learning French

September 5th, 2011

I’m going to France for the next two weeks and thought it would be good to try to learn some of the language. When I was a kid, I had three years of French instruction in school, but it was by far my worst subject.

I got some French language instruction CDs, which I am listening to in my car as I commute. They’re called “Michel Thomas Method French for Beginners”. I was afraid that these would be just as boring as my classes were. In fact, they’re great fun to listen to and practice with.

The format is that there’s a French man teaching two students. After teaching a few words and a bit of grammar, he gives them a sentence to say; I can try to formulate it before they do (or pause the CD), and then I get immediate feedback.

When I was a kid learning French in school, we had to recite conjugations (“er”, “ir”, and “re” verbs, in many tenses), which was totally boring. Well, it turns out that you can say a whole lot of useful stuff without any of that. You can just use infinitives for all kinds of things (“je voudrais manger”). So you only need to learn the present-tense conjugations of a few verbs and you can say all kinds of useful things.

And you don’t need to use the future tense at all, since, just as in English you can say “I am going to eat”, in French you can say literally the same thing, “je vais manger”. Cool.

Now, whether I’ll be able to construct sentences on the fly while someone is standing there waiting for me and evaluating whether I’m right, is another question. I can easily imagine myself being embarrassed; maybe under pressure I’ll forget it all.

Also, whether I can understand what’s spoken to me is yet another issue.

I downloaded some French-English apps for the Android phone, but so far haven’t found any free apps better than Google’s own Translate. I’d rather have one that doesn’t depend on network access, though. I’m still looking.

Anyway, doing this is fun, which is perhaps what really matters. So if you had the same negative language-learning experience as I did, try doing it this way.

Seed Funding and Angel Groups: The Fast and The Furious

June 29th, 2011

I have written a blog post called Seed Funding and Angel Groups: The Fast and The Furious, which was posted on Dharmesh Shah’s On Startups blog. It’s about the speed at which entrepreneurs can acquire seed financing, whether angel groups or venture capital partnerships can move faster, and how much all this matters.

I put it on Dharmesh’s blog at his request. If you have comments, and if it’s all the same to you, it’s probably better to put them on his blog rather than here, just to keep all comments in the same place.

Comments on “Urban Myths about NoSQL”

June 17th, 2011

Dr. Michael Stonebraker recently posted a presentation entitled “Urban Myths about NoSQL”. Its primary point is to defend SQL, i.e. relational, database systems against the claims of the new “NoSQL” data stores. Dr. Stonebraker is one of the original inventors of relational database technology, and has been one of the most eminent database researchers and practitioners for decades.

Many of the virtues of relational databases described here are specifically about a new and highly innovative RDBMS called VoltDB. VoltDB is made by a company called VoltDB.com, of which Dr. Stonebraker is co-founder and CTO. (There is also a good writeup about VoltDB here.)

The following are some comments about four of the six points in the presentation. I don’t consider any of these to “debunk” the presentation or anything like that, but they point out considerations that I feel should be taken into account.

#1: SQL is too slow

This argument assumes a perfect (or excellent) query optimizer. Talk to anyone who has ever built a high-performance system on Oracle DB or DB2, and you will hear about serious problems in query optimizers. I am not saying that rolling your own C code is the answer, but query strategies often have to be provided explicitly by the developer or DBA.

Stored procedures have a serious problem: you can’t interleave your own code with database operations. This can particularly be a problem if each stored procedure is its own transaction rather than an operation within a transaction, as in VoltDB. Existing large systems may not be able to operate within that constraint, although new systems designed with it in mind might have no problem with this.

The claim that you can “go a lot faster” requires the whole database to be in main memory, as it is with VoltDB (the points on the slides here do not apply to RDBMSs other than VoltDB). The reason VoltDB can get rid of buffer management is that there are no (disk) buffers. VoltDB need not do lock management because there is no concurrency control: you just run every transaction to completion, since there is no reason to interleave transactions, since there are no I/O waits.
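As a concrete illustration of that execution model, here is a minimal Python sketch (illustrative names only; this is not how VoltDB is actually implemented): with the whole database in memory there are no I/O waits, so transactions can simply be queued and run to completion one at a time, with no lock manager and no interleaving.

```python
import threading
from queue import Queue

db = {"counter": 0}        # the entire "database" lives in main memory
work_queue = Queue()       # transactions wait their turn here

def run_serially():
    """Single executor thread: each transaction runs to completion alone,
    so no locks on the data are ever needed."""
    while True:
        txn = work_queue.get()
        if txn is None:            # shutdown sentinel
            break
        txn(db)                    # nothing else touches db concurrently

executor = threading.Thread(target=run_serially)
executor.start()

# Submit 1000 tiny "transactions"; they execute strictly one after another.
for _ in range(1000):
    work_queue.put(lambda db: db.__setitem__("counter", db["counter"] + 1))

work_queue.put(None)
executor.join()
```

Because every transaction runs alone, the increment needs no lock even though it is a read-modify-write; that is exactly the simplification that disappearing I/O waits buy you.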

This is great if it works for your application. In point #5, he says that most OLTP databases are not very big, e.g. < 1TB, and for a database that size, using main memory is quite feasible these days. The required sizes of OLTP databases will probably rise with time. Of course, computers and memory are also getting faster and larger for the same price.

#3: SQL Systems don’t scale

If you have ever been involved in benchmarking, you know how difficult it is to interpret benchmark results. Is it possible that these results were obtained by choosing a benchmark that is particularly favorable to VoltDB? The only benchmark that really matters is your own application: they are all different. Of course, the problem with that is that it’s hard to port your application merely to test performance. But ignoring that and looking at other benchmarks is like looking for a lost key under the streetlight because it’s easier to look there. I’m not saying that these numbers are misleading, and certainly not that they are intentionally misleading, but they are very hard to interpret without knowing exactly what was benchmarked, how everything was tuned, and so on. I say this from my own experience, having done benchmarking of database systems for years.

(Also note that by TPC-C, he does not mean the officially defined TPC-C benchmark; look it up and you’ll see that running it officially is a huge, major project. He means a very simplified example based on the key concepts in TPC-C. (You can see this in the academic papers by him and others.) That said, if you want a micro-benchmark that is as close as one can get to what people agree is a good measure of online transaction performance, this might be the best one can do.)

#5: ACID is too slow

ACID is great for software developers, providing them a very clean and easy-to-understand model. Ease of understanding is crucial for achieving simplicity, which is the Holy Grail of software development, enhancing maintainability and correctness. I’m all for ACID.

To clarify something often not explained well: the NoSQL stores are ACID. It’s just that what they can do within one ACID transaction is usually quite limited. For example, a transaction might only be able to fetch a value (or store a value, or increment a value) given the key, and then the transaction is over. That operation is ACID.
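A toy sketch of that single-operation model (illustrative Python; not any particular NoSQL product’s API): each call is atomic on its own, and the “transaction” is over as soon as the one call returns, so there is no way to group several operations atomically.

```python
import threading

class KVStore:
    """Toy key-value store where each operation is one atomic 'transaction'.
    A lock makes the single call atomic; nothing spans multiple calls."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:                 # the whole transaction is this call
            return self._data.get(key)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def increment(self, key, delta=1):
        with self._lock:                 # atomic read-modify-write, then done
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]

store = KVStore()
store.put("clicks", 10)
store.increment("clicks", 5)
```

Note what you cannot do here: there is no way to, say, decrement one key and increment another as a single atomic unit, which is exactly the limitation described above.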

In a classic RDBMS, you can do many operations within one transaction. Your program says “begin transaction” (sometimes this is tacit), and then you can do computations that include both code and database queries/updates, interleaved. At the end you say “commit transaction”. (During or at the end of a transaction, the DBMS might have to abort the transaction.)
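For example, using Python’s built-in sqlite3 module (the table and amounts are made up for illustration), application code can read a value, compute with it in ordinary code, and issue further updates, all inside one transaction that either commits or aborts as a unit:

```python
import sqlite3

# Hypothetical accounts table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # "begin transaction" is tacit here: sqlite3 opens one at the first write.
    cur = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'")
    balance = cur.fetchone()[0]

    # Arbitrary application code interleaved with database operations:
    transfer = min(balance, 30)

    conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                 (transfer,))
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                 (transfer,))
    conn.commit()        # "commit transaction": both updates become durable together
except Exception:
    conn.rollback()      # the transaction aborts; neither update survives
```

The point is that the read, the Python computation, and both updates live inside one atomic unit, which is precisely what the restricted single-operation stores above cannot express.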

Right now, very few DBMSs provide true ACID properties in the way they are really used in practice, for two reasons. First, they run at reduced “isolation levels”, which means that the “I” in ACID is compromised. See my blog article for an explanation of this.

Second, one often wants to provide a way to recover from the failure of an entire data center. This is done by having a second data center that is far enough away that it won’t be damaged by the failure of the primary data center. This means you can keep going in the face of a “disaster” such as a regional power outage, a tsunami, etc.

The problem is that if the data center is far enough away to have truly independent failure modes, then the network connection will have latency so high that it is not feasible to synchronously commit every transaction to the distant copy. Most often, commit results are sent asynchronously to the distant copy. If the local data center fails, any transactions that had been committed, but had not yet reached the distant copy, are lost. So those transactions were not durable, the “D” in ACID. So there is a tradeoff here. (People live with this by being willing to do manual fixups in the face of a disaster.)
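Here is a toy model of that asynchronous-replication window (illustrative Python, not any real replication protocol): the primary acknowledges each commit immediately and ships it to the distant replica in the background, so a transaction can be acknowledged to the client and still be lost if the primary fails before shipping it.

```python
primary_log = []   # durable at the local data center
replica_log = []   # copy at the distant data center, lagging behind

def commit(txn):
    """Commit on the primary; the client is told 'committed' right away."""
    primary_log.append(txn)

def ship_to_replica():
    """Background replication: copy whatever has accumulated since last time."""
    replica_log.extend(primary_log[len(replica_log):])

# Three transactions commit and get shipped...
for t in ["t1", "t2", "t3"]:
    commit(t)
ship_to_replica()

# ...then one more commits, but the primary fails before the next shipment.
commit("t4")

# t4 was acknowledged to its client, yet only t1-t3 survive at the replica.
lost = [t for t in primary_log if t not in replica_log]
```

The `lost` list is exactly the window of acknowledged-but-not-replicated transactions; shrinking it to zero requires synchronous commits, which the latency makes infeasible.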

As discussed above, VoltDB transactions do not allow you to interleave code in your application with transactions. (The stored procedures can run arbitrary code, in Java, but that’s not the same as what I described above.)

#6: In CAP, choose AP over CA

I disagree that network partitions are not a major concern. Very simple local-area networks do not suffer from partitions and network failures much, but even a medium-size network is vulnerable, and networks in large data centers are quite vulnerable, as you can easily learn from network operations experts. For example, routers fail, or are misconfigured.

Both Amazon and Google have published papers about their large-scale data stores. The papers talk a lot about how they deal with network partitions. If partitions were so unlikely, why are these large companies taking the problem so seriously, and using rather sophisticated techniques to deal with the partitions? Also, the study of how to deal with network partitions has been a hot topic of research for the last 35 years; again, why would that be true if partitions were not an important concern?

So, as your network becomes larger and more complex, dealing with partitions becomes more and more of an issue. My impression (I may be wrong) is that the “sweet spot” for VoltDB, at least at the moment, is for distributed systems that are not at the kind of very-large scale of an Amazon or Google, and indeed for a much smaller scale, which makes network partitions much less of a problem. There’s nothing wrong with this at all; I’m just trying to clarify the issue and explain the reason for the controversy about this point.

Final Note

There has been an exciting explosion of innovative database technology in the last few years. Many different kinds of applications have different requirements. It’s great news for all of us that there are so many solutions at different points in the requirement space.

What are “Human-Generated Data” and “In-RAM Databases”?

May 24th, 2011

For thoughtful commentary on all kinds of database and data storage systems, one of the best sources is Curt Monash’s DBMS2 blog.  Recently he posted an article called Traditional Databases will eventually wind up in RAM.  I have two comments about his points from that article.

Human-Generated Data

I’m still not totally comfortable with Curt’s distinction between “human-generated” and “machine-generated” data. Data from humans always goes through machines, so at some level all data is machine-generated. I think what Curt is saying is that the number of humans is roughly constant (on the time scale he means), and they only have so much time in a day to key in data, etc. But what about trends that create more bits from any particular bit of human activity?

In the old days, records in databases were created when a person “keyed in” some fields. Now, data is generated every time you click on something. As data systems increase in capacity, won’t computers start gathering more and more data for each human interaction? For example, every time I click, the system could record what I clicked on, plus such context as the entire contents of my browser screen, how long it has been since my last click, the times of each of the previous 1,000 clicks, everything it currently knows about my buying habits (and this keeps changing), etc.

That may be far-fetched, but I’m not so sure: betting on things staying the same size as they are has usually turned out to be less than prescient. In any case, the underlying principle is analogous to the “Freeway Effect”: if there are higher data rates and bigger databases, there will never be “enough”.

We’ll find more data to transmit and more to store, forever and ever.


In-RAM Database Systems

Having a database “in RAM” can mean more than one thing.

In traditional DBMS design, data “in RAM” is vulnerable to a very common failure mode, namely, the machine crashing.  So no database data is considered to be durable (the “D” in “ACID”) until it has been written to disk, which is less vulnerable, especially if you use RAID, etc.  So traditionally writes are sent to a log and forced to disk.  You can still keep the data itself in RAM, but recovery from the log will take longer and longer as the log grows in size, so you “checkpoint” the data by writing it out to disk.  That can be done in the background if everything is designed properly.  This is utterly standard.
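A minimal sketch of that standard log-plus-checkpoint design (illustrative Python; the file names and JSON format are made up): writes are forced to an append-only log before being acknowledged, the data itself lives in RAM, and a checkpoint bounds recovery time by letting the log be truncated.

```python
import json
import os
import tempfile

class TinyDB:
    """Toy write-ahead-log database: data in RAM, durability via log + checkpoint."""

    def __init__(self, log_path, ckpt_path):
        self.log_path, self.ckpt_path = log_path, ckpt_path
        self.data = {}
        if os.path.exists(ckpt_path):
            with open(ckpt_path) as f:
                self.data = json.load(f)          # start from the last checkpoint
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:                    # replay the log tail
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())                  # force to disk before acking
        self.data[key] = value                    # the data itself stays in RAM

    def checkpoint(self):
        """Write the whole table to disk so the log can be truncated."""
        with open(self.ckpt_path, "w") as f:
            json.dump(self.data, f)
        open(self.log_path, "w").close()          # recovery no longer needs old log

# Usage: survive a "crash" (simulated here by just reopening) via replay.
d = tempfile.mkdtemp()
db = TinyDB(os.path.join(d, "db.log"), os.path.join(d, "db.ckpt"))
db.put("a", 1)
db.put("b", 2)
db.checkpoint()      # done in the background in a real system; synchronous here
db.put("c", 3)       # lands in the (now short) log after the checkpoint
recovered = TinyDB(os.path.join(d, "db.log"), os.path.join(d, "db.ckpt"))
```

Without the checkpoint, recovery would have to replay every write ever made; with it, replay is bounded by the log written since the last checkpoint.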

It’s also traditional that there isn’t enough RAM to hold the whole database, so RAM is used as a cache.  This creates some issues when you have to write a modified page back to disk NOT as part of a checkpoint, and there are very standard ways to deal with that.

“In RAM” can mean (a) as above, but usually/always the RAM cache is so big that you never overflow the cache; (b) the database system is designed so that data must fit in RAM, which can simplify buffer management and recovery algorithms; (c) you get around the machine-crash problem some way or other and really do keep everything only in RAM.

One way to do (c) is to keep all data in (at least) two copies, such that they’ll never both be down. This requires that the machines (1) have very, very independent failure modes, which is not as easy to achieve as one might think, and (2) get fixed very quickly, since while one is down you have fewer copies. Issue (2) is one reason to keep more than two copies; usually three copies are recommended, with one being at a “distant” data center.

This approach can be used for the log even if not for the whole DBMS. HDFS, the Hadoop Distributed File System, and VoltDB consider this the preferred/canonical way to go. In both cases, some users still feel uncomfortable with approach (c), and so both have put in ways to commit the log to a conventional disk. The hope is that as approach (c) proves itself in real production environments over the years, it will become more and more accepted.




Come to see SPACE OPERA!

April 4th, 2011

Please come to the North Cambridge Family Opera’s production of Space Opera, by NCFO founder David Bass. Space Opera is a light-hearted galactic odyssey, based on a familiar science fiction tale of heroes and villains, robots and aliens, unlikely adventures, and supernatural nonsense.

Featuring entertaining lyrics set to singable music in a variety of popular and classical styles, Space Opera is presented by an inter-generational cast in English with side titles. This entirely sung 110-minute show (plus intermission) is full of dancing Stormtroopers, singing Jawas, droids of all shapes and sizes, and a cantina (with a live band!) that is indeed a wretched hive of scum and villainy. The sets and lighting add some stunning effects to the show. Here is a synopsis with samples of the music.

Space Opera will be performed eight times at the Peabody School, 70 Rindge Avenue, Cambridge MA, which is in North Cambridge, between Porter Square and Arlington. The first four shows have already happened; upcoming shows are:

  • Saturday, April 9 at 2:00pm
  • Saturday, April 9 at 7:00pm
  • Sunday, April 10 at 1:00pm
  • Sunday, April 10 at 5:30pm

The cast of more than 150 soloists and chorus members is drawn from many Greater Boston communities and ranges in age from 7 to 84. They are divided into two casts, and both are excellent, but I’m partial to the cast performing in shows 1, 3, 6 and 8 because I’ll be performing (as Owen Lars, Luke’s uncle). Information on which cast members perform in which shows can be found here.

Admission this year is free, with a suggested donation of $5 for children, $10 for adults. Snacks will be available for purchase at intermission, as well as pizza at the Sunday 5:30pm shows. T-shirts, CDs and DVDs will also be available for purchase at intermission.

Come early to make sure you get a seat! For more information about NCFO, visit the NCFO website, and please forward this information to anyone you think may be interested.

We’d love it if you RSVP on our Facebook page.

Feel free to invite your friends and help us spread the word.

See you at the show!