OpenFacts, the Open Source Knowledge Database
Wikimedia sites went down just after 12 noon US/Eastern (16:00 UTC) on Sunday June 6. It was available read-only from 15:42 to 18:15 on Sunday, and again starting at 15:55 Monday for a few hours. Was brought back up in read-write mode on Monday the 7th at 22:39. The causes were database and human factor related. The wiki engine itself played no role. At the time, the total size of the database was about 80 gigabytes, by far the largest public wiki database in the world.
- Times are US Eastern (UTC -4) except server times, which are UTC.
- Suda: current primary database server running MySQL 4.0.17.
- Ariel: new database server, still in test, has known MySQL client issues using the stock MySQL client. Had old backup and current article versions from this morning in MySQL on it for testing.
- Many things are being done at the same time - if you see that one task has started it doesn't mean that another has finished.
At some point prior to or during the following initial events the making of the weekly database copies for downloaders had been started and had proceeded as far as de, leaving some empty and some available (those from last week are also available).
On Sunday, June 6, at about 11:13 AM changes to Suda (database server) mysql memory to double tmp_table_size to 128MB and sort_buffer_size to preumably double its value were made.
Starting at 11:40 eastern mysqld stopped taking connections, including from the unix domain socket. At 12:10 Tim reported:
"I started a mysqld instance on a different TCP port to [the] live server, to test the configuration, then shut it down. Problem is, now I can't connect to the live server on the unix domain socket. 'mysqladmin -P 3306 shutdown' tries to connect to the unix port and fails, despite my port specification. Is there some way to restart the remaining daemon short of kill -9? Should I comment out the socket in my.cnf?" People on #mysql knew no answer to these questions.
This was followed by:
[06-12:16:38] <Xirzon> so why don't we kill -9 it?
[06-12:17:01] <Xirzon> if they trap all other signals that seems like the only way
[06-12:17:04] <TimStarling> can that cause data loss?
[06-12:17:13] <Xirzon> data loss data schmoss
[06-12:17:32] <TimStarling> ok
[06-12:18:xx] * TimStarling copies channel log for use in later trial
[06-12:18:xx] <Xirzon> *cough*
[06-12:18:xx] <Xirzon> it's not doing anything right now
[06-12:18:xx] <Xirzon> from when's the last backup?
mysqlsafe is presumed to have sent the signal and mysql is presumed to have exited after a few minutes. The site was working briefly after it had been killed. Lots of weird stuff in the log was reported. Memory changes to previous values didn't resolve them.
Server reported "040606 16:43:52 InnoDB: Assertion failure in thread 28680 in file fsp0fsp.c line 3034" an error related to database corruption on restart at 12:44. Error continued on following restart attempts.
myisamcheck was run to repair isam parts of DB. innodb_force_recovery was set to 4 and then 5 to allow data saving. Errors reported in en cur. Worked around mostly with different dumping methods at 13:48.
Ariel has hours old cur and 12 May history. Last full dump was a week ago.
Site was online in read only mode from 15:42 to to 18:15 or so.
Work to save latest cur from Ariel (new server) and dump all possible data from Suda is ongoing.
One or two records from en current versions (en cur) is known lost (Virginia Beach city and/or Surrey, Virginia perhaps) from Suda; may be available from Ariel, old versions probably fine. Those were probably due to updates to those articles being saved while the server was crashing.
At 20:00 Eastern the backup had completed de. Moving on to en an error in the table en old (history versions) was found. Backup of the rest continued with recovery work on en old postponed until the backup had completed all possible work and uncovered all possible problem areas.
Also at about 20:00 Eastern Ariel stopped acccepting remote connections. This was just before running the MySQL test suite to try to determine whether it would be safe to switch to Ariel for production.
At 22:40 Eastern it is noticed that NFS2 was mounted to hold the backups because it hadn't been forced to use 3 and 2 was the default for backwards compatibility. That caused the backup of en old to fail as it passed 2GB, which is the maximum size of a file with NFS2. Backup continued with other tables and any which are larger will be backed up later after remounting with NFS3.
On Monday, June 7, at 01:03 Eastern remounting with NFS 3 was done, with some difficullties along the way.
At 05:30 Eastern the backup of the database was still continuing.
Before 08:00 US Eastern the rebuild from the fresh backup had started.
Ariel was restarted successfully at 13:40 US Eastern and work to have it serve pages again is ongoing while the restore to Suda proceeds. Ariel is likely to be no more reliable than before but it will offer more current service than search engines and mirrors.
At 15:48 US Eastern the load of the old table (all old article versions) for English Wikipedia into Suda finished, some 4 million different article versions. This was the largest single table in the restore job. The reload is continuing.
At 15:55 US Eastern read only service from Ariel resumed, with current article versions from early Sunday and a gap in revision history between may and present. When Suda is in use all of the history will be present - this is just temporary.
At 18:00 Eastern reloading of the database on Suda is nearing completion, with zh (Chinese) the largest remaining loading task. At this time, the main status display changed from "estimated downtime" to "estimated time of resumption of normal service", about one hour in the future.
17:45:36 up to pl sstarting reload 17:48:09 sv and zh are biggish 17:48:35 qu 17:48:38 rm 17:48:38 rn 17:48:39 ro 17:49:56 ru 17:50:41 sk 17:50:42 sl 17:52:00 sq 17:52:03 sr 17:52:16 ss 17:52:20 st 17:52:24 su 17:52:26 sv 17:58:40 ta
At about 18:40 Ariel stopped serving pages and the site was switched to offline (cached pages only) mode.
21:23 Eastern, a message appears on wikipedia: As of 1:23 UTC on 8 June, we think we're pretty darn close to getting back online. Soon folks!
Normal service resumed at 22:40 US Eastern time on 7 June (02:40 8 June by this UTC display).
- These Rambot machine-generated pages were damaged:
- The list of pages using each image was erased and will be regenerated.
- Archive tables weren't included in the backup and may be retrievable from old database files if necessary. Archive tables are not the old versions of articles - those are fine.
- Edits saved in the few minutes before the event may be lost, or not, depending on the exact timing.
After the fact analysisThe probable cause was starting a second copy of the database with the same memory allocation as the primary database, over-allocating the RAM on the server. That or the following kill -9 probably caused the database to save a corrupt page, though it's possible that both databases attempting to update the same database files did so or that all of these caused damage. The second database was started to test a modified database configuration, with the intent that it shouldn't do any harm to the running database.
The database should have recovered automatically and lost no more than a few minutes of updates. It didn't. Recovery efforts which should have worked around a damaged database page also didn't work in this case.
The project should have had at least one replicating child database to switch to. It had one but there were problems keeping it in synch and developer time had been consumed by the pre-release of the new wiki software, causing the time necessary to resolve the problem and get a reliable child in place not to be spent. That left no up to the minute hot spare database available and left the choice of a week old backup or dumping and reloading the whole database.
With an 80GB database, just copying a backup from one computer to another over a 100 megabit local network takes close to three hours. That effectively requires a working replication child if an outage of at least that long is to be avoided. Consideration is being given to a gigabit ethernet LAN to reduce the time involved in simply copying files around.
As of 25 June the project has two replicating child database - Ariel and Will - plus the primary database server. Ariel is serving some queries. Will is for backup only. Another one or more seems likely. This isn't an experience anyone is interested in repeating.
The mediawiki software itself wasn't a factor.