Dropbox went down Friday night after routine upgrades went awry. While the company restored most functionality within three hours, problems persisted for some users until Sunday.
The outage was followed by spurious claims from hacking groups that they had successfully infiltrated Dropbox. There was no evidence to support such claims, and Dropbox quickly explained on Friday that the outage was due to an internal problem. Dropbox head of infrastructure Akhil Gupta then followed up last night with more details on what caused the downtime:
"We use thousands of databases to run Dropbox. Each database has one master and two slave machines for redundancy. In addition, we perform full and incremental data backups and store them in a separate environment.
"On Friday at 5:30 p.m. PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS.
"A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down."
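The kind of pre-install check Gupta describes can be sketched roughly as follows. This is a hypothetical illustration only, not Dropbox's script: the Machine record, its role and active_data_bytes fields, and the function names are all assumptions made for the example. The idea is simply that a machine is eligible for an OS reinstall only if it holds no active data, and everything else is skipped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Machine:
    # Hypothetical inventory record; field names are assumptions.
    hostname: str
    role: str               # "master", "slave", or "idle"
    active_data_bytes: int  # amount of live data on the machine


def safe_to_reinstall(machine: Machine) -> bool:
    # Reinstall only machines that serve no database role and hold no data.
    return machine.role == "idle" and machine.active_data_bytes == 0


def plan_reinstall(machines: List[Machine]) -> Tuple[List[Machine], List[Machine]]:
    # Split the fleet into machines to reinstall and machines to skip.
    to_reinstall = [m for m in machines if safe_to_reinstall(m)]
    skipped = [m for m in machines if not safe_to_reinstall(m)]
    return to_reinstall, skipped


fleet = [
    Machine("db1-master", "master", 1000),
    Machine("db1-slave-a", "slave", 1000),
    Machine("spare-1", "idle", 0),
    Machine("spare-2", "idle", 512),  # idle role but still holds data
]
to_reinstall, skipped = plan_reinstall(fleet)
print([m.hostname for m in to_reinstall])  # ['spare-1']
```

Checking two independent signals (role and data size) means a single wrong value is less likely to mark an active machine as safe, which is the failure mode the quoted explanation points to.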
User files were never at risk during the outage, the company said. The databases in question are used to provide services like photo album sharing, camera uploads, and API features.
via Ars Technica http://feeds.arstechnica.com/~r/arstechnica/index/~3/ejCaBXol9To/