Years ago, Facebook began playing around with the idea of implementing, The Open Source Software, Hadoop to handle their massive data consumption and aggregation. Their hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers locked onto the map-reduce programming model and started processing data sets that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features like the Facebook Lexicon, or are being used in the background to improve user experience on Facebook.
Facebook has multiple Hadoop clusters deployed now - with the biggest having about 2500 CPU cores and 1 PetaByte of disk space. Facebook is loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running daily against these data sets. The list of projects that are using this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality. An amazingly large portion of their engineers has run Hadoop jobs at some point.
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, Facebook has embraced SQL as a familiar paradigm to address and operate on large datasets. Most data stored in Hadoop's file system is published as Tables. Developers can explore the schemas and data of these tables much like they would do with a good old database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset. Operations on datasets can be written as a map and reduce scripts or using standard query operators like joins and group-bys or as a mix of the two. Over time, Facebook has added classic data warehouse features like partitioning, sampling, and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and they are looking forward to releasing an open source version of this project in the near future.
At Facebook, it is incredibly important that they use the information generated by and from their users to make decisions about improvements to the product. Hadoop has enabled them to make better use of the data at their disposal.