Map Reduce for the People

This is a guest post by Russell Jurney, a technologist and serial entrepreneur. His new startup, Cloud Stenography, will launch later this year. The article grew out of a simple question on Twitter about the importance of MapReduce. Some subjects take much more than 140 characters.

The Technical Situation in Brief

The advent of the personal computer and the VisiCalc spreadsheet laid the foundation for a revolution in computing, business and life, one in which normal people could carry out sophisticated accounting, analysis and forecasting to inform their decisions and arrive at better outcomes. As Moore’s law has progressed, processors have become faster, computers have been networked together, and large volumes of highly granular data have been collected. Yet analyzing terabyte datasets with the ease of a spreadsheet has been limited by the widening gap between processor speed and computer I/O (input/output) speed. Processor clock speeds have climbed far faster than disk, RAM or bus speeds. Put simply: we have cheap and numerous computing resources and abundant data, but bringing those resources to bear on that data to generate real value from it has proven exceedingly difficult.

The widespread use of relational databases to access data in pre-defined static relationships has also limited our ability to discover and infer new and unique relationships among data. Dynamic analysis of large volumes of data in relational databases requires exhaustive pre-calculation of indexes and summaries of data for each relationship, and scaling relational databases to handle large datasets is a complex, painful and expensive process. As a result, business intelligence systems relying on relational databases are prohibitively complex and expensive. Other methods of raw parallel computation, such as MPI, are exceedingly difficult. Such ‘smart kid only’ technologies have significant barriers to entry for mere mortals. In fact, multi-threaded, shared-memory computation in languages like C++ is considered one of the most difficult, arcane areas of computer science, and has led to entire languages aimed at making concurrency easier.

MapReduce As the Way Forward

To extract value from large piles of data, we must escape the bounds of I/O by going parallel and having many processors work on the data at once, without grinding our development to a halt dealing with complex algorithms and frameworks. MapReduce and the platforms that implement it satisfy this requirement for a surprisingly broad set of problems. MapReduce is a simple way to process data in parallel across many commodity machines: a map step transforms each piece of the input independently, and a reduce step combines the results. You already benefit from the power of MapReduce every day. It is the pattern pioneered by Google to bring you the effective search on which we now all depend.
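To make the pattern concrete, here is a minimal sketch of a word count, the customary first MapReduce example, written in plain Python. It is an illustration under stated assumptions, not how Hadoop or Google's system is implemented: the function names are hypothetical, and a local thread pool stands in for the cluster of commodity machines. The point is only that each map call and each reduce call is independent, which is what lets a real platform spread them across many machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_counts(item):
    """Reduce step: collapse all values for one key into a single total."""
    key, values = item
    return key, sum(values)


def word_count(lines, workers=2):
    # The thread pool is a stand-in (an assumption of this sketch) for the
    # many machines a real MapReduce platform would schedule work onto.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, lines))
        reduced = list(pool.map(reduce_counts, shuffle(mapped).items()))
    return dict(reduced)
```

Calling word_count(["the quick fox", "the lazy dog"]) returns a dictionary in which "the" maps to 2 and every other word maps to 1. Nothing in map_words or reduce_counts knows how many workers exist, which is why the same two functions scale from a laptop to a cluster.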

MapReduce is the design pattern that, in combination with recent developments in cloud computing and cheap, plentiful broadband, will bring us spreadsheet-style analysis of vast amounts of data that are ill-suited to traditional database management systems in both scale and structure. It offers a cost-effective way for any business to harness massive amounts of computational power in the cloud for short periods of time. That power can be applied to complex computations on large volumes of data that would be prohibitively expensive and time-consuming on an individual machine, or that would otherwise require building a data center.

The Business Impact

What does this mean for your business? Knowledge of MapReduce has spread beyond Google, and it is now used by an increasing number of companies to extract value from web-scale data. Facebook, Yahoo, Cloudera and many others have embraced MapReduce in the form of Apache Hadoop, the platform around which most open discussion of MapReduce has occurred. As a result, a new generation of startups is rising that will take advantage of MapReduce to bring the same power that Google pioneered on search to bear on a variety of datasets. New opportunities exist by ‘thinking big’ and extracting value from ever-increasing streams and volumes of data.

Example 1: Proving Global Warming

What does this really mean? It means that developers will have a clear way to reduce vast datasets to scales they can work with to extract information to inform your decisions. In this example from Cloudera, Hadoop and Pig are used to query a 138GB log of weather history for the last 100 years from the National Climatic Data Center to reduce that vast data to a scale the developer is comfortable working with.

As a pile of data, the NCDC log informs nothing. When queried via map/reduce using Hadoop and Pig, it yields an informative chart that shows us an important trend. Would that chart inform a discussion about global warming? If you could get such clear visualizations of every detail of your business that is critical to your success, would it inform your decisions? Can you log and mine more data to streamline your operations?
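The kind of reduction described in this example can be sketched in a few lines of Python. This is not Cloudera's Pig query and it does not parse the real NCDC format. It assumes a hypothetical "year,temperature" record layout and made-up function names, purely to show how a map step and a reduce step shrink many raw readings into one number per year that can be charted.

```python
from collections import defaultdict


def map_reading(line):
    """Map step: turn one raw record into a (year, temperature) pair.

    The "year,temperature" layout is an assumption for illustration only.
    """
    year, temperature = line.split(",")
    return int(year), float(temperature)


def reduce_mean(temperatures):
    """Reduce step: collapse all readings for one year into their mean."""
    return sum(temperatures) / len(temperatures)


def mean_temperature_by_year(lines):
    # Group the mapped pairs by year, then reduce each group independently.
    by_year = defaultdict(list)
    for year, temperature in map(map_reading, lines):
        by_year[year].append(temperature)
    return {year: reduce_mean(temps) for year, temps in sorted(by_year.items())}
```

Given the toy records "1901,10.0", "1901,12.0" and "1902,11.0", the function returns one mean per year, which is the sort of small table a developer can plot directly. On a real cluster the grouping and the per-year reduction would run on different machines, but the logic would keep the same shape.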

Example 2: A Supercomputer for Every Biologist

When Amazon S3, EC2 and MapReduce via Hadoop are applied to the RMAP algorithm of genetic analysis, thanks to the work of one grad student, the result is a point-and-click supercomputer for every biologist who wants one, in the form of CloudBurst for Amazon Elastic MapReduce. Now any biologist who wants a supercomputer for this kind of genetic analysis can have one by the hour, and it’s as easy as point and click. More map/reduce genetic analysis algorithms are sure to follow. That’s revolutionary.

Conclusion

We are constrained in our strategies by what we imagine possible. MapReduce and cloud computing open broad possibilities and business opportunities by placing a usable supercomputer, rented by the hour, in the hands of every startup that wants one. There is no problem that you lack the processing power to solve; it’s just a question of whether the hourly cost is profitable. That’s a profound change from being bound to one machine. As a result of this shift, smaller companies can attack ‘bigger’ problems without a large up-front investment in hardware or software infrastructure.

A new renaissance in computing is coming that will be comparable to the business adoption of the personal computer and VisiCalc, and MapReduce will drive it.