Force of Good

Map Reduce for the People

Apr 09, 09 in Computing, Internet   12 Comments

This is a guest post by Russell Jurney, a technologist and serial entrepreneur.  His new startup, Cloud Stenography, will launch later this year.  The article grew out of a simple question on Twitter about the importance of MapReduce.  Some subjects take much more than 140 characters.

The Technical Situation in Brief

The advent of the personal computer and the VisiCalc spreadsheet laid the foundation for a revolution in computing, business and life: ordinary people could carry out sophisticated accounting, analysis and forecasting to inform their decisions and arrive at better outcomes. As Moore’s law has progressed, processors have become faster, computers have been networked together, and large volumes of highly granular data have been collected. Analyzing terabyte datasets as easily as a spreadsheet has been held back by the gap between processor speed and computer I/O (input/output) speed. Processor clock speeds have risen steadily, but disk, RAM and bus speeds have not kept pace. Put simply: we have cheap and numerous computing resources and abundant data, but bringing those resources to bear on that data to generate real value has proven exceedingly difficult.

The widespread use of relational databases to access data in pre-defined, static relationships has also limited our ability to discover and infer new relationships among data. Dynamic analysis of large volumes of data in relational databases requires exhaustive pre-calculation of indexes and summaries for each relationship, and scaling relational databases to handle large datasets is a complex, painful and expensive process. As a result, business intelligence systems that rely on relational databases are prohibitively complex and expensive. Other methods of raw parallel computation, such as MPI, are exceedingly difficult to use. Such ‘smart kid only’ technologies have significant barriers to entry for mere mortals. In fact, multi-threaded, shared-memory computation in languages like C++ is considered one of the most difficult, arcane areas of computer science, which has led to entire languages aimed at making concurrency easier.

MapReduce As the Way Forward

To extract value from large piles of data, we must escape the bounds of I/O by going parallel: many processors work on the data at once, without development grinding to a halt over complex algorithms and frameworks. MapReduce and the platforms that implement it satisfy this requirement for a surprisingly broad set of problems. MapReduce is a simple way to process data in parallel across many commodity machines. You are already familiar with its power from daily use: it is the pattern pioneered by Google to bring you the effective search on which we now all depend.
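To make the pattern concrete, here is a minimal single-process sketch in Python of the three steps a MapReduce job goes through: map, shuffle (group by key) and reduce. It is an illustration only, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the word-count task are assumptions chosen for the example. On a real platform the map and reduce calls would run on many machines at once, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit (key, value) pairs for one input record.

    Here each record is a line of text and we emit (word, 1) per word.
    """
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this between the map and reduce phases.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over the input records.

    Each map_phase call depends only on its own record, and each
    reduce_phase call only on its own key, which is what lets a real
    platform spread them across many machines.
    """
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

The point of the design is that the developer writes only the map and reduce functions; distribution, grouping and fault handling are left to the platform.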

MapReduce is the design pattern that, in combination with recent developments in cloud computing and cheap, plentiful broadband, will bring us spreadsheet-style analysis of vast amounts of data that are ill-suited to traditional database management systems in both scale and structure. It offers a cost-effective way for any business to harness massive amounts of computational power in the cloud for short periods of time. That makes it possible to perform complex computations on large volumes of data that would be prohibitively expensive and time consuming on an individual machine, or that would otherwise require building a data center.

The Business Impact

What does this mean for your business? Knowledge of MapReduce has spread beyond Google, and it is now used by an increasing number of companies to extract value from web-scale data. Facebook, Yahoo, Cloudera and many others have embraced MapReduce in the form of Apache Hadoop, the platform around which most open discussion of MapReduce has occurred. As a result, a new generation of startups is rising that will use MapReduce to bring the same power that Google pioneered in search to bear on a variety of datasets. New opportunities exist in ‘thinking big’ and extracting value from ever-increasing streams and volumes of data.

Example 1: Proving Global Warming

What does this really mean? It means that developers have a clear way to reduce vast datasets to a scale they can work with, and to extract information that informs your decisions. In this example from Cloudera, Hadoop and Pig are used to query a 138GB log of weather history for the last 100 years from the National Climatic Data Center, reducing that vast data to a scale the developer is comfortable working with. The result is this chart:

As a pile of data, the NCDC log informs nothing. When queried via map/reduce using Hadoop and Pig, we arrive at an informative chart that shows us an important trend. Would that chart inform a discussion about global warming? If you could get such clear visualizations of every detail of your business that is critical to your success, would it inform your decisions? Can you log and mine more data to streamline your operations?
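As a rough illustration of the kind of reduction described above, the following Python sketch computes a mean temperature per year with a map step and a reduce step. It is not the Cloudera Pig script, and the `year,temperature` line format and sample values are made up for the example; the real NCDC records use a different layout. The names `map_reading`, `reduce_mean` and `yearly_means` are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical input: one "year,temperature" reading per line.
# The actual NCDC log format is different; this is only for illustration.
SAMPLE_LINES = [
    "1901,10.0",
    "1901,12.0",
    "1902,11.0",
    "1902,bad",   # malformed reading, dropped by the map step
    "1902,13.0",
]


def map_reading(line):
    """Map step: parse one line into a (year, temperature) pair.

    Returns an empty list for malformed lines so they are skipped.
    """
    year, _, temp = line.partition(",")
    try:
        return [(int(year), float(temp))]
    except ValueError:
        return []


def reduce_mean(temps):
    """Reduce step: collapse all readings for one year into their mean."""
    return sum(temps) / len(temps)


def yearly_means(lines):
    """Group readings by year, then reduce each group to a single number."""
    grouped = defaultdict(list)
    for line in lines:
        for year, temp in map_reading(line):
            grouped[year].append(temp)
    return {year: reduce_mean(temps) for year, temps in sorted(grouped.items())}


means = yearly_means(SAMPLE_LINES)
print(means)
```

The output is one number per year, which is small enough to load into a spreadsheet or charting tool regardless of how large the input log was.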

Example 2: A Supercomputer for Every Biologist

When Amazon S3, EC2 and MapReduce via Hadoop are applied to the RMAP algorithm of genetic analysis, thanks to the work of one grad student, the result is a point-and-click supercomputer for every biologist who wants one, in the form of CloudBurst for Amazon Elastic MapReduce. Now any biologist who wants a supercomputer for this kind of genetic analysis can have one by the hour, and it's as easy as point and click. More map/reduce genetic analysis algorithms are sure to follow. That's revolutionary.

Conclusion

We are constrained in our strategies by what we imagine possible. MapReduce and cloud computing open broad possibilities and business opportunities by placing a usable supercomputer, rented by the hour, in the hands of every startup that wants one. There is no problem that you lack the processing power to solve; it's just a question of whether the hourly cost is profitable. That's a profound change from being bound to one machine. As a result of this shift, smaller companies can attack 'bigger' problems without a large up-front investment in hardware or software infrastructure.

A new renaissance in computing is coming that will be comparable to the business adoption of the personal computer and VisiCalc, and MapReduce will drive it.

Comments

Wow. That global warming chart really proves something, but not the trend line that is drawn on it.

Look at the trend of 1900-1950 vs 1950-2000.

Warming? or "did warm"?

Warming?  |  Apr 09, 09 at 12:02 PM

You're right that 'proving global warming' is sensationalist. But if you have access to that dataset via Pig, you can easily reply to such a chart with another chart, and then you're having an informed conversation with clear visualizations. Which is nice, no?

Russell Jurney  |  Apr 09, 09 at 12:24 PM

The article isn't about global warming - it's about MapReduce being used in science. The fact that you can even _try_ to make a salient point about the chart is because of map reduce converting _real_ data into useful information.

Your comment, however, is not useful information.

ANONOMAN  |  Apr 09, 09 at 12:30 PM

Yes, the Temp chart is probably not the best for illustrating the benefits of mapreduce. There is too much political freight associated with the interpretation of the data and relationship to the linear (why linear? why not least squares?) trend line, and this can overshadow mapreduce's contribution.

Fred  |  Apr 09, 09 at 12:30 PM

Fred, I must disagree with you. I think that the ability to access data in a way that informs political discourse is a great example of the power of mapreduce.

Russell Jurney  |  Apr 09, 09 at 01:36 PM

Fred, what do you mean about linear trend line vs. least squares? Isn't linear a model and least squares a loss function?

Eric  |  Apr 09, 09 at 02:35 PM

Generally speaking, Least Squares is a curve fitting technique. You can fit a linear, quadratic or cubic curve (i.e. polynomials), an exponential, or whatever combination of curves you guess will fit your dataset.
A quadratic curve can express accelerating or decelerating trends, whereas a purely linear fit expresses only steady growth or decline.

Funada  |  Apr 09, 09 at 04:06 PM

Eric, yes, I was imprecise. You are right that least squares is a method of fitting a function to data. One can fit a line or some higher order polynomial curve to the data using least squares. Eyeballing the data, it appears in this case information is lost in a linear fit, and that was what I was trying to say.

Russell, I agree that visualizing data is a great way to inform political discourse. I'd also claim that this graph is highly misleading because it is highly simplistic in the way it treats the temp data. My point is that the misleading nature of the graph detracts from the lesson about mapreduce. Maybe the lesson ought to be that mapreduce can help you massage data, but gathering useful information from massaged data is a Really Hard Problem.

For example, AGWers usually argue CO2 is a, if not the, primary temp forcing agent. Yet as the first commenter observed, the warming is greatest when the CO2 concentrations are lowest. We don't know if the start point of the data was cooler than a longer-term average, which would tend to give you the positive slope on a linear fit. More importantly, we have no idea what statistical confidence we can assign to the correlation between the line and the data points (what is the confidence level of the null hypothesis).

Fred  |  Apr 09, 09 at 06:25 PM

Fred, I guess I just think that throwing charts back and forth that actually have directly to do with the situation is about as productive a discourse on a matter as one can have, but maybe that's just the chart geek in me :)

Russell Jurney  |  Apr 10, 09 at 01:13 AM

I think that charts are helpful, but I've read Tufte, and he makes me think whenever I see visuals. I believe we are in violent agreement around the folk wisdom that 1 chart == 1 kiloword. So, chart throwing is IMHO not necessarily the most productive form of discourse, but it certainly helps in getting to the heart of the matter.

Fred  |  Apr 11, 09 at 10:31 AM

What's amazing to me is how one comment from a wingnut can so completely derail the primary point of a technical article.

Pragmatic Realist  |  Apr 12, 09 at 04:03 PM

Hi,
wow great article, definitely a lot of great content in here to read.

ddr  |  Sep 26, 09 at 02:45 AM
