Big data has been defined by industry experts in a number of ways. As generally described, big data is an extremely large set of data that can be analyzed computationally to reveal patterns, trends, and associations. But exactly how much data counts as big data? To study a simple bone fracture, five to ten case sets might be enough, while for image recognition the sample set can exceed two million images. Hence, as Peter Norvig of Google puts it, big data is qualitatively different rather than merely quantitatively different: the question is how much data is enough to learn from, not how much data you have.
Big data exists because of machine learning. What machines learn from big data is much like what human beings learn from life experience; a better analogy for this kind of learning might be wisdom. Humans learn from experience, figure out the rules to go by, and at the same time infer the situations where those rules don't apply. The ability to intuitively learn such corner cases is what advanced machine learning implies. Business intelligence, going forward, will be the ability to apply machine learning algorithms to big data: not just to answer questions about the past but also questions about the future, predicting the unknown from the known.
What has changed to make big data possible today? Big data has always existed; we just didn't collect it. Machine learning algorithms have likewise been around for a long time. What changed is that certain technologies became cheaper and more widely available, such as the advent of the Hadoop project and the launch of companies like Cloudera back in 2008. These made it affordable for many more companies to begin acquiring and storing data.
The simplest bottom-up hierarchy of a big data stack can be described as: the big data or storage layer at the bottom, the computational or processing ("big compute") layer in the middle, and the user experience or application layer at the top. Big data vendors use this basic model to implement and position the functionality of their systems. The first example that comes to mind for big compute is MapReduce, a parallelized computing system that can take all of this data, do some computing over it, and put the results back, perhaps performing some aggregation along the way. For example, when data is spread across thousands of machines, aggregating it becomes increasingly complex, and that is the job MapReduce was built for. MapReduce, however, was not originally designed to handle interactive queries.
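To make the map and reduce idea concrete, here is a toy single-machine sketch of a MapReduce-style word count. The function names and data are illustrative only; a real system would run the map phase in parallel across many machines and shuffle the pairs to reducers.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs from each document.
    In a real cluster this runs in parallel on many machines."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: aggregate the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big compute", "big data stack"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 2, 'compute': 1, 'stack': 1}
```

The key design point is that both phases operate on independent chunks, which is what makes the computation embarrassingly parallel.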
MapReduce was designed to be slow. As implemented at Google by Jeff Dean and Sanjay Ghemawat in the early 2000s, it was intended to perform one job: crawl and index the web. Their approach was to parallelize it over thousands of machines, and at that scale the probability of at least one machine going down is almost one. That leaves two possibilities: either the job starts all over again, in which case it will never finish, or it replaces the failed machine with a new one and picks up where it left off, in which case the system must be tolerant, and therefore slow, enough to do that.
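The second possibility, re-running only the failed work, can be sketched as a simple retry loop. This is a hypothetical illustration of the scheduling idea, not Google's actual implementation; the failure rate and task function are made up.

```python
import random

def run_task(task_id):
    """Simulated worker that fails randomly, standing in for a machine going down."""
    if random.random() < 0.3:
        raise RuntimeError(f"worker for task {task_id} died")
    return task_id * task_id

def run_with_retries(tasks, max_attempts=50):
    """Re-run only the failed tasks on 'fresh machines'
    instead of restarting the whole job from scratch."""
    results = {}
    pending = list(tasks)
    for _ in range(max_attempts):
        still_pending = []
        for t in pending:
            try:
                results[t] = run_task(t)
            except RuntimeError:
                still_pending.append(t)  # reschedule just this task
        pending = still_pending
        if not pending:
            break
    return results

print(run_with_retries(range(5)))
```

Because only failed tasks are rescheduled, the job always makes forward progress, at the cost of the bookkeeping and checkpointing overhead that makes batch MapReduce slow.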
The two popular frameworks for big data are Hadoop and Apache Spark. Hadoop is primarily a storage layer: HDFS, the Hadoop Distributed File System, provides distributed storage with replication for resilience, while MapReduce on top of it provides distributed processing. It can also run on commodity hardware and is therefore very affordable. Apache Spark takes a different approach: it is aimed at running many of these queries really fast. It builds on a basic fact of hardware, that accessing RAM is much faster than accessing disk, so it keeps working data in memory and is therefore faster than Hadoop for iterative and interactive workloads.
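The difference between the two approaches can be illustrated with a toy sketch: one query path re-reads the data from storage every time (Hadoop-style), while the other loads it once and reuses the in-memory copy (Spark-style). The file, records, and function names here are all hypothetical stand-ins, not the actual APIs of either framework.

```python
import json
import os
import tempfile

# Hypothetical dataset written to disk, standing in for HDFS.
records = [{"user": i, "clicks": i % 5} for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "events.json")
with open(path, "w") as f:
    json.dump(records, f)

def query_from_disk(predicate):
    """Hadoop-style: re-read the data from storage for every query."""
    with open(path) as f:
        data = json.load(f)
    return sum(1 for r in data if predicate(r))

cached = None

def query_in_memory(predicate):
    """Spark-style: load once, keep the working set in RAM, reuse it across queries."""
    global cached
    if cached is None:
        with open(path) as f:
            cached = json.load(f)
    return sum(1 for r in cached if predicate(r))

print(query_from_disk(lambda r: r["clicks"] > 2))   # 400
print(query_in_memory(lambda r: r["clicks"] > 2))   # same answer, no re-read
```

Both paths give the same answer; the in-memory path simply avoids the repeated disk I/O, which is where Spark's speedup on iterative workloads comes from.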
When we look at the various layers of a big data stack from a top-down view, it's important to note that we want the whole system to be fast and cheap. There is a human user at the top operating all of this, so on top of the bottom two layers sits the user experience and application layer: the big apps layer. When all three of these layers work effectively and harmoniously, the result can be called a very good big data stack.
Machine learning will become a property of every application rather than a standalone, isolated function. This will let machines access and learn from much wider data sets, exposing them to a much larger range of possibilities and enabling more humanistic actions and intuition. Many of these technologies would have failed fifty years ago on costly and incompatible hardware; their advancement in this day and age has been facilitated largely by the revolution in hardware and storage capacity that was unavailable previously. It beckons a future where machines and their learning become more natural and anticipative, as opposed to the merely instructional robots of today.
The above article is inspired by the a16z Podcast: Making Sense of Big Data, Machine Learning, and Deep Learning, with Christopher Nguyen.