“We provide real-time analytics capability on top of Hadoop - a key requirement for online businesses and enterprises to extract business insights and take action quickly”
Online businesses and enterprises are experiencing an information explosion as their customer touch points and interaction channels continue to expand to include e-commerce, blogs, forums, social networks, mobile media, dynamic ad placements, and more. Meanwhile, IT departments face the formidable task of tackling the inbound burst of data coming from many new sources, live or otherwise, and preparing it for real-time decision-making.
It is now well known that traditional database technologies are simply inadequate and unable to cope with these new live data streams. Apart from the sheer volume coming from all these sources, the need to pre-condition and shape this input data into a predefined schema makes any solution based on traditional database technology impossible to maintain.
However, there is intense competitive pressure to derive meaningful information from this massive influx of data, and business decisions increasingly must be made at this accelerated pace.
To address the sheer volume of input data, some form of parallel computing needs to be utilized. While parallel computing has been studied for many decades, it has only now reached the mainstream through the advent of the Hadoop framework and its associated ecosystem.
Hadoop is a software framework, written in Java™, for building and deploying Big Data storage and analysis systems using large distributed clusters running on commodity hardware. By design, Hadoop is ideal for processing large volumes of data, in batch, that can be easily decomposed using a computational paradigm called MapReduce where the query/application is divided into many small fragments of work, each of which may be executed (or re-executed) on any node in the cluster.
In effect, the Hadoop framework provides reliability and data distribution to applications transparently and provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Hadoop is a key technology driver to deal with unstructured and semi-structured data (Big Data) which can include raw text, binary data, web pages, email, documents, sensor data, click streams, log files, interaction data, image data, video metadata, and more.
Hadoop relies on many key components to deliver its full capabilities as discussed below.
The Hadoop Distributed File System (HDFS) is based on the Google® File System (GFS) and provides redundant storage of massive datasets on low-cost commodity servers. By design, HDFS distributes the data across all nodes, enabling efficient parallel processing with MapReduce.
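The core idea of redundant, distributed block storage can be sketched in a few lines of plain Java. This is a purely illustrative toy, not the actual HDFS NameNode placement policy; the class and method names here are invented for the example.

```java
import java.util.*;

// Toy sketch of HDFS-style block placement (illustrative only): a file is
// split into fixed-size blocks, and each block is assigned to `replication`
// distinct nodes so the data survives individual node failures.
public class BlockPlacement {
    public static Map<Integer, List<String>> place(long fileSize, long blockSize,
                                                   List<String> nodes, int replication) {
        int numBlocks = (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // spread replicas across distinct nodes, round-robin
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }
}
```

Because every block lives on several nodes, any node can serve reads for the blocks it holds, which is what gives HDFS its very high aggregate bandwidth.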
While HDFS automatically distributes data across the cluster, MapReduce distributes a task (query or application) across multiple nodes so the processing is done on data stored locally on each node. The Map and Reduce operations allow for automatic parallelism and distributed fault-tolerance, thus providing a clean abstraction for programmers.
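The map/shuffle/reduce pattern itself can be sketched with the classic word-count example in plain JDK code. This is an in-process illustration of the paradigm, not the actual Hadoop MapReduce API, which runs the same phases across many nodes.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-process sketch of the map/shuffle/reduce pattern:
// the classic word-count example.
public class WordCount {
    public static Map<String, Long> run(List<String> lines) {
        return lines.stream()
                // map phase: each line is split into individual words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle + reduce phase: group by word and sum the counts
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```

In real Hadoop, the map step runs on each node against its local data blocks, the framework shuffles intermediate (word, count) pairs by key, and the reduce step sums them; the abstraction is the same as above.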
MapReduce is only one way to enable parallel computing on the Hadoop platform, though it is typically the only one assumed in the context of Hadoop.
Hadoop is considered a batch-oriented system, without the ability to support fast decision-making in a real-time or semi-real-time manner. The database equivalent in the Hadoop ecosystem, following a new data organization paradigm without strict schemas (also referred to as “NoSQL”), is Apache HBase.
HBase is a column-oriented data store and is widely referred to as the Hadoop Database. It provides random, real-time read/write access to large amounts of data and allows one to manage tables consisting of billions of rows, with millions of columns. It is designed and optimized to run on Hadoop with data stored in HDFS, and it can process data in parallel across a Hadoop cluster with seamless scale-out characteristics as the input load increases.
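The data model behind a column-oriented store like this can be sketched with nested sorted maps. This is an illustrative toy, not the HBase client API: a table maps a sorted row key to a map of columns, and keeping rows sorted is what makes both random reads and range scans cheap.

```java
import java.util.*;

// Toy sketch of an HBase-style data model (illustrative only):
// a table is a sorted map of row key -> (column -> value).
public class ColumnStore {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    public void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    public String get(String row, String column) {
        NavigableMap<String, String> cols = rows.get(row);
        return cols == null ? null : cols.get(column);
    }

    // range scan over a contiguous slice of row keys (inclusive start, exclusive end)
    public SortedMap<String, NavigableMap<String, String>> scan(String from, String to) {
        return rows.subMap(from, true, to, false);
    }
}
```

Because columns are stored per row rather than in a fixed schema, two rows can have entirely different column sets, which is the schemaless “NoSQL” property described above.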
Hadoop is the fastest-growing Big Data technology, with wide enterprise adoption for processing large volumes of data. However, with Hadoop being an open source technology, other challenges remain, such as complexity, resource requirements, the lack of integrated toolsets to manage the stack, and its limited ability to handle real-time data streams. In addition, providing real-time analytics capability on top of Hadoop, for exploration, modeling, discovery, and predictions from large-scale datasets, is a key requirement for online businesses and enterprises to extract business insights and act on them quickly. This is where an important fusion of very complementary technologies should be used.

Real-time analytics requires three important components, among others, to allow quick real-time data exploration while extracting actionable insights. They are:
1. In-Memory Processing
2. Search Indexing
3. Data Modeling via Statistical Analysis, Data Mining and Machine Learning
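As a rough illustration of the second component, search indexing, here is a minimal inverted index in plain Java. This is a toy sketch with invented names, not any production search engine: each term maps to the set of documents containing it, so a query touches only the matching posting lists rather than scanning the whole corpus.

```java
import java.util.*;

// Minimal inverted-index sketch (illustrative only): term -> set of doc ids.
public class InvertedIndex {
    private final Map<String, SortedSet<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // documents containing ALL query terms (intersection of posting lists)
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);
        }
        return result == null ? Set.of() : result;
    }
}
```

Indexes like this are what make interactive, real-time exploration possible: lookups cost time proportional to the result set, not to the size of the dataset.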
The technologies most widely used by data scientists for advanced analytics and decision-making are R and, more recently, Apache Mahout. Mahout offers statistical analysis and scalable machine learning and data mining, with implementations of a wide range of algorithms encompassing clustering, classification, collaborative filtering, and frequent pattern mining. However, custom algorithms are also very important in the real-time context, since they often need to be tailored to strict time budgets and stream processing.
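To make the clustering idea concrete, here is a tiny one-dimensional k-means. This is purely illustrative with invented names; Mahout's actual implementations are distributed over Hadoop and far more general. Points are assigned to the nearest centroid, centroids move to the mean of their cluster, and the process repeats.

```java
import java.util.*;

// Tiny 1-D k-means sketch (illustrative only): alternate between assigning
// points to their nearest centroid and moving each centroid to the mean of
// the points assigned to it.
public class KMeans1D {
    public static double[] fit(double[] points, double[] initialCentroids, int maxIters) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIters; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int best = 0; // index of the nearest centroid to p
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sum[best] += p;
                count[best]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return centroids;
    }
}
```

In a real-time setting this batch loop is exactly what gets tailored: streaming variants update centroids incrementally per arriving point instead of re-scanning the whole dataset each iteration.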
Cetas provides Big Data Analytics capabilities for businesses.
Copyright © 2013 Cetas by VMware, Inc. All rights reserved.