Introduction
The term ‘Big Data’ evokes images of large datasets, both structured and unstructured, in varied formats and drawn from many sources. Volume alone, however, does not define Big Data. Three V’s are commonly used to classify data as Big Data: Volume, Velocity and Variety. Volume is measured in terabytes, petabytes and beyond. Velocity refers to the speed at which data moves, as in real-time streams where records arrive at a rapid rate, sometimes at microsecond intervals. Variety refers to the mix of structured and unstructured data that must be handled.
Big Data infrastructure and technology are in use across industries such as banking, retail, insurance, healthcare and media. Management functions like storage, sorting, processing and analysis at such colossal volumes cannot be handled by conventional database systems or technologies. This is where frameworks come into the picture. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing; they help provide insights, incorporate metadata and aid decision making aligned to business needs.
There are many frameworks in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm. Some, like Presto, score high on utility, while others, like Flink, have great potential. Still others deserve a mention, such as Samza, Impala and Apache Pig. Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting. This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters. Hadoop consists of multiple layers, such as HDFS, YARN and MapReduce, that work together to carry out data processing.
- HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage across the nodes of a cluster. If a cluster node fails, the data can still be made available for processing from replicas held on other nodes.
- YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.
- MapReduce is the software layer that functions as the batch processing engine.
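To make the batch processing model concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using a word count as the job. It is plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions; in a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big tools", "data tools"]
    print(word_count(sample))
```

Because every intermediate result would be written to and read from disk between phases in Hadoop, this model favours throughput over latency.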
Pros include cost-effectiveness, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, suitability for R&D and high availability through an excellent failure-handling mechanism.
Cons include vulnerability to security breaches, processing overheads because it does not perform in-memory computation, unsuitability for stream and real-time processing, and issues in processing large numbers of small files.
Organisations powered by Hadoop include Amazon, Adobe, AOL, Alibaba, eBay, Facebook, etc.
Apache Spark
The Spark framework was developed at the University of California, Berkeley. It is a batch processing framework with enhanced stream processing capabilities. With in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.
The Spark framework is composed of five layers.
- HDFS and HBase: They form the first layer of data storage systems.
- YARN and Mesos: They form the resource management layer.
- Core engine: This forms the third layer.
- Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR for R users and MLlib for machine learning algorithms.
- The fifth layer contains the application programming interfaces, available in languages such as Java and Scala.
Spark can function as a standalone cluster along with a capable storage layer or it can provide seamless integration with Hadoop.
It supports some of the popular languages like Python, R, Java, and Scala.
Pros include scalability, lightning-fast processing through fewer disk I/O operations, fault tolerance, support for advanced analytics applications with superior AI implementation, and seamless integration with Hadoop.
Cons include complexity of setup and implementation, language support limitations, and the fact that it is not a genuine streaming engine.
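The remark that Spark is not a genuine streaming engine refers to its classic streaming model, which collects incoming records into small batches and runs a batch job on each one. The sketch below illustrates that idea in plain Python, not the Spark API. The names `micro_batches` and `process_batch` are hypothetical, and batching by record count is a simplification made here; Spark batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical batch job: here it just totals the values in one batch."""
    return sum(batch)


if __name__ == "__main__":
    readings = [3, 1, 4, 1, 5, 9, 2]
    for batch in micro_batches(readings, batch_size=3):
        print(batch, "->", process_batch(batch))
```

A record is only processed once its batch is complete, which is why latency in this model cannot drop below the batch interval.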
Organisations powered by Spark include Alibaba TaoBao, Amazon, Autodesk, Baidu, Hitachi Solutions, NASA JPL – Deep Space Network, Nokia Solutions and Networks, etc.
Storm
Storm, an open-source framework, was developed in the Clojure language specifically for near real-time data streaming. It is independent of the application development platform, can be used with any programming language and guarantees delivery of data with very low latency.
In the Storm architecture, there are two types of node: the Master Node and the Worker/Supervisor Node. The master node monitors machines for failures and is responsible for task allocation. If a node fails, its tasks are reassigned to another node.
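The division of labour between master and worker nodes can be sketched as follows. This is a toy model in plain Python, not Storm's own API: the `Master` class, its methods and the least-loaded allocation policy are all assumptions made for illustration.

```python
class Master:
    """Toy master node: allocates tasks and reassigns them when a worker fails."""

    def __init__(self, workers):
        # Map each worker name to the list of tasks currently assigned to it.
        self.assignments = {worker: [] for worker in workers}

    def _least_loaded(self):
        """Pick the worker with the fewest tasks (first one wins on a tie)."""
        return min(self.assignments, key=lambda w: len(self.assignments[w]))

    def allocate(self, task):
        """Assign a task to the least-loaded healthy worker."""
        worker = self._least_loaded()
        self.assignments[worker].append(task)
        return worker

    def handle_failure(self, failed_worker):
        """Remove a failed worker and hand its tasks to the remaining ones."""
        if len(self.assignments) < 2:
            raise RuntimeError("no healthy worker left to take over")
        orphaned = self.assignments.pop(failed_worker)
        for task in orphaned:
            self.allocate(task)
        return orphaned


if __name__ == "__main__":
    master = Master(["w1", "w2"])
    for task in ["t1", "t2", "t3", "t4"]:
        master.allocate(task)
    master.handle_failure("w1")
    print(master.assignments)
```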
Pros include ease of setup and operation, high scalability, good speed, fault tolerance and support for a wide range of languages.
Cons include complex implementation, debugging issues and a steep learning curve.
Organisations powered by Storm include Twitter, Yahoo, Verisign, Baidu, Alibaba, etc.
Apache Flink
Apache Flink, an open-source framework, is equally good for both batch and stream data processing. It is suited for cluster environments and is built around the concept of streams and the transformations applied to them. It has been described as the 4G of Big Data and is claimed to be up to 100 times faster than Hadoop MapReduce.
The Flink system contains multiple layers:
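The idea of chaining transformations over a stream, with each record handled as it arrives, can be illustrated with Python generators. This is only a sketch of the concept and does not use the Flink API; the function names and the `user,amount` record format are assumptions chosen for the example.

```python
def parse(events):
    """Transformation 1: turn raw 'user,amount' strings into tuples."""
    for raw in events:
        user, amount = raw.split(",")
        yield user, float(amount)


def only_large(records, threshold):
    """Transformation 2: keep records at or above a threshold."""
    for user, amount in records:
        if amount >= threshold:
            yield user, amount


def running_total(records):
    """Transformation 3: keyed running sum, emitted after every record."""
    totals = {}
    for user, amount in records:
        totals[user] = totals.get(user, 0.0) + amount
        yield user, totals[user]


if __name__ == "__main__":
    events = ["ann,10", "bob,3", "ann,5", "bob,20"]
    pipeline = running_total(only_large(parse(events), threshold=5))
    for result in pipeline:
        print(result)
```

Each record flows through all three transformations before the next one is read, which is the entry-by-entry behaviour that keeps latency low.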
- Deploy Layer
- Runtime Layer
- Library Layer
Pros include low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, and compatibility with Hadoop.
Cons include a few scalability issues.
Organisations powered by Flink include AWS, Uber, CapitalOne, Alibaba.com, etc.
Hive
Apache Hive, originally developed at Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.
The key components of the Hive architecture include:
- Hive Clients
- Hive Services
- Hive Storage and Computing
The Hive engine converts SQL queries or requests to MapReduce task chains. The engine comprises:
- Parser: It goes through the incoming SQL requests and sorts them.
- Optimizer: It goes through the sorted requests and optimises them.
- Executor: It sends tasks to the MapReduce framework.
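To see how a SQL request can map onto MapReduce tasks, consider a simple aggregate query. The sketch below is plain Python written for illustration, not Hive's actual generated plan. The table, the column names and the functions `map_task`, `reduce_task` and `run_query` are hypothetical.

```python
from collections import defaultdict

# Query being illustrated (HiveQL-style):
#   SELECT dept, SUM(salary) FROM employees
#   WHERE salary > 1000 GROUP BY dept;


def map_task(row):
    """Map: apply the WHERE filter and emit (group key, value)."""
    if row["salary"] > 1000:
        yield row["dept"], row["salary"]


def reduce_task(dept, salaries):
    """Reduce: apply the SUM aggregate to one group."""
    return dept, sum(salaries)


def run_query(rows):
    """Chain the tasks: map each row, group by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        for dept, salary in map_task(row):
            groups[dept].append(salary)  # shuffle step: group values by key
    return dict(reduce_task(dept, salaries) for dept, salaries in groups.items())


if __name__ == "__main__":
    employees = [
        {"dept": "ops", "salary": 1500},
        {"dept": "ops", "salary": 900},
        {"dept": "dev", "salary": 2000},
        {"dept": "dev", "salary": 3000},
    ]
    print(run_query(employees))
```

In this reading, the WHERE clause becomes part of the map step and the GROUP BY with its aggregate becomes the reduce step.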
Pros include its own SQL-like query language, HiveQL, suitability for data-intensive jobs, support for a wide range of storage systems and a shorter learning curve.
Cons include unsuitability for online transaction processing.
Organisations powered by Hive include PayPal, Johnson & Johnson, Accenture PLC, Facebook Inc., J. P. Morgan, Hortonworks Inc, Qubole, etc.
Presto
Presto is an open-source distributed SQL tool most suited for smaller datasets of up to 3 TB.
The Presto engine includes a coordinator and multiple workers. When a client submits queries, the coordinator parses and analyses them, plans their execution and distributes the work among the workers for processing.
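The coordinator and worker roles can be sketched with a thread pool standing in for the workers. This is an illustration in plain Python, not Presto's implementation: the function names, the round-robin split of the rows and the use of threads are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count_matches(split, predicate):
    """Worker: evaluate the query fragment on one split of the data."""
    return sum(1 for row in split if predicate(row))


def coordinator(rows, predicate, num_workers=3):
    """Coordinator: plan the splits, distribute them, merge partial results."""
    # Planning step: divide the rows into one split per worker (round-robin).
    splits = [rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(worker_count_matches, splits,
                            [predicate] * num_workers)
        # Merge step: combine the partial counts into the final answer.
        return sum(partials)


if __name__ == "__main__":
    # Roughly "SELECT COUNT(*) FROM numbers WHERE n % 2 = 0".
    print(coordinator(list(range(10)), lambda n: n % 2 == 0))
```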
Pros include minimal query degradation even under increased concurrent query workloads, a query execution rate reported to be three times faster than Hive, and a high degree of user-friendliness.
Cons include reliability issues.
Organisations powered by Presto include Airbnb, Facebook, Netflix, Dropbox, Nasdaq, Uber, etc.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems in a Hadoop cluster. It is written in C++ and Java.
It is decoupled from its storage engine and includes three main components:
- Impala Daemon (Impalad): It is executed on every node where Impala is installed.
- Impala StateStore
- Impala MetaStore
Impala has its own SQL-like query language.
Pros include in-memory computation, direct access to data on Hadoop nodes without data movement, smooth integration with BI tools like Tableau and ZoomData, and support for a wide range of file formats.
Cons include no support for serialisation and deserialisation of data, inability to read custom binary files, and the need to refresh tables for every record addition.
Organisations powered by Impala include Bank of America, J. P. Morgan, Apple, MetLife, etc.
Samza
Samza is an open-source tool for streaming data processing that was designed at LinkedIn.
It has three layers:
- Streaming
- Execution
- Processing
Pros include operational ease, high performance, horizontal scalability, the ability to execute the same code for batch processing as well as streaming data, and a pluggable architecture.
Cons include unsuitability for extremely low latency processing.
Organisations powered by Samza include Optimizely, Expedia, VMWare, ADP, etc.
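The advantage of running the same code for batch and streaming data can be pictured with a small Python sketch. It does not use the Samza API. The names `enrich`, `process` and `live_feed`, and the message fields, are hypothetical.

```python
def enrich(message):
    """Hypothetical per-message logic shared by both modes."""
    return {**message, "is_large": message["bytes"] > 1000}


def process(source):
    """Apply the same logic to any iterable: a bounded list or a live feed."""
    for message in source:
        yield enrich(message)


def live_feed():
    """Stand-in for an unbounded stream (kept finite here for the demo)."""
    yield {"id": 1, "bytes": 500}
    yield {"id": 2, "bytes": 4000}


if __name__ == "__main__":
    stored = [{"id": 1, "bytes": 500}, {"id": 2, "bytes": 4000}]
    print(list(process(stored)))       # batch: a bounded, stored dataset
    print(list(process(live_feed())))  # streaming: messages as they arrive
```

Only the source changes between the two runs; the processing logic is written once.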
Conclusion
There is no dearth of frameworks in the market for Big Data processing, and no single framework is the best fit for all business needs. To highlight a few, Storm seems best suited for streaming while Spark is the winner for batch processing. For every organisation or business, its own data is most valuable. Investing in Big Data frameworks involves spending: many frameworks are freely available while some come at a price. Depending on the project needs, make use of the trial versions on offer.
Understand the goals of the business. If possible, experiment with the framework on a smaller-scale project to understand its functioning better. Scalability should be borne in mind for future implementations. At times, the solution may lie in using multiple frameworks, depending on how well each suits the various process components involved. Investing in the right framework can pave the way for success in business.