Data Science

Big Data Infrastructure: tools and applications

I felt the need to put some clarity in my mind on growing Big Data applications cosmology. I will update this post as my own framework of best practices evolves, and I’ll probably post new articles on single applications as they come along under the tag Frameworks and Big Data.

Disclaimer: what follows is an attempt to schematize the complexity of the Big Data frameworks world, for my usage, I do not intend to create original software reviews. It may contain copy/paste of descriptions found on the internet.


Apache Hadoop and HDFS

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


It is the basis of most modern distributed infrastructures, most of the Big Data applications run on top of Hadoop, enhancing its features, or making working with Hadoop easier. In very simple words (apologies to the purists): you make use of the Hadoop file system, HDFS, to store files (in the data science field, they are commonly CSV files) into a number of computers, and use MapReduce operations (commonly programs written in Java) to perform operations across the computers in the cluster. This is needed when performing operations on a single machine is not feasible because of the amount of data.


Hbase: A NoSQL database on top of Hadoop

Hbase is a column-oriented NoSQL (Not Only SQL) database that sits on top Hadoop and uses HDFS to store information. It is defined “the Hadoop database” and comes after the Google’s BigTable project.

“Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’sBigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.”

Hive: a data warehouse infrastructure for distributed data

Hive is a software that run on top of Hadoop and facilitate querying and structure design to large distributed datasets. Hive uses a SQL-like language called HiveQL.


Parquet: a columnar storage format for Hadoop

“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.” It is framework independent, meaning that you can use and integrate Parquet with pretty much any framework you use to deal with data stored in Hadoop. It also provide a great level of data compression. It works greatly in combination with Impala, Hive and Spark.


Impala: an analytic database for Hadoop

“Impala is  integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop.” It is the only tool developed by a vendor (Cloudera) and not part of the Apache project of this list.


Spark: a general purpose data processing engine

Spark is a fast and general-purpose engine for large-scale data processing. It enables in-memory, parallel, distributed data processing. Its goal has been to generalize MapReduce to support new apps within the same engine, making the process more efficient for the engine and much simpler for the end user. Two features made it the hottest Big Data software (at the time of this article): 1. data are processed, cached and stored in-memory, reducing dramatically the computation time as opposed to Hadoop MapReduce jobs; 2. it has logically simple API that works well on the most popular programming languages.

Some key points:

  • Spark is becoming the “lingua franca” of data analytics!
  • Handles batch, interactive and real-time within a single framework
  • Native integration with Java, Python and Scala
  • Programming at higher level of abstraction
  • More general: MapReduce is just one set of supported constructs

It support in-memory computing, and its infrastructure is based on Resilient Distributed Dataset (RDD). It supports Java, Scala, Python and R, and has over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data. On top of the core engine, it comes with high level tools including Spark SQL, MLlib (for machine learning), GraphX (for graph analysis), Spark Streaming (for data streaming processing) and SparkR. Finally, it can run on HDFS, Cassandra, HBase, and S3 as well as standalone cluster mode.




Spark has the following components:

Spark Core: “Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.”

Spark SQL: Enable the power of SQL queries for the exploration of structured data. It easily integrates with other modules, like MLlib for machine learning.

Spark Streaming: brings the ability to process stream of data in real-time. It integrates real-time analytics to historical data. It works with a variety of sources like HDFS, Flume, Kafka and Twitter.

MLlib: A library for machine learning algorithms that can be included into Spark application usable in Java, Scala and Python.

GraphX: library to work with graph structured data (i.e. connection of Facebook friends) at scale.

SparkR: it is an R package that brings Spark into R. It provides a distributed data frame implementation supporting operations on large datasets.


Other very useful software

  • Flume: ingest real-time streaming into Hadoop
  • Sqoop: bidirectional data transfer between Hadoop and relational databases
  • Cassandra: very popular column-oriented data store
  • MongoDB: not really part of the Hadoop Ecosystem, it is a NoSQL document-oriented database. Very popular.
  • Drill: query engine for Hadoop, NoSQL and cloud storage. It makes joins possible among different data stores.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s