Short Description:

This article provides a step-by-step overview of how to set up a cross-data-center data flow using Apache NiFi.


Traditionally, enterprises have dealt with data flows, or data movement, within their own data centers. But as the world has become flatter and a global presence has become the norm, enterprises now face the challenge of collecting and connecting data from across their global footprint. The NSA faced this problem about a decade ago and built a solution for it, a product that was later named Apache NiFi.

Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. With NiFi, as you will see, I can build a global data flow with minimal to no coding. You can learn the details from the Apache NiFi website; it is one of the best-documented Apache projects.

The focus of this article is one specific feature that I believe no other software product handles as well as NiFi: data transfer over the “site-to-site” protocol.

Business use case

One of the classic business problems is pushing data from a location with a small IT footprint to the main data center, where all the data is collected and connected. This small IT footprint could be an oil rig in the middle of the ocean, a small bank branch in a remote mountain town, a sensor on a vehicle, and so on. Your business wants a mechanism to push the data generated at these locations to, say, headquarters in a reliable fashion, with all the bells and whistles of an enterprise data flow: lineage, security, provenance, auditing, ease of operations, and so on.

The data generated at my sources comes in various formats, such as txt, csv, json, xml, audio, and image files, and ranges in size from a few MBs to GBs. Because I have low bandwidth at my source data centers, I want to break these files into smaller chunks, stitch them back together at the destination, and load the result into my centralized Hadoop data lake.
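To make the chunk-and-stitch idea concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how NiFi implements it: in NiFi this work is done by processors on the canvas (for example, SegmentContent at the source and MergeContent at the destination), with no hand-written code. The names `Chunk`, `split_into_chunks`, and `stitch` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One piece of a larger file, plus the metadata needed to reassemble it."""
    file_id: str    # identifies the original file this chunk belongs to
    index: int      # 0-based position of the chunk within the file
    count: int      # total number of chunks the file was split into
    payload: bytes  # the chunk's bytes


def split_into_chunks(file_id: str, data: bytes, chunk_size: int) -> List[Chunk]:
    """Split data into pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # An empty file still yields one (empty) chunk so it can be reassembled.
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    return [Chunk(file_id, i, len(pieces), piece) for i, piece in enumerate(pieces)]


def stitch(chunks: List[Chunk]) -> bytes:
    """Reassemble chunks of one file, in any arrival order, into the original bytes."""
    if not chunks:
        raise ValueError("no chunks to stitch")
    if len({c.file_id for c in chunks}) != 1:
        raise ValueError("chunks belong to different files")
    ordered = sorted(chunks, key=lambda c: c.index)
    # Every index from 0 to count-1 must be present exactly once.
    if [c.index for c in ordered] != list(range(chunks[0].count)):
        raise ValueError("missing or duplicate chunks")
    return b"".join(c.payload for c in ordered)
```

The point of carrying `file_id`, `index`, and `count` with each chunk is that chunks can arrive out of order over a slow link and the destination can still tell when a file is complete and rebuild it correctly.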

Solution Architecture

Apache NiFi (aka Hortonworks DataFlow) is a perfect tool to solve this problem. The overall architecture looks something like Fig 1.

We have Australian and Russian data centers from which we want to move data to the US headquarters. An edge instance of NiFi will sit in each of the Australian and Russian data centers and act as a data acquisition point. A NiFi processing cluster in the US will then receive and process all the data coming from these global locations. We will build this end-to-end flow without any coding, using only the drag-and-drop GUI.

Build the data flow

Here are the high-level steps to build the overall data flow.

Step1) Set up a NiFi instance at the Australian data center that will act as the data acquisition instance. I will create a local instance of NiFi to stand in for my Australian data center.

Step2) Set up a NiFi instance on a CentOS-based virtual machine that will act as my NiFi data processing instance. This could be a NiFi cluster as well, but in my case it will be just a single instance.

Step3) Build the NiFi data flow for the processing instance. This will have an input port, which indicates that this instance can accept data from other NiFi instances.

Step4) Build the NiFi data flow for the data acquisition instance. This will have a “remote process group” that talks to the NiFi data processing instance via the site-to-site protocol.

Step5) Test out the overall flow.
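Before testing the flow end to end, it can help to confirm that the acquisition host can open a TCP connection to the processing instance's site-to-site port at all, since a firewall between data centers is a common reason a remote process group fails to connect. The sketch below is a generic reachability check and is not part of NiFi; the function name `port_is_reachable` is hypothetical, and the host and port you pass in are whatever your own processing instance is configured with.

```python
import socket


def port_is_reachable(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, calling `port_is_reachable("nifi-processing.example.com", 10000)` from the acquisition machine (both values are placeholders) tells you whether the network path is open. It does not validate the site-to-site configuration itself, only that the port accepts connections.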

Attached is the document that provides detailed step-by-step instructions on how to set this up.


Reference :

The post Apache Nifi (aka HDF) data flow across data center appeared first on The Big Data Blog.

Source: Apache Nifi (aka HDF) data flow across data center
