List

Recently a question was posed to the Apache NiFi (Incubating) Developer Mailing List about how best to use Apache NiFi to perform Extract, Transform, Load (ETL) types of tasks. The question was “Is it possible to have NiFi service setup and running and allow for multiple dataflows to be designed and deployed (running) at the same time?”

The idea here was to create several disparate dataflows that run alongside one another in parallel. Data comes from Source X and it’s processed this way. That’s one dataflow. Other data comes from Source Y and it’s processed this way. That’s a second dataflow entirely. Typically, this is how we think about dataflow when we design it with an ETL tool. And this is a pretty common question for new NiFi users. With NiFi, though, we tend to think about designing dataflows a little bit differently. Rather than having several disparate, “stovepiped” flows, the preferred approach with NiFi is to have several inputs feed into the same dataflow. Data can then be easily routed (via RouteOnAttribute, for example) to “one-off subflows” if need be.

One of the benefits to having several disparate dataflows, though, is that it makes it much easier to answer when someone comes to you and says “I sent you a file last week. What did you do with it?” or “How do you process data that comes from this source?” You may not know exactly what happened to a specific file that they sent you, step-by-step, because of the different decision points in the flow, but at least you have a good idea by looking at the layout of the dataflow.

So we can avoid merging the data if we would like. For the sake of an example, let’s assume that we have 4 different data sources. For each of them, they are going to send us some text data that needs to be pushed into HDFS. Maybe it’s compressed, maybe it’s not. So your flow will look like this:

Disparate Flows

Now, let’s say that that you’ve got a new requirement. When you’re sending text data to HDFS, each file that is pushed to HDFS needs to have 1,000 lines of text or less (yes, that’s a contrived example and that’s probably never a good idea, but the point is valid.) Now, consider how much work it is to make all of those modifications. And let’s hope that you don’t miss any!

If we continue down this path, this can get hairy quickly, as we have several different dataflows side-by-side on the same graph. In order to aid in the visual representation, we can use Process Groups to provide a nice logical separation. If we do that for each of those, we end up with something like:

Grouped Disparate Flows

We can then double-click each of those Process Groups and see/edit what’s inside. But we still have the issue of having to change 4 different flows to make the change mentioned.

So let us consider the alternate approach of merging it all into a single dataflow, and we end up with a flow like this:

Merged Flow

Now, we have all of the data going to a single stream. If we want to update it, we just insert one new Processor:

Updated Merged Flow

And we’re done. We don’t have to make this change to insert a SplitText processor 3 more times.

The concern that we have here, though, as mentioned earlier, is that if all of the data is mixed together, as the dataflow grows larger, how do we know what happens to data that came from Source X, for example?

This is where the Data Provenance feature comes in. In the top right-hand corner there’s a toolbar with 8 icons. The 4th one is the Provenance icon (Provenance Icon). If we click on that, we can then search for data that has been processed. For this example, let’s simply searched for RECEIVE events.

This shows us all of the RECEIVE events that this instance of NiFi has seen within the time range searched:

Provenance Search Results

If we click the Lineage Graph icon () on the right, for the first file, we see exactly what happened to this piece of data:

Data Lineage

We see that a RECEIVE event occurred, and that generated a FlowFile. That FlowFile’s attributes were then modified, its content was modified, and then the FlowFile was forked, and dropped. At this point, we can right-click on each of these event nodes and choose to view the details of the event. For the RECEIVE event, we see:

Receive Event

From here, we can see that the RECEIVE event took place at 16:55:51 EST on 01/11/2015. The component that reported the event was named “Data From Source X” and was a GetFile Processor. We can also see that the URI of the data was file:/C:/temp/in/debug/LICENSE.gz. If we follow the lineage of that FlowFile, we see that the next event is an ATTRIBUTES_MODIFIED Event:

Attrs Modified Event

Since the event is of type ATTRIBUTES_MODIFIED, it immediately begs the question “What attributes were modified?” So clicking the “Attributes” tab shows us this:

Attrs Modified Event

As with any Provenance Event, we can see all of the attributes that were present on the FlowFile when the event occurred. Of interest here, we can see that the value of the “mime.type” attribute was changed from “No value set” (the attribute didn’t exist) to “application/gzip”. The next event in our lineage is a CONTENT_MODIFIED Event. If we view the details here, we will see:

Content Modified Event

Here, we can see that the content was modified by a Processor named Decompress. This makes sense, since the previous Event showed us that the MIME Type was “application/gzip”. After decompressing the data, we arrive at a FORK Event:

Fork Event

This event shows us that the FORK happened by the SplitText Processor. That is, the SplitText Processor broke a large FlowFile into many smaller FlowFiles. On the right-hand side of this dialog, we see that the file was broken into six different “child” FlowFiles. This is where things get fun! If we then close this dialog, we can right-click on the FORK Event and choose the “Expand” option. This will then pull back the lineage for each of those children, providing us with a more holistic view of what happened to this piece of data:

Expanded Lineage

Now, we can see that each of those six children was sent somewhere. Viewing the details of these events shows us where they were sent:

Send Event

The first file, for instance, was sent via the PutHDFS Processor with the filename “/nifi/blogs/thinking-differently/LICENSE.gz”. This occurred at 16:55:53 EST on 01/11/2015. We can also see in this dialog the “Lineage Duration” was “00:00:02.712” or 2.712 seconds. The “Lineage Duration” field tells us how long elapsed between the time when the original source data was received and the time at which this event occurred.

Finally, we have the DROP event. The DROP event signifies the end of line for a FlowFile. If we look at the details of this event, we see:

Drop Event

Of note here, we see that the DROP event was emitted by PutHDFS. That is, PutHDFS was the last component in NiFi to process this piece of information. We can also see in the “Details” field why the FlowFile was dropped: it was Auto-terminated by the “success” relationship.

NiFi’s Data Provenance capability allows us to understand exactly what happens to each piece of data that is received. We are given a directed graph that shows when a FlowFile was received, when it was modified, when it was routed in a particular way, and when and where it was sent – as well as which component performed the action. We are also able to see, for each event, the attributes (or metadata) associated with the data so that we can understand why the particular event occurred. Additionally, when many pieces of data are merged together or a single piece of data is split apart, we are able to understand fully the provenance of this data from the that it was received until the time at which it exited the flow. This makes it very easy to answer the question “I sent you a file last week. What did you do with it?” while providing a much more holistic view of the enterprise dataflow than would be available if we used many disparate flows.

Hopefully this post helps you to understand not only the way that we like to setup the flows with NiFi but also the benefits that we have as a result and the features that allow us to overcome any challenges that this approach may create.

 

Reference: https://blogs.apache.org/nifi/entry/basic_dataflow_design

The post NiFi: Thinking Differently About DataFlow appeared first on The Big Data Blog.

Source: NiFi: Thinking Differently About DataFlow

Leave a Reply

Your email address will not be published. Required fields are marked *

  Posts

1 2 3
February 17th, 2016

Kaggle Competition Past Winner Solutions

We learn more from code, and from great code. Not necessarily always the 1st ranking solution, because we also learn […]

February 7th, 2016

Installing Kafka on Mac OSX

Apache Kafka is a highly-scalable publish-subscribe messaging system that can serve as the data backbone in distributed applications. With Kafka’s […]

February 5th, 2016

Lucene In-Memory Search Example and Sample Code

More sample code: https://github.com/fnp/pylucene/tree/master/samples/LuceneInAction  Sample code import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.queryParser.ParseException; import org.apache.lucene.queryParser.QueryParser; import org.apache.lucene.search.*; […]

February 5th, 2016

PYLUCENE 3.0 IN 60 SECONDS — TUTORIAL and SAMPLE CODE

I pro­vide a basic index­ing and retrieval code using the PyLucene 3.0 API.Lucene In Action (2nd Ed) cov­ers Lucene 3.0, but […]

January 29th, 2016

NiFi: Thinking Differently About DataFlow

Recently a question was posed to the Apache NiFi (Incubating) Developer Mailing List about how best to use Apache NiFi […]

January 29th, 2016

Apache Nifi (aka HDF) data flow across data center

Short Description: This article provides a step by step overview of how to setup cross data center data flow using […]

January 24th, 2016

Accurately Measuring Model Prediction Error

When assessing the quality of a model, being able to accurately measure its prediction error is of key importance. Often, […]

January 9th, 2016

TIME SERIES FORECASTING – TAKING KAGGLE ROSSMANN CHALLENGE AS EXAMPLE

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Forecasting […]

January 7th, 2016

Getting Started with Markov Chains

There are number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting […]

December 26th, 2015

Hadoop filesystem at Twitter

Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of […]