Apache NiFi: what it is and a brief overview of its features

Today, on foreign sites devoted to Big Data, you can find mentions of Apache NiFi, a relatively new tool in the Hadoop ecosystem. It is a modern open-source ETL tool. A distributed architecture for fast parallel loading and processing of data, a large number of plug-ins for sources and transformations, and versioning of configurations are only some of its advantages. For all its power, NiFi remains fairly simple to use.
At Rostelecom, we are striving to develop our work with Hadoop, so we have already tried Apache NiFi and evaluated its advantages compared to other solutions. In this article, I will tell you what attracted us to this tool and how we use it.
Key features
NiFi uses a web interface for creating a dataflow. An analyst who has only recently started working with Hadoop, a developer, and a seasoned admin will all cope with it. The latter two can interact not only with the “rectangles and arrows” but also with the REST API for collecting statistics, monitoring, and managing DataFlow components.
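As a sketch of that REST API side (the host name below is a placeholder, and an unsecured HTTP listener on port 8080 is assumed; secured installations use HTTPS and authentication), aggregate flow statistics can be pulled from the `/nifi-api/flow/status` endpoint:

```python
import json
import urllib.request

def nifi_status_url(host, port=8080):
    """Build the URL of NiFi's aggregate flow-status endpoint."""
    return f"http://{host}:{port}/nifi-api/flow/status"

def fetch_status(host, port=8080):
    """Return aggregate DataFlow statistics (active threads, queued FlowFiles, ...)."""
    with urllib.request.urlopen(nifi_status_url(host, port)) as resp:
        return json.loads(resp.read())

# fetch_status("nifi-host.example.com") would return a JSON document with a
# "controllerStatus" section; here we only show the URL being built.
print(nifi_status_url("nifi-host.example.com"))
```

The same API also exposes per-processor and per-connection statistics, which is what makes external monitoring of a DataFlow possible.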

NiFi web management interface.
Below, I will show a few DataFlow examples for performing some common operations.
An example of downloading files from an SFTP server to HDFS
In this example, the ListSFTP processor lists files on a remote server. The result of this listing is used by the FetchSFTP processor to download the files in parallel across all nodes of the cluster. After that, attributes obtained by parsing its name are added to each file; these attributes are then used by the PutHDFS processor when writing the file to the destination directory.
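The attribute-extraction step can be illustrated with a small sketch (the file-name pattern and attribute names here are invented for the example, not taken from our production flow):

```python
import re

# Hypothetical file-name pattern: <type>_<yyyyMMdd>_<source>.csv
NAME_RE = re.compile(r"(?P<type>\w+)_(?P<date>\d{8})_(?P<source>\w+)\.csv")

def attributes_from_name(filename):
    """Mimic the UpdateAttribute step: derive FlowFile attributes from the name."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    attrs = m.groupdict()
    # The derived attributes then drive the PutHDFS target directory, e.g.:
    attrs["hdfs.path"] = f"/data/{attrs['type']}/{attrs['date']}/{attrs['source']}"
    return attrs

print(attributes_from_name("report_20181120_siteA.csv"))
```

In NiFi itself this logic would be expressed with Expression Language inside UpdateAttribute rather than in code.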
An example of loading syslog data into Kafka and HDFS
Here, the ListenSyslog processor receives the input message flow. After that, each group of messages is given attributes holding the time of its arrival at NiFi and the name of its schema in the Avro Schema Registry. Next, the first branch goes to the input of the QueryRecord processor, which reads the data according to the specified schema, parses it with SQL, and then sends it to Kafka. The second branch goes to the MergeContent processor, which aggregates the data for 10 minutes and then hands it to the next processor for conversion to the Parquet format and writing to HDFS.
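As a rough illustration of what the receiving step produces (a much-simplified RFC 3164 pattern; real syslog parsing and the actual Avro-schema step are more involved):

```python
import re
import time

# Simplified RFC 3164 pattern -- real syslog messages are more varied.
SYSLOG_RE = re.compile(
    r"<(?P<pri>\d+)>(?P<timestamp>\w{3} +\d+ [\d:]+) (?P<hostname>\S+) (?P<body>.*)")

def parse_syslog(line):
    """Roughly what ListenSyslog plus the attribute step yield for one message."""
    m = SYSLOG_RE.match(line)
    if m is None:
        raise ValueError("not a syslog line")
    record = m.groupdict()
    record["syslog.priority"] = record.pop("pri")
    record["nifi.arrival"] = time.time()  # arrival-time attribute added in the flow
    return record

rec = parse_syslog("<34>Nov 20 10:15:00 web01 sshd[42]: accepted connection")
```

In the real flow, this parsing is driven by the registered Avro schema rather than a hand-written regex.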
Here is another example of a DataFlow you can build:
Loading syslog data into Kafka and HDFS; cleaning data in Hive
Now about data conversion. NiFi allows you to parse data with regular expressions, run SQL on it, filter it, add fields, and convert one data format to another. It also has its own expression language, rich in operators and built-in functions. With it, you can add variables and attributes to the data, compare and compute values, and use them later when generating various parameters, such as the path for writing to HDFS or an SQL query to Hive. More details can be found in the Expression Language documentation.
An example of using variables and functions in the UpdateAttribute processor
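A few generic Expression Language snippets of the kind used in UpdateAttribute (the property names on the left are illustrative; `substringBeforeLast`, `now`, `format`, and `gt` are built-in EL functions):

```
# base name of the incoming file (strip the extension)
basename  =  ${filename:substringBeforeLast('.')}

# date-partitioned HDFS target directory
hdfs.dir  =  /data/syslog/${now():format('yyyy/MM/dd')}

# boolean attribute computed from the file size (bytes)
is.large  =  ${fileSize:gt(1048576)}
```

Attributes computed this way can then be referenced by downstream processors, for example as the Directory property of PutHDFS.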
The user can track the full path of the data and watch changes to its content and attributes.
Visualization of the DataFlow chain

View content and data attributes
For DataFlow versioning there is a separate service, NiFi Registry. By setting it up, you get the ability to manage changes: you can push local changes, roll back, or pull any previous version.
Menu Version Control
NiFi lets you control access to the web interface and separate user permissions. The following authentication mechanisms are currently supported:
- Based on certificates
- Based on username and password, via LDAP and Kerberos
- Via Apache Knox
- Via OpenID Connect
Simultaneous use of several mechanisms at once is not supported. FileUserGroupProvider and LdapUserGroupProvider are used for user authorization in the system. More details are available in the administration documentation.
As I said, NiFi can work in cluster mode. This provides fault tolerance and makes it possible to scale the load horizontally. There is no statically fixed master node. Instead, Apache ZooKeeper elects one node as the cluster coordinator and one as the primary node. The coordinator receives status information from the other nodes and is responsible for connecting and disconnecting them from the cluster.
The primary node is used to run isolated processors that should not run on all nodes at the same time.

NiFi operating in a cluster

Load distribution across cluster nodes using the example of the PutHDFS processor
A brief description of the architecture and components of NiFi
Architecture of a NiFi instance

NiFi is based on the concept of Flow-Based Programming (FBP). Here are the basic concepts and components that every user encounters:
- FlowFile - an object with content of zero or more bytes and associated attributes. This can be either the data itself (for example, a stream of Kafka messages) or the result of a processor (PutSQL, for example) that contains no data as such, only the attributes generated as a result of a query. Attributes are FlowFile metadata.
- FlowFile Processor - the entity that performs the main work in NiFi. A processor, as a rule, has one or several functions for working with FlowFiles: creating, reading/writing and changing content, reading/writing/changing attributes, and routing. For example, the ListenSyslog processor receives data via the syslog protocol and creates FlowFiles with the attributes syslog.version, syslog.hostname, syslog.sender, and others. The RouteOnAttribute processor reads the attributes of an input FlowFile and decides whether to redirect it to the appropriate connection with another processor, depending on the attribute values.
- Connection - provides the connection and transfer of FlowFiles between different processors and some other NiFi entities. A connection puts a FlowFile into a queue and then passes it further along the chain. You can configure how FlowFiles are selected from the queue, their lifetime, and the maximum number and total size of objects in the queue.
- Process Group - a set of processors, their connections, and other DataFlow elements. It is a mechanism for organizing multiple components into one logical structure and makes a DataFlow easier to understand. Input/Output Ports are used to receive and send data from Process Groups.
- FlowFile Repository - where NiFi stores everything it knows about every FlowFile currently in the system.
- Content Repository - the repository that holds the content of all FlowFiles, i.e., the data being transferred.
- Provenance Repository - contains the history of every FlowFile. Each time an event happens to a FlowFile (creation, modification, etc.), the corresponding information is recorded in this repository.
- Web Server - provides the web interface and REST API.
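The relationship between FlowFiles, processors, and connections can be sketched in a few lines (a toy model for illustration only, not NiFi's actual implementation):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Content of zero or more bytes plus attribute metadata."""
    content: bytes = b""
    attributes: dict = field(default_factory=dict)

class Connection:
    """A FIFO queue between processors with a configurable object limit."""
    def __init__(self, max_objects=10000):
        self.queue = deque()
        self.max_objects = max_objects

    def put(self, ff):
        # Back pressure: refuse new FlowFiles once the queue is full,
        # analogous to the configurable queue limits mentioned above.
        if len(self.queue) >= self.max_objects:
            return False
        self.queue.append(ff)
        return True

def route_on_attribute(ff, routes, attr, default):
    """RouteOnAttribute-style decision: pick a connection by attribute value."""
    routes.get(ff.attributes.get(attr), default).put(ff)

errors, rest = Connection(), Connection()
ff = FlowFile(b"<11>...", {"syslog.severity": "err"})
route_on_attribute(ff, {"err": errors}, "syslog.severity", rest)
```

The real system adds persistence (the three repositories), scheduling, and provenance tracking on top of this basic flow of objects through queues.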
Conclusion
With NiFi, Rostelecom was able to improve its mechanism for delivering data to the Data Lake on Hadoop. The whole process has become more convenient and reliable. Today, I can say with confidence that NiFi is great for loading data into Hadoop. We have no problems with its operation.
By the way, NiFi is included in the Hortonworks DataFlow distribution and is actively developed by Hortonworks itself. It also has an interesting subproject, Apache MiNiFi, which allows you to collect data from various devices and integrate it into a DataFlow inside NiFi.
Additional information about NiFi
- Official project documentation
- A collection of interesting articles about NiFi from one of the project participants
- A blog about NiFi from one of the developers
- Articles on the Hortonworks portal
Perhaps that's all. Thank you all for your attention. Write in the comments if you have questions. I will answer them with pleasure.