Apache Kafka: review
Today we offer a comparatively brief but sensible and informative article about the internals and applications of Apache Kafka. We expect to translate and publish the book by Neha Narkhede et al. by the end of the summer.
Streams API
This API is designed to be used within your own code base; it does not run on the broker. Functionally it is similar to the consumer API: it facilitates horizontal scaling of stream processing and its distribution across several applications (similar to consumer groups).
Stateless processing
Stateless processing is deterministic stream processing that does not depend on anything external to the record being processed. As an example, consider this simple data transformation: appending some text to a string
"Hello" -> "Hello, World!"
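The transformation above can be sketched in plain Python. This is a conceptual illustration of a stateless operation (analogous to `mapValues()` in Kafka Streams), not actual Kafka client code; the function and record names are invented for the example.

```python
# A stateless transformation: each record is processed independently,
# so no state from previous records needs to be stored anywhere.

def map_values(stream, fn):
    """Apply fn to the value of every (key, value) record."""
    return [(key, fn(value)) for key, value in stream]

records = [(None, "Hello"), (None, "Goodbye")]
greetings = map_values(records, lambda v: v + ", World!")
# greetings == [(None, "Hello, World!"), (None, "Goodbye, World!")]
```

Because no state is kept, such an operation can be restarted or parallelized freely across application instances.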
It is important to understand that streams and tables are, in essence, the same thing: a stream can be interpreted as a table, and a table as a stream.
A stream as a table
If you look at how synchronous database replication is performed, you will see that it is essentially stream replication: every change to a table is sent to the replica server. A Kafka stream can be interpreted in exactly the same way, as a stream of updates that are aggregated to yield the final result that appears in a table. Such tables are stored in a local RocksDB instance (by default) and are called KTables.
A table as a stream
A table can be considered a snapshot reflecting the latest value for each key in a stream. Just as a table can be built from a stream of records, a changelog stream can be built from a table's updates.
With each update (record), you can take a new snapshot of the stream
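The stream/table duality described above can be sketched in a few lines of Python. This is an illustration of the semantics, not Kafka code; all names here are invented for the example.

```python
# A table is the latest value per key in a stream; replaying a table's
# changelog rebuilds an identical table.

def table_from_stream(stream):
    """Materialize a table: keep the latest value seen for each key."""
    table = {}
    for key, value in stream:
        table[key] = value          # each record overwrites the previous value
    return table

def changelog_from_table_updates(stream):
    """Emit every table update as a changelog record."""
    log = []
    for key, value in stream:
        log.append((key, value))    # each update becomes a changelog record
    return log

stream = [("alice", 1), ("bob", 5), ("alice", 3)]
snapshot = table_from_stream(stream)
# snapshot == {"alice": 3, "bob": 5}

# Replaying the changelog reproduces the same table:
replayed = table_from_stream(changelog_from_table_updates(stream))
```

This is exactly why a KTable can be rebuilt from its backing changelog topic: the table and the stream carry the same information in two forms.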
Stateful processing
Some simple operations, for example filter(), are stateless: we do not have to store any data about previous records to execute them. In practice, however, most operations are stateful (e.g. count()), so they naturally need to store the state accumulated up to the present moment.
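The difference can be made concrete with a sketch of a counting operation (analogous to a grouped count() in Kafka Streams). This is illustrative Python, not Kafka client code; the data and names are invented.

```python
# A stateful operation: unlike filter(), count() must remember what it
# has already seen, so the running counts form a state store (a table)
# that must somehow survive failures.

from collections import defaultdict

def count_by_key(stream):
    """Return running counts per key after consuming the stream."""
    counts = defaultdict(int)   # in Kafka Streams this state lives in RocksDB
    for key, _value in stream:
        counts[key] += 1
    return dict(counts)

clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
totals = count_by_key(clicks)
# totals == {"alice": 2, "bob": 1}
```

The `counts` dictionary is precisely the state that raises the question below: where should it live so that a crash does not lose it?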
The problem with stateful stream processors is that processors sometimes fail! Where should this state be stored to ensure fault tolerance?
A naive approach is to simply store all state in a remote database and connect to that store over the network. The problem is that data locality is then lost, and the data itself travels back and forth over the network; both factors significantly slow down your application. A subtler but important problem is that your stream processing job becomes heavily dependent on the remote database, so the job is no longer self-sufficient (all of your processing can fail if another team makes changes to the database).
So, which approach is better?
Again, recall the duality of tables and streams. It is thanks to this property that streams can be converted into tables located exactly where the processing takes place. This also gives us a fault-tolerance mechanism: we store those streams on the Kafka broker.
A stream processor can keep its state in a local table (for example, in RocksDB), which is updated from an input stream (possibly after some arbitrary transformation). If the process fails, we can restore the state by replaying that stream.
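The restore-by-replay idea can be sketched as follows. This is a conceptual Python illustration of the mechanism, assuming a changelog that stands in for a Kafka topic; the class and names are invented for the example.

```python
# Fault-tolerant state via changelog replay: every local state update is
# also appended to a changelog (in Kafka, a compacted topic on the broker).
# After a crash, replaying the changelog rebuilds the local store.

class StateStore:
    def __init__(self):
        self.table = {}       # local store (RocksDB in Kafka Streams)
        self.changelog = []   # stand-in for a changelog topic on the broker

    def put(self, key, value):
        self.table[key] = value
        self.changelog.append((key, value))  # durably persisted update

    def restore(self):
        """Rebuild local state after a crash by replaying the changelog."""
        self.table = {}
        for key, value in self.changelog:
            self.table[key] = value

store = StateStore()
store.put("alice", 2)
store.put("bob", 1)
store.table = {}          # simulate losing local state in a crash
store.restore()
# store.table == {"alice": 2, "bob": 1}
```

The state stays local and fast, while durability is delegated to the broker, so the processing job remains self-sufficient.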
You can even have a remote database produce a stream, effectively broadcasting a changelog from which you rebuild the table on the local machine.
Stateful processing: joining a KStream with a KTable
Typically, stream processing code must be written in a JVM language, since that is the only platform with an official Kafka Streams API client.
A sample KSQL setup
KSQL is a new feature that allows you to write simple streaming jobs in a familiar SQL-like language.
You configure a KSQL server and query it interactively through the CLI to manage the processing. It works with exactly the same abstractions (KStream and KTable), provides the same benefits as the Streams API (scalability, fault tolerance), and greatly simplifies working with streams.
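To give a flavor of what such a session might look like, here is a hedged sketch of two KSQL statements; the topic and column names are illustrative, not taken from the original article.

```
-- Register an existing Kafka topic as a stream:
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A continuously updating table (a KTable) of view counts per user:
CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id;
```

Behind the scenes, the second statement is exactly the stateful count() scenario discussed earlier, with KSQL managing the state store for you.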
Perhaps this does not sound inspiring, but in practice it is very useful for experimenting with the material. Moreover, this model lets people who are not involved in development as such (for example, product owners) join in stream processing. I recommend watching the short intro video to see for yourself how simple everything is here.
Alternatives for stream processing
Kafka Streams is a perfect combination of power and simplicity. It is arguably the best tool on the market for streaming tasks, and integrating with Kafka is much easier than with alternative stream-processing tools (Storm, Samza, Spark, Wallaroo).
The problem with most other stream-processing tools is that they are difficult to deploy (and difficult to operate). A batch-processing framework such as Spark needs to:
Manage a large number of tasks across a pool of machines and distribute them efficiently in the cluster.
Dynamically package your code and physically deploy it to the nodes where it will run (along with configuration, libraries, and so on).
Unfortunately, a framework that tries to solve all of these problems itself ends up being too invasive: it tries to control every aspect of deploying, configuring, monitoring, and packaging your code.
Kafka Streams lets you choose your own deployment strategy when you need one, using whatever tool you prefer: Kubernetes, Mesos, Nomad, Docker Swarm, and the like.
Kafka Streams is designed primarily so that you can organize stream processing inside your own application, without the operational complexity of supporting yet another cluster. The only potential drawback of this tool is its tight coupling to Kafka; however, given that stream processing today is mostly performed with Kafka anyway, this small flaw is not so terrible.
When should I use Kafka?
As mentioned above, Kafka allows you to pass a huge number of messages through a centralized medium and store them, without worrying about performance or fearing that the data will be lost.
Thus, Kafka fits perfectly at the very center of your system, working as a link that lets all your applications interact. Kafka can be the central element of an event-driven architecture, allowing you to properly decouple applications from one another.
Kafka makes it easy to separate (micro)services from one another. With the Streams API, it is now easier than ever to write business logic that enriches data from a Kafka topic before services consume it. The possibilities here are enormous, so I highly recommend looking at how Kafka is used in different companies.
Apache Kafka is a distributed streaming platform capable of processing trillions of events a day. Kafka guarantees low latency and high throughput, and provides fault-tolerant publish/subscribe pipelines with the ability to process event streams.
In this article we got acquainted with Kafka's basic semantics (what a producer, broker, consumer, and topic are), learned about some of its optimizations (the page cache), learned what fault tolerance Kafka guarantees through data replication, and briefly discussed its powerful streaming capabilities.