Apache Kafka is a distributed streaming platform.
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications.
- Building real-time streaming applications that transform or react to the streams of data.
First, a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
1 Apache Kafka Producer API
Kafka producers send records to topics; the records are sometimes referred to as messages. For each topic, the producer picks which partition to send a record to. The producer can distribute records round-robin across partitions, or it could implement a priority scheme by routing records to particular partitions based on the priority of the record.
Generally speaking, producers send records to a partition based on the record’s key. The default partitioner in the Java client uses a hash of the record’s key to choose the partition, and falls back to a round-robin strategy if the record has no key.
The important concept here is that the producer picks the partition.
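To illustrate the idea, here is a minimal sketch of key-based partition selection. It is a simplification and an assumption for illustration only: the actual default partitioner in the Java client hashes the serialized key with murmur2, but the principle is the same.

import java.util.Arrays;

public class PartitionSketch {
    // Illustrative only: the real default partitioner uses a murmur2 hash of the
    // serialized key rather than Arrays.hashCode.
    static int choosePartition(byte[] keyBytes, int numPartitions, int roundRobinCounter) {
        if (keyBytes == null) {
            // No key: spread records round-robin across the partitions.
            return roundRobinCounter % numPartitions;
        }
        // Keyed record: the same key always maps to the same partition.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }
}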
To create a Kafka producer, you pass it a list of bootstrap servers (a list of Kafka brokers). You also specify a client.id that uniquely identifies this producer client. In this example, we are going to send messages with ids. The message body is a string and is sent in the Kafka record’s value field, while the message id (a long) is sent as the record’s key. You therefore need to specify a key serializer and a value serializer, which Kafka uses to encode the message id as the record key and the message body as the record value.
Let's see a simple producer example:
package test.kafka;

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class SimpleProducer {

    private static Producer<Integer, String> producer;
    private final Properties properties = new Properties();

    public SimpleProducer() {
        properties.put("metadata.broker.list", "localhost:9092");
        properties.put("serializer.class", "kafka.serializer.StringEncoder");
        properties.put("request.required.acks", "1");
        producer = new Producer<>(new ProducerConfig(properties));
    }

    public static void main(String[] args) {
        new SimpleProducer();
        String topic = args[0];
        String msg = args[1];
        KeyedMessage<Integer, String> data = new KeyedMessage<>(topic, msg);
        producer.send(data);
        producer.close();
    }
}
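The example above uses the legacy Scala producer client. As a rough sketch of the configuration described earlier (bootstrap servers, a client.id, a long key and a string value), here is what the same idea looks like with the newer Java producer API; the topic name and message contents are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class LongKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // List of Kafka brokers used to bootstrap cluster metadata.
        props.put("bootstrap.servers", "localhost:9092");
        // Identifies this producer client in broker-side logs and metrics.
        props.put("client.id", "simple-producer");
        // The message id (long) is the record key; the message body (string) is the value.
        props.put("key.serializer", LongSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<Long, String> producer = new KafkaProducer<>(props)) {
            for (long id = 0; id < 10; id++) {
                producer.send(new ProducerRecord<>("my-topic", id, "message body " + id));
            }
        }
    }
}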
2 Apache Kafka Consumer API
You group consumers into a consumer group by use case or function. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group is streaming those same records to Hadoop. Consumer groups have names that distinguish them from other consumer groups.
A consumer group has a unique id. Each consumer group is a subscriber to one or more Kafka topics. Each consumer group maintains its offset per topic partition. If you need multiple subscribers, then you have multiple consumer groups. A record gets delivered to only one consumer in a consumer group.
Each consumer in a consumer group processes records, and no two consumers in the group receive the same record; in this way, the consumers in a consumer group load balance record processing.
2.1 Kafka Consumer Load Share
Kafka divides a topic’s partitions over the consumer instances within a consumer group. Each consumer in the consumer group is an exclusive consumer of a “fair share” of partitions. This is how Kafka does load balancing of consumers in a consumer group. Consumer membership within a consumer group is handled dynamically by the Kafka protocol. If a new consumer joins a consumer group, it gets a share of the partitions. If a consumer dies, its partitions are split among the remaining live consumers in the consumer group. This is how Kafka does failover of consumers in a consumer group.
2.2 Kafka Consumer Failover
Consumers notify the Kafka broker when they have successfully processed a record, which advances the committed offset.
If a consumer fails before committing the offset to the Kafka broker, a different consumer can continue from the last committed offset.
If a consumer fails after processing a record but before committing its offset to the broker, some Kafka records may be reprocessed. In this scenario Kafka provides at-least-once behavior, and you should make sure that handling the messages (record deliveries) is idempotent.
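As a minimal sketch of the consumer side, the following uses the Java consumer API with manual offset commits, which yields the at-least-once behavior described above; the topic name, group id, and String key/value types are assumptions for illustration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing the same group.id split the topic's partitions among themselves.
        props.put("group.id", "my-consumer-group");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Commit offsets manually, after processing, for at-least-once semantics.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // A crash before this commit causes the records above to be redelivered
                // to another consumer in the group and reprocessed.
                consumer.commitSync();
            }
        }
    }
}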
3 Apache Kafka Streams API
Kafka Streams is a Java library for building distributed stream processing applications using Apache Kafka.
Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics (or calls to external services, updates to databases, and so on). It lets you do this with concise code in a way that is distributed and fault-tolerant. Stream processing is a programming paradigm, closely related to data-flow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing.
There is a wealth of interesting work happening in the stream processing area, ranging from open source frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Samza to proprietary services such as Google’s DataFlow and AWS Lambda, so it is worth outlining how Kafka Streams is similar to and different from these systems.
The gap we see Kafka Streams filling is less the analytics-focused domain these frameworks target and more the building of core applications and microservices that process data streams. I’ll dive into this distinction in the next section and show how Kafka Streams simplifies this type of application.
Despite being a humble library, Kafka Streams directly addresses a lot of the hard problems in stream processing:
- Event-at-a-time processing (not microbatch) with millisecond latency
- Stateful processing including distributed joins and aggregations
- A convenient DSL
- Windowing with out-of-order data using a DataFlow-like model
- Distributed processing and fault-tolerance with fast failover
- Reprocessing capabilities so you can recalculate output when your code changes
- No-downtime rolling deployments
3.1 Word count example
public static void main(String[] args) throws Exception {
    Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
    streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

    final Serde<String> stringSerde = Serdes.String();
    final Serde<Long> longSerde = Serdes.Long();

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");

    KStream<String, Long> wordCounts = textLines
        .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
        .map((key, word) -> new KeyValue<>(word, word))
        // Required in Kafka 0.10.0 to re-partition the data because we re-keyed the stream in the `map` step.
        // Upcoming Kafka 0.10.1 does this automatically for you (no need for `through`).
        .through("RekeyedIntermediateTopic")
        .countByKey("Counts")
        .toStream();

    wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");

    KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
    streams.start();

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
4 Apache Kafka Connect API
Kafka Connect is a framework included in Apache Kafka that integrates Kafka with other systems. Its purpose is to make it easy to add new systems to your scalable and secure stream data pipelines.
To copy data between Kafka and another system, users instantiate Kafka Connectors for the systems they want to pull data from or push data to. Source Connectors import data from another system (e.g. a relational database into Kafka) and Sink Connectors export data (e.g. the contents of a Kafka topic to an HDFS file).
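As a minimal sketch, a source connector can be configured with a simple properties file. The example below assumes the FileStreamSource connector that ships with Apache Kafka; the file path and topic name are placeholders.

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# Source file to read and the Kafka topic to publish its lines to.
file=/tmp/source.txt
topic=connect-file-topic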
5 Kafka at Work
Suppose we are developing a massive multiplayer online game. In these games, players cooperate and compete with each other in a virtual world. Often players trade with each other, exchanging game items and money, so as game developers it is important to make sure players don’t cheat: Trades will be flagged if the trade amount is significantly larger than normal for the player and if the IP the player is logged in with is different than the IP used for the last 20 games. In addition to flagging trades in real-time, we also want to load the data to Apache Hadoop, where our data scientists can use it to train and test new algorithms.
For the real-time event flagging, it will be best if we can reach the decision quickly based on data that is cached on the game server memory, at least for our most active players. Our system has multiple game servers and the data set that includes the last 20 logins and last 20 trades for each player can fit in the memory we have, if we partition it between our game servers.
Our game servers have to perform two distinct roles: the first is to accept and propagate user actions, and the second is to process trade information in real time and flag suspicious events. To perform the second role effectively, we want the whole history of trade events for each user to reside in the memory of a single server. This means we have to pass messages between the servers, since the server that accepts the user action may not have that user’s trade history. To keep the roles loosely coupled, we use Kafka to pass messages between the servers, as you’ll see below.
Kafka has several features that make it a good fit for our requirements: scalability, data partitioning, low latency, and the ability to handle a large number of diverse consumers. We have configured Kafka with a single topic for logins and trades. The reason we need a single topic is to make sure that trades arrive in our system after we already have information about the login (so we can check that the gamer logged in from their usual IP). Kafka maintains order within a topic, but not between topics.
When a user logs in or makes a trade, the accepting server immediately sends the event into Kafka. We send messages with the user id as the key and the event as the value. This guarantees that all trades and logins from the same user arrive in the same Kafka partition. Each event-processing server runs a Kafka consumer, each of which is configured to be part of the same group; this way, each server reads data from a few Kafka partitions, and all the data about a particular user arrives at the same event-processing server (which can be different from the accepting server). When the event-processing server reads a user trade from Kafka, it adds the event to the user’s event history cached in local memory. It can then access the user’s event history from the local cache and flag suspicious events without additional network or disk overhead.
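A sketch of the accepting server’s publish step is shown below. The topic name "events", the String-serialized event payload, and the EventPublisher wrapper are illustrative assumptions, not part of the original design.

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    private final Producer<String, String> producer;

    public EventPublisher(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void publish(String userId, String eventJson) {
        // Keying by user id means all of this user's logins and trades land in the
        // same partition, and therefore on the same event-processing server.
        producer.send(new ProducerRecord<>("events", userId, eventJson));
    }
}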
It’s important to note that we create a partition per event-processing server, or per core on the event-processing servers for a multi-threaded approach. (Keep in mind that Kafka was mostly tested with fewer than 10,000 partitions for all the topics in the cluster in total, and therefore we do not attempt to create a partition per user.)
This may sound like a circuitous way to handle an event: Send it from the game server to Kafka, read it from another game server and only then process it. However, this design decouples the two roles and allows us to manage capacity for each role as required. In addition, the approach does not add significantly to the timeline as Kafka is designed for high throughput and low latency; even a small three-node cluster can process close to a million events per second with an average latency of 3ms.
When the server flags an event as suspicious, it sends the flagged event into a new Kafka topic—for example, Alerts—where alert servers and dashboards pick it up. Meanwhile, a separate process reads data from the Events and Alerts topics and writes them to Hadoop for further analysis.
Because Kafka does not track acknowledgements and messages per consumer, it can handle many thousands of consumers with very little performance impact. Kafka even handles batch consumers (processes that wake up once an hour to consume all new messages from a queue) without affecting system throughput or latency.
6 Additional Use Cases
As this simple example demonstrates, Kafka works well as a traditional message broker as well as a method of ingesting events into Hadoop.
Here are some other common uses for Kafka:
- Website activity tracking: The web application sends events such as page views and searches to Kafka, where they become available for real-time processing, dashboards, and offline analytics in Hadoop.
- Operational metrics: Alerting and reporting on operational metrics. One particularly fun example is having Kafka producers and consumers occasionally publish their message counts to a special Kafka topic; a service can be used to compare counts and alert if data loss occurs.
- Log aggregation: Kafka can be used across an organization to collect logs from multiple services and make them available in standard format to multiple consumers, including Hadoop and Apache Solr.
- Stream processing: A framework such as Spark Streaming reads data from a topic, processes it, and writes processed data to a new topic where it becomes available for users and applications. Kafka’s strong durability is also very useful in the context of stream processing.
7 Kafka and Its Alternatives
First, it is interesting to note that Kafka started out as a way to make data ingest into Hadoop easier. When there are multiple data sources and destinations involved, writing a separate data pipeline for each source and destination pairing quickly evolves into an unmaintainable mess. Kafka helped LinkedIn standardize its data pipelines and allowed getting data out of each system once and into each system once, significantly reducing pipeline complexity and cost of operation.
7.1 Kafka vs Flink
The table below lists the most important differences between the two systems.
| | Flink Program | Streams API in Kafka Program |
|---|---|---|
| Deployment | Flink is a cluster framework: the framework takes care of deploying the application, either in standalone Flink clusters or using YARN, Mesos, or containers (Docker, Kubernetes). | The Streams API is a library that any standard Java application can embed, and hence does not dictate a deployment method; applications can be deployed with essentially any deployment technology, including containers (Docker, Kubernetes), resource managers (Mesos, YARN), deployment automation (Puppet, Chef, Ansible), and custom in-house tools. |
| Life cycle | User’s stream processing code is deployed and run as a job in the Flink cluster. | User’s stream processing code runs inside their application. |
| Typically owned by | Data infrastructure or BI team | Line of business team that manages the respective application |
| Coordination | Flink master (JobManager), part of the streaming program | Leverages the Kafka cluster for coordination, load balancing, and fault tolerance |
| Source of continuous data | Kafka, file systems, other message queues | Strictly Kafka, with the Connect API in Kafka serving to address the data-into and data-out-of Kafka problem |
| Sink for results | Kafka, other MQs, file systems, analytical databases, key/value stores, stream processor state, and other external systems | Kafka, application state, operational databases, or any external system |
| Bounded and unbounded data streams | Unbounded and bounded | Unbounded |
| Semantic guarantees | Exactly once for internal Flink state; end-to-end exactly once with selected sources and sinks (e.g., Kafka to Flink to HDFS); at least once when Kafka is used as a sink; likely to be exactly once end-to-end with Kafka in the future | Exactly once end-to-end with Kafka |
The fundamental differences between a Flink program and a Streams API program lie in the way these are deployed and managed (which often has implications for who owns these applications from an organizational perspective) and in how the parallel processing (including fault tolerance) is coordinated. These are core differences; they are ingrained in the architecture of these two systems.
7.1.1 Deployment and Organizational Management
A Flink streaming program is modeled as an independent stream processing computation and is typically known as a job. The entire lifecycle of a Flink job is the responsibility of the Flink framework, be it deployment, fault tolerance, or upgrades. The resources used by a Flink job come from resource managers like YARN or Mesos, from pools of deployed Docker containers in existing clusters (e.g., a Hadoop cluster in the case of YARN), or from standalone Flink installations. Flink jobs can start and stop themselves, which is important for finite streaming jobs or batch jobs. From an ownership perspective, a Flink job is often the responsibility of the team that owns the cluster the framework runs on, often the data infrastructure, BI, or ETL team.
The Streams API in Kafka is a library that can be embedded inside any standard Java application. As such, the lifecycle of a Kafka Streams API application is the responsibility of the application developer or operator. The Streams API does not dictate how the application should be configured, monitored or deployed and seamlessly integrates with a company’s existing packaging, deployment, monitoring and operations tooling. From an ownership perspective, a Streams application is often the responsibility of the respective product teams.
Besides affecting the deployment model, running the stream processing computation embedded inside your application vs. as an independent process in a cluster touches issues like resource isolation or separation vs. unification of concerns. For instance, running a stream processing computation inside your application means that it uses the packaging and deployment model of the application itself. And running a stream processing computation on a central cluster means that you can allow it to be managed centrally and use the packaging and deployment model already offered by the cluster. Likewise, running a stream processing computation on a central cluster provides separation of concerns as the stream processing part of the application’s business logic lives separately from the rest of the application and the message transport layer (for example, this means that resources dedicated to stream processes are isolated from resources dedicated to Kafka). On the other hand, running a stream processing computation inside your application is convenient if you want to manage your entire application, along with the stream processing part, using a uniform set of operational tooling. Depending on the requirements of a specific application, one or the other approach may be more suitable.
Call out: Stream processing is used in a variety of places in an organization — from user-facing applications to running analytics on streaming data. The Streams API in Kafka and Flink are used in both capacities. The main distinction lies in where these applications live — as jobs in a central cluster (Flink), or inside microservices (Streams API).
7.1.2 Distributed Coordination and Fault Tolerance
The biggest difference between the two systems with respect to distributed coordination is that Flink has a dedicated master node for coordination, while the Streams API relies on the Kafka broker for distributed coordination and fault tolerance, via the Kafka’s consumer group protocol. While this sounds like a subtle difference at first, the implications are quite significant.
In Apache Flink, fault tolerance, scaling, and even distribution of state are globally coordinated by the dedicated master node. Flink’s master node implements its own high availability mechanism based on ZooKeeper. A failure of one node (or one operator) frequently triggers recovery actions in other operators as well (such as rolling back changes). This approach helps Flink achieve its high throughput with exactly-once guarantees, it enables Flink’s savepoint feature (for application snapshots and program and framework upgrades), and it powers Flink’s exactly-once sinks (e.g., HDFS and Cassandra, but not Kafka). Even for nondeterministic programs, Flink can thereby guarantee results that are equivalent to a valid failure-free execution. It is worth pointing out that, since Kafka does not yet provide an exactly-once producer, Flink used with Kafka as a sink does not provide end-to-end exactly-once guarantees.
The Streams API in Kafka provides fault tolerance, continuous processing, and high availability by leveraging core primitives in Kafka. Each shard or instance of the user’s application or microservice acts independently. All coordination is done by the Kafka brokers; the individual application instances simply receive callbacks to either pick up additional partitions (scale up) or relinquish partitions (scale down). Fault tolerance is built into the Kafka protocol; if an application instance dies or a new one is started, it automatically receives a new set of partitions from the brokers to manage and process. The application that embeds the Streams API program does not have to integrate with any special fault tolerance APIs or even be aware of the fault tolerance model. This allows for a very lightweight integration; any standard Java application can use the Streams API.
To summarize, while the global coordination model is powerful for streaming jobs in Flink, it works less well for standalone applications and microservices that need to do stream processing: the application would have to participate in Flink’s checkpointing (implement some APIs) and would need to participate in the recovery of other failed shards by rolling back certain state changes to maintain consistency. That is clearly not as lightweight as the Streams API approach. Again, both approaches show their strength in different scenarios.
7.2 Kafka vs Flume
There is significant overlap in the functions of Flume and Kafka. Here are some considerations when evaluating the two systems.
- Kafka is very much a general-purpose system. You can have many producers and many consumers sharing multiple topics. In contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and it integrates with Hadoop’s security. As a result, Cloudera recommends using Kafka if the data will be consumed by multiple applications, and Flume if the data is designated for Hadoop.
- Those of you familiar with Flume know that Flume has many built-in sources and sinks. Kafka, however, has a significantly smaller producer and consumer ecosystem, and it is not well supported by the Kafka community. Hopefully this situation will improve in the future, but for now: Use Kafka if you are prepared to code your own producers and consumers. Use Flume if the existing Flume sources and sinks match your requirements and you prefer a system that can be set up without any development.
- Flume can process data in-flight using interceptors. These can be very useful for data masking or filtering. Kafka requires an external stream processing system for that.
- Both Kafka and Flume are reliable systems that, with proper configuration, can guarantee zero data loss. However, Flume does not replicate events. As a result, even when using the reliable file channel, if a node hosting a Flume agent crashes, you will lose access to the events in the channel until you recover the disks. Use Kafka if you need an ingest pipeline with very high availability.
- Flume and Kafka can work quite well together. If your design requires streaming data from Kafka to Hadoop, using a Flume agent with Kafka source to read the data makes sense: You don’t have to implement your own consumer, you get all the benefits of Flume’s integration with HDFS and HBase, you have Cloudera Manager monitoring the consumer and you can even add an interceptor and do some stream processing on the way.
7.3 Kafka vs RabbitMQ
Kafka is a general-purpose message broker, like RabbitMQ, with similar distributed deployment goals, but with very different assumptions about message model semantics.
Use Kafka if you have a fire hose of events (20k+/sec per producer) you need delivered in partitioned order 'at least once' with a mix of online and batch consumers, but most importantly you’re OK with your consumers managing the state of your “cursor” on the Kafka topic.
Use RabbitMQ if you have messages (20k+/sec per queue) that need to be routed in complex ways to consumers, you want per-message delivery guarantees, you need one or more features of protocols like AMQP 0.9.1, 1.0, MQTT, or STOMP, and you want the broker to manage the state of which consumer has been delivered which message.