1 Apache Flink

Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.

Flink does not provide its own data storage system; instead, it supplies data source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and Elasticsearch.

1.1 Word count example

import java.util.Arrays;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public static void main(String[] args) throws Exception {

	// set up the execution environment
	final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

	// get input data
	DataStream<String> text = getTextDataStream(env);

	DataStream<Tuple2<String, Integer>> counts =
			// normalize and split each line
			text.map(line -> line.toLowerCase().split("\\W+"))
			// convert each token into a pair (2-tuple) of the form (word, 1)
			.flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
				// emit the pairs with non-zero-length words
				Arrays.stream(tokens)
				.filter(t -> t.length() > 0)
				.forEach(t -> out.collect(new Tuple2<>(t, 1)));
			})
			// declare the output type, which is erased from the lambda
			.returns(Types.TUPLE(Types.STRING, Types.INT))
			// group by the tuple field "0" and sum up tuple field "1"
			.keyBy(0)
			.sum(1);

	// emit result
	counts.print();

	// execute program
	env.execute("Streaming WordCount Example");
}
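
The per-record logic of the pipeline above (normalize, split, filter, count) can be mirrored in plain Java without a Flink cluster, which makes the transformation steps concrete. The class and method names below are illustrative, not part of Flink's API; Flink's `keyBy`/`sum` stages are approximated here by a single in-memory map.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountLocal {
	// Local mirror of the Flink dataflow: lowercase each line, split on
	// non-word characters, drop empty tokens, and count occurrences per word.
	static Map<String, Integer> wordCount(List<String> lines) {
		Map<String, Integer> counts = new LinkedHashMap<>();
		for (String line : lines) {
			Arrays.stream(line.toLowerCase().split("\\W+"))
				.filter(t -> t.length() > 0)
				.forEach(t -> counts.merge(t, 1, Integer::sum));
		}
		return counts;
	}

	public static void main(String[] args) {
		// Prints {to=2, be=2, or=1, not=1}
		System.out.println(wordCount(List.of("To be, or not to be")));
	}
}
```

The difference in a real Flink job is that the counting state is partitioned by key across the cluster and updated incrementally as records arrive, rather than held in a single local map.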