How Apache Kafka tames the data stream

The challenges of ‘data in motion’ were addressed in a round table event moderated by Arthur Goldstuck. He gives an overview of the discussion.

The digital world is defined by the data “stream.” But there is a serious challenge facing businesses as they get to grips with their data.

In the world of consumers, streaming means content and video conferences. In businesses, it’s about moving massive amounts of data. While the former is sorted out by a decent broadband connection, the latter demands a vast ecosystem of physical and digital parts, and many companies are falling short.

This message was brought home powerfully during a recent round table discussion in Cape Town, hosted by software company Synthesis and solutions provider Confluent, on the topic of “event streaming.” The term refers to the ongoing delivery of vast amounts of data “events,” defined as data points emerging from any system that is continually updating and therefore continually creating data. The challenge for large businesses is that, when that data scales up rapidly, they cannot manage this “data in motion.”

Confluent enables event streaming using a technology called Apache Kafka®, first developed at LinkedIn to manage the network’s vast stream of messaging. It was open-sourced in 2011 and became a top-level Apache project in 2012. Its original developers founded Confluent to help companies scale up their event streaming, and found a ready clientele in organisations like Walmart, Intel, and Expedia. Synthesis, a Confluent partner in South Africa, is well known for its cloud management software. Local users include Sanlam, Vitality and BankServ.

Participants in the round table discussion included Sanlam, Vodacom, MTN, The Foschini Group, Capitec Bank, PPS, and Trackmatic.

Jack Bingham, Confluent’s Account Executive for Africa, neatly summed up the role of Apache Kafka: “Kafka was born out of LinkedIn for the reason that data just wasn’t moving in real time and you can’t reasonably act on top of that. How can you provide products and experiences and services that won’t fit the purpose in that exact moment?

“There’s a move away from legacy, slow, non-real time services. Through the Kafka project, we started to change the thinking about data at rest and data in motion, and Kafka became the event streaming data infrastructure for 70% of the Fortune 500. Kafka is at the core, but around that you need periphery services that help connect legacy systems to the modern cloud systems and enable data to be put into Kafka. Once data is available, you can reasonably act in real time to create new experiences, new services.

“A real time data infrastructure lets you build real time analytics and then make products that are reusable. And then, naturally, the next conversation in some organisations and industries is plugging in artificial intelligence and machine learning technology, knowing that the data you have is high quality, it’s accurate, it’s real time, and it’s available at huge scale. You take the challenge of legacy out of the equation, and you guarantee scale.”

It’s easier said than done, of course. Darren Bak, Head of Intelligent Data at Synthesis, told attendees that, while Kafka comes with natural resilience, “you do have to put in a lot of hard work to make it work.”

Craig Strachan, Business Intelligence Solutions Architect at Sanlam, concurred: “A lot of our systems are not cloud-based. Dynamics 365 runs in the Azure cloud, and other systems are running on Amazon. So we are starting to have data in different areas and big pieces of data. In theory, it’s just a new source. But in practice, it’s huge amounts of data, and there are a lot of costs in moving this data around.

“We’re trying to use tools like Kafka to help us figure out what we need to move around. Can it be processed locally and just bring the results back and make it easier to manage? That’s one of the challenges we are trying to work on. We haven’t solved it, but we’re getting the tooling in place.”

Gary Rhode, Data Architect at PPS, said that while the insurance company was still exploring Kafka, it recognised that “big data and big query are all big requirements for organisations.” The key, however, was having real-time sight of data events.

“There are different timelines for our events,” he said. “They are per-minute, per-hour, per-day, and we are unfortunately still stuck with a world without real-time, so our view of time-based events is not instant sometimes. We have learned from experience with the data that we have data that is two days in arrears, a day in arrears, two hours in arrears. The journey for us now is how to get that data into a more fluid state and a more real-time state to make sense.

“We are lifting certain conditions on things like compliance, to make that real-time. You don’t have to wait until tomorrow; we can actually see it per hour. So we are trying to get a legacy-based business into a real-time, per-moment business. But it’s complicated systems and complicated solutions being used at the moment.”

Other use cases expressed by participants in the round table ranged from managing microservices to simply “having fun” with data.

Daniel Croft, Confluent’s Technical Lead for South Africa, described how a major automobile manufacturer was using Kafka for real-time inventory management and real-time maintenance on cars.

Tjaard du Plessis, Head of Digital and Emerging Technology at Synthesis, gave the examples of personalisation in search engines and banking as a signpost to what was coming next:

“They can use all the search data and they can predict what you are going to do next, or what’s going to happen next. And then, at the right time, they can give you back something that you actually need. In the banks, they have the data, they have the machine learning models, and they can predict, if you were to do X, Y, and Z, this is probably your next move. And therefore they can provide a product that’s actually helpful to me as a customer because it’s something I’m going to need next.”

The significance is clear for data in motion: “The gap that Kafka can close in the future is being able to get those insights running at the right time, because the biggest problem with those data stores is that it’s stale. Or it’s obfuscated, maybe by a microservice, or an API, so it’s very hard to get to that data, run your model, and then act.

“The power of Kafka is, let’s turn that database inside out, let’s get all of the database and storage and wherever we’re storing data, onto a platform that doesn’t store data but stores events. As soon as you have events, you can run a certain number of events through your model at any time. Let’s say someone tweets. That’s a new event. And right at that time when that person tweets, you can immediately run your prediction and see what’s next. ‘Oh, the sentiment was negative, he’s maybe going to churn away from us’.

“When we’re in real time, all of those features that Kafka provides come together. It provides real-time intelligence. I’m really excited about AI coupled with fast-moving, real-time data. That’s what we will use in future.”
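
For readers who want to see the idea in miniature, the tweet example described above can be sketched in a few lines of Python. This is an illustration only, not Kafka or Confluent code: the EventLog class is a hypothetical in-memory stand-in for a Kafka topic, and the churn_risk function is a toy keyword check standing in for a real machine learning model. The point it shows is that a prediction runs the moment an event is appended, rather than later against a stale data store.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Event:
    """A single immutable fact, e.g. 'this user posted this text'."""
    user: str
    text: str


class EventLog:
    """Hypothetical append-only log standing in for a Kafka topic."""

    def __init__(self) -> None:
        self._events: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def append(self, event: Event) -> None:
        # Store the event, then push it to every subscriber immediately.
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self) -> Iterator[Event]:
        # Events are kept, so a new model can be run over history later.
        return iter(self._events)


# Toy stand-in for a sentiment model (an assumption, not a real model).
NEGATIVE_WORDS = {"terrible", "cancel", "worst", "slow"}


def churn_risk(event: Event) -> bool:
    words = {w.strip(".,!?").lower() for w in event.text.split()}
    return len(words & NEGATIVE_WORDS) > 0


flagged: List[str] = []


def on_event(event: Event) -> None:
    # Runs at the moment the event arrives.
    if churn_risk(event):
        flagged.append(event.user)


log = EventLog()
log.subscribe(on_event)
log.append(Event("alice", "Loving the new app!"))
log.append(Event("bob", "Worst service ever, I will cancel."))

print(flagged)  # ['bob']
```

In a production setting the log would be a Kafka topic and the handler a consumer or stream processor, but the shape is the same: events in, a decision out, with no waiting for a nightly batch.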