Streaming Data From Kafka to BigQuery: Streaming Millions of Daily Events

Dong Thanh
5 min read · Nov 24, 2020

A shared solution for rapidly streaming data from Kafka to Google BigQuery.

Introduction

Tiki has been growing extremely fast over the past few years, and its data keeps getting bigger and bigger. Therefore, the data warehouse team needs an efficient system to stream large volumes of data from data sources (e.g. MySQL, PostgreSQL, MongoDB, …) to BigQuery in near real time, while still guaranteeing acceptable latency, data accuracy, scalability, reliability and efficiency.

As a matter of fact, there are over 1,000 tables with millions of daily events across different data sources that need to be streamed to BigQuery (BQ), so how is this supposed to be done? In this post, I will discuss and share a solution for building such a data pipeline system.

Approach

There are two processes to stream data from a data source to BQ. First of all, we use Kafka Connect with the Debezium connectors, which stream changes from the databases into Kafka topics, one topic per table. After that, we stream the data from Kafka to BQ. Kafka acts as a message distributor for multiple consumers, each streaming the data for its own needs. One of our main goals is streaming data for analysis and as input for other systems (e.g. the Fraud System, …).
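As a rough illustration of the first process, a Debezium MySQL source connector can be registered through the Kafka Connect REST API. The connector name, hosts, credentials and table list below are placeholders, and the exact property names depend on the Debezium version in use:

```python
import json
import requests

# Hypothetical Debezium MySQL source connector, registered via the Kafka
# Connect REST API. All hosts, credentials and table names are placeholders;
# the property names follow the Debezium 1.x documentation.
connector = {
    "name": "orders-db-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "database.server.name": "tiki",          # topics become tiki.<db>.<table>
        "database.include.list": "orders_db",
        "table.include.list": "orders_db.orders",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.orders_db",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",      # placeholder Connect endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```

With this setup, every captured table gets its own change topic, which is exactly the "one topic per table" convention the rest of the pipeline relies on.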

In this post, we focus only on how to stream data from Kafka to BQ.

Old Solution

In order to get near real-time results, we considered the option of streaming data directly into BQ. The BQ APIs allow streaming inserts up to a quota of 100K rows per second per project.
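For reference, a streaming insert with the google-cloud-bigquery client looks roughly like the sketch below; the project, dataset and table names, as well as the metadata columns, are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder staging table; in our setup every Kafka topic has its own table.
table_id = "my-project.staging.orders"

rows = [
    {"id": 1001, "status": "created", "_offset": 42, "_op": "insert"},
    {"id": 1001, "status": "paid", "_offset": 43, "_op": "update"},
]

# insert_rows_json uses the streaming API, which is the call subject to the
# 100K rows/second/project quota mentioned above.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```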

There are 2 processes:

  • Consuming data from Kafka and inserting it into the Staging dataset in BQ. In general, a topic in Kafka maps to a table in BQ, and a message in the former becomes a row of the table in the latter. Besides the message itself, we have to add more information such as offset, timestamp, backfill date and operation type (insert, delete or update) before inserting it.
  • Merging data from the Staging dataset into the data warehouse in BQ. In this stage, the data is grouped by ID and sorted by offset; for each group, we take the first row and insert or update it in the data warehouse (see the sketch after this list).
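A minimal sketch of that merging step, assuming the staging tables carry _offset, _op and _backfill_date metadata columns (these names, the dataset names and the column list are hypothetical): it keeps one row per ID, the one with the highest offset, and applies it to the warehouse table with a BigQuery MERGE.

```python
from google.cloud import bigquery


def build_merge_query(project, table, id_col, columns, backfill_date=None):
    """Build a hypothetical MERGE statement: deduplicate staging rows per ID
    (latest offset wins) and insert/update/delete them in the warehouse."""
    set_clause = ", ".join(f"T.{c} = S.{c}" for c in columns if c != id_col)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"S.{c}" for c in columns)
    backfill_filter = f"WHERE _backfill_date = '{backfill_date}'" if backfill_date else ""
    return f"""
    MERGE `{project}.dwh.{table}` T
    USING (
      SELECT * EXCEPT(rn)
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY {id_col} ORDER BY _offset DESC) AS rn
        FROM `{project}.staging.{table}`
        {backfill_filter}
      )
      WHERE rn = 1
    ) S
    ON T.{id_col} = S.{id_col}
    WHEN MATCHED AND S._op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET {set_clause}
    WHEN NOT MATCHED AND S._op != 'delete' THEN INSERT ({insert_cols}) VALUES ({insert_vals})
    """


# Usage, with placeholder names:
query = build_merge_query("my-project", "orders", "id", ["id", "status", "updated_at"], "2020-11-24")
bigquery.Client().query(query).result()
```

Filtering on the backfill field keeps the query from scanning the full staging table on every run, which matters once the table count grows.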

Hence, a number of workers need to run asynchronously and continually in order to execute those two processes. However, to avoid two workers consuming the same topic, each topic is consumed by only one worker, and topics are assigned to the workers using a Round Robin algorithm.
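A tiny sketch of that assignment, assuming the topic and worker lists are known up front (both lists below are placeholders):

```python
from itertools import cycle


def assign_topics_round_robin(topics, workers):
    """Give every topic exactly one worker, so no two workers consume the
    same topic; topics are spread over the workers in round-robin order."""
    assignment = {w: [] for w in workers}
    for topic, worker in zip(topics, cycle(workers)):
        assignment[worker].append(topic)
    return assignment


# -> {'w1': ['tiki.db.orders', 'tiki.db.reviews'], 'w2': ['tiki.db.users']}
print(assign_topics_round_robin(
    ["tiki.db.orders", "tiki.db.users", "tiki.db.reviews"], ["w1", "w2"]))
```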

When there are not many tables, this solution is simple, appropriate and fast. But if we need to stream over 100 or 1,000 tables, the data latency becomes extremely high and affects other parts of the business.

New Solution

In order to deal with the above issue, we decided to write a new system (V2) that keeps the same concept but resolves the challenges around scalability and near real-time latency.

New Solution Architecture

There are three main components:

  • Consuming Worker (CW): Pops consuming tasks from the Queue, reads the information in each task (i.e. topic, partition, current offset, …), consumes data from Kafka, inserts it into the Staging dataset in BQ and updates the task's status in MySQL. The number of workers must be smaller than or equal to the number of partitions of the topics in Kafka: although we can scale this component almost without limit, the useful parallelism depends on the number of topics and their configured partitions. (A minimal sketch of a consuming task follows this list.)
  • Merging Worker (MW): Pops merging tasks from the Queue, reads the information in each task, builds a merging query that merges data from the Staging dataset into the data warehouse in BQ and updates the task's status in MySQL. Likewise, it also scales almost without limit, but should still be sized to the available BQ resources. Generally, the merging query runs and merges data by the backfill field (i.e. by day, by month, none, …), which helps avoid scanning full tables in BQ.
  • Producing Worker (PW): Creates consuming and merging tasks and pushes them to the period queues: each consuming task is identified by topic and partition, while each merging task is identified by table name and backfill field. In this case, we use the Hazelcast library, which provides distributed storage. The period queues order their elements by the number of lagging messages per partition, so topics with a large message lag always come high on the list of priorities.
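Here is a minimal sketch of how a single consuming task could be executed, using confluent-kafka and google-cloud-bigquery; the task fields (topic, partition, offset), the staging table naming and the metadata columns are assumptions of this sketch, not the exact implementation:

```python
import json

from confluent_kafka import Consumer, TopicPartition
from google.cloud import bigquery


def run_consuming_task(task, bootstrap_servers, batch_size=500):
    """Consume one batch for a (topic, partition) task starting at the stored
    offset, enrich each message with metadata and insert it into staging."""
    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "bq-streaming-cw",       # hypothetical group id
        "enable.auto.commit": False,         # offsets live in MySQL, not Kafka
    })
    # Resume exactly where the previous run of this task stopped.
    consumer.assign([TopicPartition(task["topic"], task["partition"], task["offset"])])

    rows, next_offset = [], task["offset"]
    while len(rows) < batch_size:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            break
        record = json.loads(msg.value())
        # Attach the extra metadata described above before inserting.
        record["_offset"] = msg.offset()
        record["_kafka_ts"] = msg.timestamp()[1]
        record["_op"] = record.get("op", "insert")
        rows.append(record)
        next_offset = msg.offset() + 1
    consumer.close()

    if rows:
        staging_table = "my-project.staging." + task["topic"].replace(".", "_")
        errors = bigquery.Client().insert_rows_json(staging_table, rows)
        if errors:
            raise RuntimeError(f"Insert failed: {errors}")
    return next_offset  # persisted back to MySQL as the task's new offset
```

Keeping auto-commit off matches the design above: the current offset is stored with the task in MySQL, and any duplicate rows that end up in staging are later removed by the merge step's group-by-ID, take-one-row-per-group logic.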

How to combine the components:

  • Step 1: PW reads tasks from MySQL; if a task does not exist yet, it generates the task and inserts it into both MySQL and the Queue. The worker runs again after an interval.
  • Step 2: CW and MW pop tasks from the Queues and execute them. Once a task is completed, they update its status and data in MySQL (a rough sketch of this loop follows).
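A rough sketch of this loop: the `db` and `queue` objects below stand in for the MySQL task store and the Hazelcast distributed queue, and every method name on them is hypothetical.

```python
import time


def producing_worker(db, queue, interval_seconds=10):
    """Step 1: create missing tasks and push them to the period queue."""
    while True:
        for topic, partition in db.list_topic_partitions():   # hypothetical API
            if not db.task_exists(topic, partition):
                task = {"topic": topic, "partition": partition,
                        "offset": 0, "status": "pending"}
                db.insert_task(task)   # persist to MySQL...
                queue.offer(task)      # ...and push to the distributed queue
        time.sleep(interval_seconds)   # run again after an interval


def consuming_worker(db, queue, execute):
    """Step 2: pop tasks, execute them (e.g. run_consuming_task above),
    then write the status and the new offset back to MySQL."""
    while True:
        task = queue.poll(timeout=5)   # hypothetical blocking pop
        if task is None:
            continue
        new_offset = execute(task)
        db.update_task(task, status="done", offset=new_offset)
```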

With this architecture, we are able to support streaming data from over 1,000 topics with a lag time under 30 seconds for regularly updated tables. The system also scales almost without limit, and data accuracy is guaranteed.

Conclusion

To build a project, we have to solve many problems. In this one, we faced the problems of the old solution, found their root causes and worked out a new solution that solves them and can keep up with the growth of our data in the future. For this reason, there are three important things this project has to guarantee:

  • Guaranteeing the accuracy of data: Messages are pushed to Kafka continually, so we must design an efficient way to consume data from Kafka and insert it into the Staging dataset in BQ as fast as possible, and the data there must match the messages in Kafka exactly (i.e. metadata, offset, number of messages). In addition, merging is just as important: after merging data from the Staging dataset into the data warehouse, the data in BQ and in the data source must be the same, because it is the input of other services and reports.
  • Guaranteeing scalability: Even though the number of topics increases incessantly, the service always keeps the lag time low, and it serves different data sources.
  • Guaranteeing the reliability of near real-time data: Whenever new messages are pushed to Kafka, the service consumes them instantly and inserts them into BQ.

The main contributors are:

  • Nguyễn Hoàng Nam — Senior Engineer
  • Thành Ngọc Đông — Software Engineer
