PostgreSQL to BQ — Incremental materialization using dbt

Vishnuram Nadarajan · Published in The Cloudside View · 8 min read · Mar 18, 2024

In this blog post, we will start with a quick ingestion of PostgreSQL data into BigQuery using Airbyte, and then demonstrate how to use dbt to achieve incremental materialization. I assume that you already have Airbyte installed. Let’s get started.

Connect to PostgreSQL with Airbyte

Once we’ve successfully set up the source, the next step is to set up a BigQuery connection as the destination.

I have configured it to extract and load two tables, Employee and Branch, from Postgres to BigQuery every 24 hours using the Full refresh | Overwrite sync mode. You can adjust this based on your needs; finally, select Set up connection.

Normalization and Transformation

In Airbyte, you have the option to ingest raw data in JSON format and transform it into a normalized tabular format. This means that Airbyte can take raw JSON data and structure it into a tabular format with distinct rows and columns, optimizing it for storage and analysis in a relational database or data warehouse. This normalization process ensures that the data is organized efficiently and ready for use in analytics or other downstream processes.

After the completion of the first sync, two tables (branch and employee) from Postgres were loaded into BigQuery successfully.

Transformation with DBT

Data Build Tool (DBT) is an open-source data transformation tool that organizes and improves the quality of analytics workflows. It helps ensure data accuracy through testing and monitoring. With DBT, you can compile and run your analytics code directly on your data platform.

Materialization in DBT

In DBT, materialization refers to how the results of a DBT model are stored in the destination data warehouse or database. DBT offers several types of materialization options, each serving different purposes and use cases.

Here are the common types of materialization in DBT:

Table materialization is the default and most common option in DBT. When a model is materialized as a table, DBT creates a physical table in the destination data warehouse. The SQL query defined in the model is executed to generate the output, which is then saved in this table. Subsequent runs of the model will overwrite the existing table with the updated data.

Example usage: {{ config(materialized='table') }}

With view materialization, DBT creates a logical view in the destination database instead of a physical table. The SQL query defined in the model is executed whenever the view is queried, so the data is not physically stored in the view itself. This is useful when you want a virtual table over the underlying data models.

Example usage: {{ config(materialized='view') }}

Incremental materialization is used for models that process only the updated data or new data during each run, reducing processing time and resource consumption. This is suitable for scenarios where the data volume is large, and the transformation process can be optimized by processing only the incremental changes.

Example usage: {{ config(materialized='incremental') }}
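Inside the model body, dbt provides the is_incremental() macro and the {{ this }} variable so the query can filter down to only new or changed rows. A minimal sketch of the pattern (my_source_model and the updated_at column are illustrative names, not from this project):

{{ config(materialized='incremental') }}

select *
from {{ ref('my_source_model') }}

{% if is_incremental() %}
  -- on incremental runs, keep only rows newer than the latest row
  -- already in the target table; "this" resolves to that table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}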

Ephemeral materialization is an option where the data is not persisted in the destination database or data warehouse at all. Instead, the model’s SQL is interpolated as a common table expression (CTE) into the downstream models that reference it, so no standalone object is created in the warehouse.

Example usage: {{ config(materialized='ephemeral') }}

Installing and Configuring DBT: install and set up DBT, then define your DBT project.

Connect DBT with your BigQuery dataset to transform the loaded data efficiently.

To create a new project, use the “dbt init” command followed by the desired project name. For instance, you can initiate a project named “test” by executing “dbt init test”.
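For example (dbt debug is an optional check that the generated profile can actually connect to your warehouse):

dbt init test
cd test
dbt debug   # optional: verifies the profile and BigQuery connection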

Select the database you wish to work with by entering the corresponding number from the prompt, such as choosing BigQuery, as mentioned earlier.

After selecting the database, you will be prompted to choose the authentication method for your selected data platform. For example, if you are using a GCP service account, provide the “key_file.json” of your project, along with the project ID and dataset you are working with.

Next, specify the number of threads and job execution timeout for optimal performance. Additionally, you can select the location from the available options. By default, there are only two locations available, so choose one of them, and remember that you can update it later in the “.dbt/profiles.yml” file.

The .dbt/profiles.yml file is created automatically in the home directory of the user when a DBT project is initialized. This file contains the connection details for the dbt project.
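A typical BigQuery profile looks roughly like this; the project ID, dataset, and key file path below are placeholders for your own values:

test:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/key_file.json   # path to your GCP service account key
      project: your-gcp-project-id      # placeholder
      dataset: your_dataset             # placeholder
      threads: 4
      job_execution_timeout_seconds: 300
      location: US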

Once the DBT project is initiated, default folders are created.

Models in DBT are SQL-based artefacts that define the logic for transforming data. They enable modularity, dependency management, incremental processing, and materialization into tables or views. Models are the building blocks of data pipelines, promoting efficiency, reusability, and data quality testing.

Best practices for structuring dbt models

Creating a well-designed data warehouse structure in DBT involves defining guidelines for organising, naming, and documenting data models, ensuring understandability and maintainability as data scales. This consistent structure fosters collaboration and contributes to project quality, reliability, and maintainability.

In DBT, models are commonly organized into three layers:

  • Staging
  • Intermediate
  • Mart

Inside models/example, two default models, my_first_dbt_model.sql and my_second_dbt_model.sql, were created. The schema.yml file is used to document and configure the models: it lets you describe models and columns, define tests such as unique and not_null, and set options like materialization.
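For instance, a minimal schema.yml for the mart model built later in this post might look like this (the description and tests are illustrative):

version: 2

models:
  - name: employee_details
    description: "Employee records joined with branch details"  # illustrative
    columns:
      - name: id
        description: "Unique employee id"
        tests:
          - unique
          - not_null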

To hold the transformation queries, three folders were created: staging, intermediate, and mart.

Staging models serve as building blocks for data models, performing basic computations and renaming without joins or aggregations. They are not the final output.

I’m creating two SQL files, “employee.sql” and “branch.sql”, inside the staging folder, with ephemeral materialization.
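A minimal sketch of employee.sql; the source() reference assumes the Airbyte-loaded tables are declared as a source in a .yml file, and the column names are illustrative:

-- models/staging/employee.sql
{{ config(materialized='ephemeral') }}

select
    id,
    name as employee_name,  -- light renaming only; no joins or aggregations here
    branch_id
from {{ source('airbyte_postgres', 'employee') }}  -- 'airbyte_postgres' is an assumed source name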

Intermediate models bring together multiple models to transform and clean data, including derived fields, aggregates, joins, and business rules such as data quality checks.

I’m creating a SQL file named “join.sql”, in which the employee and branch staging models are joined, again with ephemeral materialization.
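A sketch of join.sql, assuming the staging models expose the columns below:

-- models/intermediate/join.sql
{{ config(materialized='ephemeral') }}

select
    e.id,
    e.employee_name,
    b.branch_name  -- assumed column from the branch staging model
from {{ ref('employee') }} as e
join {{ ref('branch') }} as b
    on e.branch_id = b.id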

Mart models represent fine-grained business entities, consumed by end-users and BI tools for analysis. They store transformed and enriched data, providing a clean, reliable, and optimized source of information.

I’m creating a SQL file, “employee_details.sql”, with incremental materialization.

When you define a model with materialized='incremental', a unique_key should also be defined, because dbt uses the unique_key to determine which rows in the source data have changed since the last run. If unique_key is not given, the selected data will simply be appended again on the next run, resulting in duplicated rows.
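Putting it together, a sketch of how employee_details.sql might look; with unique_key='id', dbt updates existing rows by id and inserts new ones, rather than appending duplicates:

-- models/mart/employee_details.sql
-- a minimal sketch; the actual model may add an is_incremental() filter
{{ config(materialized='incremental', unique_key='id') }}

select *
from {{ ref('join') }}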

These staging, intermediate, and mart layers are crucial components of a well-structured dbt model, enabling a robust data warehouse design for efficient data processing and insightful analytics.
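With the three layers in place, you can build and test everything locally before handing the project over to Airbyte:

dbt run                            # build all models
dbt run --select employee_details  # build just the mart model
dbt test                           # run any tests defined in schema.yml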

Collaboration and Version Control with GitHub

Collaboration and version control are essential for maintaining the integrity and consistency of your data analytics projects. GitHub, a widely used version control platform, is a perfect fit for integrating with Airbyte and DBT.

Install Git on your local machine, log in to GitHub, and create a repository.

Connecting DBT with GitHub: Configure DBT to sync your transformations with the GitHub repository. This way, any changes made to the DBT models will be tracked and versioned.

Once the repository is created, follow these steps to initiate Git and to host your DBT project and related files.

git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin <your-github-repository-url>
git push -u origin main

After connecting DBT with GitHub, you can see that the models created in DBT are now in the GitHub repository.

Under the repository’s Code button, a web URL is available; this URL is used to connect the project to Airbyte for synchronization.

Enable Transformation in Airbyte

To enhance the connection established earlier, we will now incorporate transformation capabilities using DBT. This involves integrating the DBT GitHub link into Airbyte, allowing seamless data transformations and enriching the data pipeline for more advanced analytics.

In Custom Transformation, select Add transformation. Paste the GitHub URL (the Git repository URL of the custom transformation project) and select Save transformation.

Output

Once the sync is enabled in Airbyte, data from PostgreSQL is extracted and loaded into BigQuery, with transformations applied and stored as an incremental table “employee_details”.

Source tables (employee and branch) in PostgreSQL:

BigQuery transformed table (employee_details):

In the above case, the source table and the transformed table have the same 8 rows. Now I’m updating one row and adding one row in the source Postgres table.

Now the Airbyte sync is triggered manually. The incremental logic is based on the id column in employee_details, ensuring that only new and updated data from the source table is processed and transformed into the BigQuery table “employee_details”.

It is observed that the newly added row and the updated row in the Postgres table were reflected in the BigQuery “employee_details” table. This approach significantly reduces processing time on large datasets, as only the relevant data is synchronized.
