Getting Started with GCP Dataform: Streamlining Your Data Workflow

Yuvraj Singh · Published in The Cloudside View · Aug 18, 2023

In the realm of data engineering, the shift from ETL (Extract, Transform, Load) pipelines to ELT (Extract, Load, Transform) pipelines is gaining momentum. Why? Because with the advent of MPP data warehouses like BigQuery and Redshift, you no longer need to transform data before it lands in the warehouse. These platforms let you ingest data with minimal effort, and using the familiar SQL interface, one can transform data to suit each specific use case.

However, the allure of pure SQL-driven transformations can sometimes lead to a tangled web of views, tables, functions, and scripts. This is precisely the problem that tools like dbt solve. In the GCP ecosystem, we have Dataform, which is similar to dbt: a tool designed to streamline and organize your data transformation code while adhering to software engineering best practices. In this blog, we’ll introduce you to Dataform, walk you through its benefits, and guide you through setting up a sample project, all in a beginner-friendly manner.

Meet Dataform: What’s It All About?

At its core, Dataform serves as a bridge between code and data transformation, creating a harmonious environment for both engineers and analysts. It empowers you to craft, test, and maintain data pipelines using SQL, all while integrating key software development practices like version control and automation.

Benefits of Using Dataform

  1. Modular Transformations: Dataform’s modular approach ensures consistency and reusability across projects, simplifying maintenance and development.
  2. Version Control and Collaboration: By treating transformations as code, Dataform enables collaboration through Git. You can effortlessly manage changes, track history, and enhance teamwork.
  3. Testing at Scale: Dataform offers built-in testing capabilities, letting you validate transformations before deployment and reducing the risk of errors (see the assertion sketch after this list).
  4. Effortless Deployment: Deploying transformations is a breeze with Dataform, automating the process and saving time.
  5. Documentation and Clarity: Document your transformations directly within the code, making it easier for team members to comprehend the logic.
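
To make point 3 concrete, here is a minimal sketch of Dataform’s built-in assertions attached to a model’s config block. The table and column names here are purely illustrative, not part of this project:

config {
  type: "table",
  schema: "transformed",
  assertions: {
    nonNull: ["order_id"], // fail the run if order_id is ever null
    rowConditions: ["revenue >= 0"] // fail if any row has negative revenue
  }
}
select order_id, revenue from ${ref("sales_table")}

When the model runs, Dataform generates and executes these checks as separate assertion queries, so bad data fails loudly instead of silently flowing downstream.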

Your First Dataform Project: A Step-by-Step Guide

There are two ways to create a project in Dataform:
1. Using the Command Line: Dataform offers a CLI for project creation; use dataform init <project_name> to set up the project directories and config files from the terminal.

2. Using the Web Interface: Dataform’s web interface enables project creation and management through a visual dashboard. Users can create projects, manage tasks, and define SQL-based data transformations without leaving the browser.

Method 1: (Command line)

Step 1: Installation and Project Initialization

  1. Install Dataform CLI: Start by installing Dataform’s command-line tool using npm.
npm install -g @dataform/cli

2. Initialize a New Project: Create a new Dataform project, e.g., ‘sample_project’.

dataform init sample_project --default-database YOUR_PROJECT_ID --include-schedules
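
If the init succeeds, the CLI scaffolds a project layout roughly like the one below (exact files can vary slightly between CLI versions):

sample_project/
├── definitions/     # your .sqlx source declarations and transformations
├── includes/        # reusable JavaScript constants and functions
├── dataform.json    # project config: default database, schema, location
└── package.json     # pins the @dataform/core dependency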

Step 2: Connect to BigQuery and Set Up Authentication

  1. Create Service Account: Generate a service account on the GCP Console with the BigQuery Admin role and download the JSON key file.
  2. Initialize Credentials: Navigate to your project directory and initialize BigQuery credentials.
cd sample_project
dataform init-creds bigquery
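
The init-creds command walks you through an interactive prompt and, when pointed at the JSON key you downloaded, writes a .df-credentials.json file in the project root. Since this file contains your service account key, make sure it stays out of version control; the generated .gitignore normally covers it, but it is worth double-checking.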

Step 3: Define Your Sources and Transformations

Define Source: Create a source definition in definitions/sources/sales_data.sqlx. Describe the source's details, like database, schema, and table name.

config {
  type: "declaration",
  database: "your-database",
  schema: "your-schema", // dataset name
  name: "sales_table", // table name
  description: "Sales data source"
}
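
Once declared, this source can be referenced from any other SQLX file with ${ref("sales_table")}, which lets Dataform track the dependency and resolve the fully qualified table name at compile time.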

Transformation: Create a transformation in definitions/transformations/sales_summary.sqlx that keeps only sales from the year 2022 with total revenue above 10,000.

config {
  type: "view",
  schema: "transformed",
  description: "Summary of sales data"
}
select * from (
  select
    product_name,
    extract(isoyear from date) as year,
    SUM(revenue) as total_revenue
  from
    ${ref("sales_table")}
  group by
    product_name, year
)
where year = 2022 and total_revenue > 10000
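
For intuition, when this file compiles, ${ref("sales_table")} is replaced with the fully qualified name from the declaration, so the generated view definition looks roughly like this (assuming the database and schema values above):

create or replace view `your-database.transformed.sales_summary` as
select * from (
  select
    product_name,
    extract(isoyear from date) as year,
    SUM(revenue) as total_revenue
  from `your-database.your-schema.sales_table`
  group by product_name, year
)
where year = 2022 and total_revenue > 10000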

Step 4: Compile, Validate, and Run

  1. Compile Project: Compile your Dataform project to check that the syntax is correct.
dataform compile
  2. Validate Dependencies: Do a dry run to inspect the dependency graph and understand the transformation flow.
dataform run --dry-run
  3. Run the Transformation: Execute your Dataform project to create the transformed view in BigQuery.
dataform run

Step 5: Web UI and Collaboration

  1. Link to GitHub: Fork a Dataform project template on GitHub and link your GitHub account to Dataform’s web UI.
  2. Import Project: Import your forked project to the Dataform web UI.

Step 6: Building Complex Pipelines

  1. Create a Report: Develop a pipeline to analyze sales data and generate insights. For example, aggregate sales data by product and date:

config {
  type: "view",
  schema: "reports",
  description: "Sales analysis by product and date"
}
select
  date,
  product_name,
  SUM(revenue) as total_revenue
from
  ${ref("sales_table")}
group by
  date, product_name
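
Pipelines grow by chaining models with ref(). As a hypothetical next step (the file name and limit here are illustrative), a second view could build on the summary view from Step 3, and Dataform would order both automatically in the dependency graph:

-- definitions/reports/top_products.sqlx (hypothetical)
config {
  type: "view",
  schema: "reports",
  description: "Top products drawn from the 2022 summary"
}
select
  product_name,
  total_revenue
from
  ${ref("sales_summary")} -- depends on the view defined earlier
order by
  total_revenue desc
limit 10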

Method 2: (UI Based)

Step 1: Getting Started with Dataform

Open your Google Cloud Console and navigate to Dataform.

Step 2: Setting Up Your Project

  1. Create a Repository: Begin by creating a repository with any name.
  2. Workspace Creation: Within your repository, establish a development workspace.

Step 3: Defining Sources and Transformations

  1. Sources: Navigate to the workspace and create a source folder under the definitions section. Inside the source folder, define your source files. These files typically reference your raw data sources, providing the foundation for your transformations.
  2. Transformations: Similarly, create a transformation folder under the definitions section. Here, you'll house your transformation scripts.

Step 4: Executing the Transformation

  1. Starting the Execution: When you trigger a run, you’ll notice an option to choose between a full refresh and an incremental update. For this initial run, select the full refresh option to ensure a complete transformation (see the incremental sketch below for how the two differ).
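
The distinction matters once you define incremental tables. In SQLX, an incremental model uses the when(incremental(), ...) helper so that subsequent runs only process new rows, while a full refresh rebuilds the table from scratch. A minimal sketch, assuming the sales source above has a date column:

config {
  type: "incremental",
  schema: "transformed",
  description: "Sales rows appended incrementally by date"
}
select
  date,
  product_name,
  revenue
from
  ${ref("sales_table")}
-- on incremental runs, only pick up rows newer than what the table already holds
${when(incremental(), `where date > (select max(date) from ${self()})`)}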

Step 5: Completion and Commit

Once the execution is complete and you’re satisfied with the results, commit your changes.
This step makes it easier to collaborate with your team and maintain version control.

Conclusion

With GCP Dataform, the complexities of managing data transformations are simplified. Its modular design, collaboration features, and integration with BigQuery make it a game-changer for modern data engineering. By following this beginner-friendly guide, you’re now equipped to embark on your Dataform journey. Embrace the power of streamlined data transformation and unlock insights with ease!
