Creating dbt models manually is tedious work that can consume hours of a data engineer's time, and when no significant business transformations are involved, it is hardly the most engaging part of working with data.
But what if you could automate this entire process? In this walkthrough, I'll show you exactly how GitLab Duo Agent Platform can generate comprehensive dbt models in just minutes, complete with proper structure, tests, and documentation.
What we're building
Our marketing team wants to manage and optimize advertising investments effectively. One of those advertising platforms is Reddit, so we extract data from the Reddit Ads API into Snowflake, our enterprise data platform. At GitLab, we have three layers of storage:
- raw layer - first landing point for unprocessed data from external sources; not ready for business use
- prep layer - first transformation layer with source models; still not ready for general business use
- prod layer - final transformed data ready for business use and Tableau reporting
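To make the three layers concrete, here is how a single table travels through them. The raw location matches the generated source configuration shown later in this walkthrough; the prep and prod object names are illustrative assumptions about where dbt materializes the models:

```sql
-- Raw layer: unprocessed landing table written by Fivetran
SELECT * FROM raw.reddit_ads.time_zone;

-- Prep layer: deduplicated, renamed source model built by dbt
-- (schema/database placement is an assumption)
SELECT * FROM prep.reddit_ads.reddit_ads_time_zone_source;

-- Prod layer: workspace model exposed for business use and Tableau
-- (object name is an assumption)
SELECT * FROM prod.workspace_marketing.reddit_ads_time_zone;
```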

For this walkthrough, our extraction tool, Fivetran, has already landed the data in the raw layer, and we'll generate dbt models that move it through the prep layer to the prod layer.
By the end of the walkthrough, without writing a single line of dbt code ourselves, we will have:
- Source models in the prep layer
- Workspace models in the prod layer
- Complete dbt configurations for all 13 tables (112 columns in total) in the Reddit Ads dataset
- Test queries to validate the outcomes
The entire process will take less than 10 minutes, compared to the hours it would typically require manually. Here are the steps to follow:
1. Prepare the data structure
Before GitLab Duo can generate our models, it needs to understand the complete table structure. Since we are still investigating how to connect GitLab Duo to our Snowflake instance via the Model Context Protocol (MCP), for now the key is running a query against Snowflake's information schema ourselves:
```sql
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable,
  CASE WHEN is_nullable = 'NO' THEN 'PRIMARY_KEY' ELSE NULL END AS key_type
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
ORDER BY table_name, ordinal_position;
```

This query captures:
- All table and column names
- Data types for proper model structure
- Nullable constraints
- Primary key identification (non-nullable columns in this dataset)
Pro tip: In the Reddit Ads dataset, all non-nullable columns serve as primary keys, a pattern I validated by checking tables like ad_group, which has two non-nullable columns (account_id and id) that are both marked as primary keys. Running this query returned 112 rows of metadata, which I exported as a CSV file for model generation. While this manual step works well today, we're investigating a direct GitLab Duo integration with our Data Platform via MCP to automate it entirely.
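If you want to verify that pattern yourself, a quick spot check against the same information schema (shown here for ad_group) is enough:

```sql
-- List the non-nullable columns for one table; in this dataset they
-- are exactly the primary-key columns (account_id and id for AD_GROUP).
SELECT column_name
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
  AND table_name = 'AD_GROUP'
  AND is_nullable = 'NO'
ORDER BY ordinal_position;
```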
2. Set up GitLab Duo
There are two ways to interact with GitLab Duo:
- Web UI chat function
- Visual Studio Code plugin
I chose the VS Code plugin because I can run the dbt models locally to test them.
3. Enter the 'magic' prompt
Here's the exact prompt I used to generate all the dbt code:
```
Create dbt models for all the tables in the file structure.csv. I want to have the source models created, with a filter that dedupes the data based on the primary key. Create these in a new folder reddit_ads. I want to have workspace models created and store these in the workspace_marketing schema. Take this MR as example: [reference to a previous source implementation]. Here is the same done for Source A, but now it needs to be done for Reddit Ads. Please check the dbt style guide when creating the code: https://handbook.gitlab.com/handbook/enterprise-data/platform/dbt-guide/
```

Key elements that made this prompt effective:
- Clear specifications for both source and workspace models.
- Reference example from a previous similar merge request.
- Style guide reference to ensure code quality and consistency.
- Specific schema targeting for proper organization.
4. GitLab Duo's process
After submitting the prompt, GitLab Duo got to work. The entire generation process took a few minutes, during which GitLab Duo:
- Read and analyzed the CSV input file.
- Examined table structures from the metadata.
- Referenced our dbt style guide for coding standards.
- Took the referenced merge request into account to structure the models properly.
- Generated source models for all 13 tables.
- Created workspace models for all 13 tables.
- Generated supporting dbt files:
  - sources.yml configuration.
  - schema.yml files with tests and documentation.
- Updated dbt_project.yml with schema references.
The results
The output was remarkable:
- 1 modified file: dbt_project.yml (added reddit_ads schema configuration)
- 29 new files:
  - 26 dbt models (13 source + 13 workspace)
  - 3 YAML files
- Nearly 900 lines of code generated automatically
- Built-in data tests, including unique constraints on primary key columns
- Generic descriptions for all models and columns
- Proper deduplication logic in source models
- Clean, consistent code structure following the GitLab dbt style guide
Sample generated code
Here's an example of the generated code quality. For the time_zone table, GitLab Duo created:
Prep layer source model

```sql
WITH source AS (

  SELECT *
  FROM {{ source('reddit_ads','time_zone') }}
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _fivetran_synced DESC) = 1

), renamed AS (

  SELECT
    id::VARCHAR                 AS time_zone_id,
    code::VARCHAR               AS time_zone_code,
    dst_offset::NUMBER          AS time_zone_dst_offset,
    is_dst_active::BOOLEAN      AS is_time_zone_dst_active,
    name::VARCHAR               AS time_zone_name,
    offset::NUMBER              AS time_zone_offset,
    _fivetran_synced::TIMESTAMP AS fivetran_synced_at
  FROM source

)

SELECT *
FROM renamed
```

schema.yml

```yaml
models:
  - name: reddit_ads_time_zone_source
    description: Time zone data from Reddit Ads system
    columns:
      - name: time_zone_id
        description: Unique identifier for time zone records
        data_tests:
          - unique
          - not_null
      - name: time_zone_code
        description: Code for the time zone
      - name: time_zone_dst_offset
        description: Daylight saving time offset for the time zone
      - name: is_time_zone_dst_active
        description: Flag indicating if daylight saving time is active
      - name: time_zone_name
        description: Name of the time zone
      - name: time_zone_offset
        description: Offset for the time zone
      - name: fivetran_synced_at
        description: Timestamp when the record was last synced by Fivetran
```

sources.yml

```yaml
sources:
  - name: reddit_ads
    database: RAW
    schema: reddit_ads
    loaded_at_field: _fivetran_synced
    loader: fivetran
    description: Reddit Ads data
    quoting:
      database: true
      schema: false
      identifier: false
    tables:
      - name: time_zone
```

Workspace model

```sql
WITH source AS (

  SELECT *
  FROM {{ ref('reddit_ads_time_zone_source') }}

)

SELECT *
FROM source
```

5. Quality validation
With the code looking good, I pushed it to the merge request and ran the CI test pipeline to validate the outcome. I then asked GitLab Duo to create a validation query:
```
Create a test query to test the row counts between the raw layer and the workspace layer. Keep in mind that we do deduplication, so we can compare both using distinct on the primary keys.
```

The AI generated a comprehensive validation query that:
- Compared row counts between raw and workspace layers.
- Accounted for deduplication logic.
- Tested all 13 tables.
- Calculated data retention percentages.
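The full generated query isn't reproduced here, but a minimal per-table sketch conveys the idea. It assumes the workspace model is materialized as reddit_ads_time_zone in the workspace_marketing schema of a prod database; those object names are illustrative, not confirmed by the generated code:

```sql
-- Compare distinct primary-key counts in the raw layer against row
-- counts in the workspace layer, and compute the retention percentage
-- after deduplication. Repeat (or UNION ALL) this block per table.
WITH raw_counts AS (
  SELECT COUNT(DISTINCT id) AS raw_distinct_rows
  FROM raw.reddit_ads.time_zone
), workspace_counts AS (
  SELECT COUNT(*) AS workspace_rows
  FROM prod.workspace_marketing.reddit_ads_time_zone  -- illustrative name
)
SELECT
  'time_zone'                                          AS table_name,
  raw_distinct_rows,
  workspace_rows,
  raw_distinct_rows - workspace_rows                   AS row_count_diff,
  ROUND(100.0 * workspace_rows / raw_distinct_rows, 2) AS retention_pct
FROM raw_counts
CROSS JOIN workspace_counts;
```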

Running this query showed:
- Zero differences in row counts after deduplication
- 100% data retention across all tables
- All tests passed successfully
The bottom line: Massive time savings
- Traditional approach: 6-8 hours of manual coding, testing, and debugging
- GitLab Duo approach: 6-8 minutes of generation + review time
This represents a 60x improvement in developer efficiency (from 6-8 hours to 6-8 minutes), while maintaining high code quality.
Best practices for success
Based on this experience, here are key recommendations:
Prepare your metadata
- Extract complete table structures including data types and constraints.
- Identify primary keys and relationships upfront.
- Export clean, well-formatted CSV input files (see the sample sketched below).
Note: By connecting GitLab Duo via MCP to your (meta)data, you could exclude this manual step.
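For reference, a structure.csv produced by the information-schema query above would look roughly like this; the header matches the query's output columns, while the rows shown are illustrative:

```csv
TABLE_NAME,COLUMN_NAME,DATA_TYPE,IS_NULLABLE,KEY_TYPE
AD_GROUP,ACCOUNT_ID,TEXT,NO,PRIMARY_KEY
AD_GROUP,ID,TEXT,NO,PRIMARY_KEY
AD_GROUP,NAME,TEXT,YES,
TIME_ZONE,ID,TEXT,NO,PRIMARY_KEY
TIME_ZONE,NAME,TEXT,YES,
```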
Provide clear context
- Reference existing example MRs when possible.
- Specify your coding standards and style guides.
- Be explicit about folder structure and naming conventions.
Validate thoroughly
- Always create validation queries for data integrity.
- Test locally before merging.
- Run your CI/CD pipeline to catch any issues.
Leverage AI for follow-up tasks
- Generate test queries automatically.
- Create documentation templates.
- Build validation scripts.
What's next
This demonstration shows how AI-powered development tools like GitLab Duo are also transforming data engineering workflows. The ability to generate hundreds of lines of production-ready code in minutes — complete with tests, documentation, and proper structure — represents a fundamental shift in how we approach repetitive development tasks.
By leveraging AI to handle the repetitive aspects of dbt model creation, data engineers can focus on higher-value activities like data modeling strategy, performance optimization, and business logic implementation.
Ready to try this yourself? Start with a small dataset, prepare your metadata carefully, and watch as GitLab Duo transforms hours of work into minutes of automated generation.
Trial GitLab Duo Agent Platform today.