Using dbt-expectations as part of a dbt build.
08 Dec 2024
This post gives a summary of the different types of data tests that can be applied to a data transformation, including those from dbt-expectations. The content is based on a dbt bootcamp course, with examples and explanations of what's being tested and how. The examples in this post are also available on GitHub:
Why data testing?
From my experience with data transformation projects (e.g. moving data from on-prem to the Azure cloud), I'm aware of the challenges of ensuring the quality of data taken from multiple sources into target tables, validating the transformations at each stage, and maintaining that quality continuously in a CI/CD delivery. This complexity makes manual testing onerous (especially as the transformations are likely to be part of an automated pipeline), and quality issues can erode stakeholders' confidence in the end data.
In the context of these data testing challenges, Great Expectations and its dbt-specific counterpart dbt-expectations are frameworks that enable automated tests to be embedded in data ingestion/transformation pipelines.
The following Udemy ‘bootcamp’ course was an excellent introduction to dbt and its test tools, and the screenshots and material in this post are based on this course:
The Complete dbt (Data Build Tool) Bootcamp:
The bootcamp covers the theory and practical application of a data project using Snowflake as the data warehouse and the open-source version of dbt. Particularly relevant for a tester are the sections covering dbt-expectations. This post explains what dbt-expectations can do, alongside practical examples of how it can be applied to a data transformation project.
What is dbt-expectations?
dbt-expectations is an open-source dbt package inspired by the Great Expectations Python framework, and it enables integrated tests in any data warehouse supported by dbt.
This allows us to extend dbt's built-in tests with the range of tests in the package. The examples below include built-in tests, dbt-expectations tests and custom SQL tests (effectively macros), all of which are declared in the project's schema.yml file.
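For reference, dbt-expectations is installed like any other dbt package, via the project's packages.yml. A minimal sketch (the version pin below is illustrative, not prescriptive):

```yaml
# packages.yml -- pick the version range that matches your dbt version;
# the pin below is illustrative.
packages:
  - package: calogica/dbt-expectations
    version: [">=0.10.0", "<0.11.0"]
```

Running dbt deps then pulls the package into the project, after which its tests can be referenced in schema.yml.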
Here is an explanation of what these example tests do, applied to the data transformation example in this project; a code sketch follows each group of tests:
Built-in dbt Tests:
- not_null: Ensures that the column doesn't contain null values.
- unique: Verifies that all values in the column are distinct.
- relationships: Checks if a foreign key relationship exists between two columns in different models.
- accepted_values: Ensures that the column only contains specific values from a predefined list.
- positive_value: Verifies that the column values are positive numbers (in this project a custom generic test, used in the same way as the built-in tests).
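As a sketch of how these tests are declared, here is a schema.yml fragment based on the bootcamp's Airbnb example (the model and column names are assumptions from that project; adjust to your own):

```yaml
# models/schema.yml -- sketch based on the bootcamp's Airbnb project;
# model/column names are taken from that example.
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
      - name: minimum_nights
        tests:
          - positive_value
```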
Built-in dbt-expectations Tests:
- dbt_expectations.expect_table_row_count_to_equal_other_table: Compares the row count of two tables.
- dbt_expectations.expect_column_values_to_be_of_type: Checks the data type of a column.
- dbt_expectations.expect_column_quantile_values_to_be_between: Verifies that quantile values fall within a specific range.
- dbt_expectations.expect_column_max_to_be_between: Ensures that the maximum value of a column is within a certain range.
Example dbt-expectations test:
To apply dbt-expectations tests, the code is added to the schema.yml file. In the example below it's used to check a column's type, expected values (including a quantile check that values in the table fall within an expected range) and a maximum value. We can also set whether a failing test raises a warning or an error.
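A sketch of such a schema.yml block, using the dbt-expectations tests listed above (the model name, column name and thresholds are illustrative values from the bootcamp example):

```yaml
# schema.yml -- dbt-expectations tests; names and thresholds are
# illustrative values from the bootcamp's Airbnb example.
models:
  - name: dim_listings_w_hosts
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: source('airbnb', 'listings')
    columns:
      - name: price
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: number
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.99
              min_value: 50
              max_value: 500
          - dbt_expectations.expect_column_max_to_be_between:
              max_value: 5000
              config:
                severity: warn   # report failures as a warning rather than an error
```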
Custom SQL Tests:
The third type of dbt test used in this project is a custom SQL test.
This simple custom test checks the dim_listings_cleansed table for any listings with a minimum stay of less than one night.
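A sketch of that singular test, placed as a SQL file in the tests/ directory; the test fails if the query returns any rows (the minimum_nights column name is an assumption from the bootcamp example):

```sql
-- tests/dim_listings_min_nights.sql -- singular test: fails if any rows
-- are returned, i.e. if any listing allows a stay of less than one night.
SELECT *
FROM {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights < 1
LIMIT 10
```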
Custom tests sit outside the dbt core and dbt-expectations tests and can extend test coverage to edge cases. They are also flexible, enabling ad hoc testing to investigate scenarios, or they can be part of the CI/CD pipeline. See an example of how we can trace the dim_listings_min_nights custom test on the data lineage graph in the lineage graph section below.
Debugging
For the basic commands on debugging, see About dbt debug command.
Running the dbt test --debug command will run all the SQL tests against the database connections, and the console logs all the test names and results. However, to dig into why a given test failed, it's possible to run the actual SQL test against the source table (e.g. in Snowflake for this project) and simplify the test code to find exactly where it failed, which is a good approach for a complex failure.
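For example, dbt writes the compiled SQL for each test under the target/compiled directory, so a failing test can be copied into a Snowflake worksheet and trimmed down until the offending rows appear. A sketch (the AIRBNB.DEV database/schema names are assumptions from the example project):

```sql
-- Debugging a failed test directly in Snowflake: run the test's compiled
-- SQL against the table to inspect the failing rows themselves.
SELECT *
FROM AIRBNB.DEV.DIM_LISTINGS_CLEANSED
WHERE MINIMUM_NIGHTS < 1;
```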
Lineage Graph (Data Flow DAG)
In the sections above we've looked at practical tests, including dbt-expectations tests, which can be embedded in the data transformation pipeline. These tests can also be shown on a really useful dbt feature, the 'lineage graph', alongside the source tables, dimension and fact tables etc., to show where in the flow each test runs and which table it relates to.
Provided the test in question is included in the schema.yml and has a description value, it will appear in the correct part of the data transformation flow.
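As a sketch, this is all that's needed in schema.yml for the room_type test (from the earlier example) to surface in the docs and lineage graph; the description wording is illustrative:

```yaml
# schema.yml fragment -- the description makes the column and its test
# visible in dbt docs and the lineage graph (wording is illustrative).
- name: room_type
  description: "The type of room offered by the listing"
  tests:
    - accepted_values:
        values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
```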
For example, the lineage graph below shows the flow of data in our data warehouse: we can see at a glance that dim_listings_cleansed is a cleansed dimension table based on the src_listings table.
By right-clicking and checking the documentation for dim_listings_cleansed, we can see all the tests in place for this stage of the transformation; for instance, we can tell that the room_type test checks the type of room, as per its description.
For reference, the test itself is a built-in test in the schema.yml, and while the schema clearly lists all the tests, it's great to be able to visualise exactly where this test sits in the data pipeline and which table(s) it references, and to click through to its description and code via the graph. In a data transformation with many sources and transformations, this tool would be invaluable.
Summary