Airflow tasks should be designed like transactions in a database¹, such that executing them always produces the same results. This allows Airflow to safely retry a task one or more times in the event of failure (whether via an automated or a manual trigger). In other words, Airflow tasks that are designed to exploit Airflow's features (*Airflowic* tasks) must be idempotent.

To ensure task idempotence, Airflow gives three recommendations, briefly summarized as:

- Replace INSERT with UPSERT, to avoid duplicate rows.
- Avoid volatile functions when executing critical task computations.
- Never read the latest data, but instead read from a specific partition.

In this post I argue idempotence should be applied to all Airflow tasks, including tasks that execute dbt. To illustrate this, I describe a common incremental dbt model pattern and how to alter it to be idempotent.

Idempotent tasks have a number of benefits:

- Tasks can be retried as many times as we want, without risk of duplicate results.
- Airflow can safely catch up the task from whichever start_date we specify.
- During debugging or reviewing, we can more accurately establish the input and output of the task.

In particular, idempotent dbt tasks allow us to:

- Retry dbt tasks safely in the event of transient network errors.
- Progressively backfill incremental models, to avoid long-running queries that may run into database timeouts or use up too many resources. As of the time of writing, this functionality is not yet implemented in dbt itself, and there is an active open issue, signaling the relevancy of this problem.

## Ensuring idempotence when executing dbt tasks in Airflow

Let's go over Airflow's three recommendations to ensure idempotence, and briefly describe how they apply to dbt tasks.

If we break down dbt models by materialization, we can quickly verify that we are upholding the first recommendation (replacing INSERT with UPSERT):

- Ephemeral models are not built in the database, so we can ignore them, as no INSERT or any other statement is executed.
- View and table models are rebuilt every time we dbt run them, via CREATE VIEW AS and CREATE TABLE AS statements followed by the SELECT statement that makes up our model.
- Incremental models can be further broken down into two cases:
  - Full refreshes: runs with --full-refresh behave just like runs of table materialization models.
  - Incremental runs: these do execute INSERT statements, and are the only ones at risk of breaking idempotence by not following this particular recommendation. We can avoid this by correctly setting one or more unique_key values in the model configuration, in which case dbt will perform an UPSERT².

The second recommendation concerns volatile functions. Airflow specifically mentions Python's datetime.now() function from the standard library when making this point, as it returns a different result every time it is executed³. The behavior of this function is analogous to functions we can find in pretty much any database, like Postgres' datetime functions or Redshift's equivalents. This raises a problem when writing dbt models, as we need to find a replacement for datetime functions.

Picture the following example scenario: our company processes sales during the day, and after the work day ends we run a dbt table model to produce a report of the day's sales. Assuming our workday ends at 18:00 and no more sales are processed after that, we have until 23:59 to execute our table model, as this way all sales of the day will match CURRENT_DATE. This works fine until one day our database is brought down for critical maintenance work that finishes past 23:59.
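The fix the recommendations point at is to stop reading "the latest data" (CURRENT_DATE) and instead read a specific partition chosen by the orchestrator. A minimal sketch of what that could look like for an incremental model, assuming Airflow passes its data interval to dbt through --vars; the model, source, and column names here (daily_sales, shop.sales, sold_at) are hypothetical:

```sql
-- models/daily_sales.sql (hypothetical incremental model)
{{
  config(
    materialized='incremental',
    unique_key='sale_id'  -- lets dbt UPSERT instead of INSERT on incremental runs
  )
}}

select
    sale_id,
    sold_at,
    amount
from {{ source('shop', 'sales') }}
{% if is_incremental() %}
-- Volatile version (breaks idempotence): where sold_at::date = current_date
-- Idempotent version: read only the partition the orchestrator asked for
where sold_at >= '{{ var("data_interval_start") }}'
  and sold_at <  '{{ var("data_interval_end") }}'
{% endif %}
```

An Airflow task can then template its data interval into the dbt invocation, along the lines of `dbt run --select daily_sales --vars '{"data_interval_start": "...", "data_interval_end": "..."}'`, so that retries and backfills of a given run always reprocess exactly the same slice of data, regardless of when the command actually executes.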