dlt is a Python library that simplifies extracting and loading data from a source (such as a database, REST API, or file system) to a destination (like a database or file system). It handles much of the “boring stuff”, like schema evolution, incremental loading, and logging. Another great selling point of dlt is that you can run it anywhere you can run a Python script. And here is where things get a bit wild.
One of the things I always loved about working in consulting is the variety, the opportunity to see how different people approach the same problems in different ways. It was a way to learn a lot and quickly. Nowadays, with a bit more experience in what works and what doesn’t, the challenge has shifted to moderation, negotiation, and diplomacy. Because there is still a lot to learn.
Back to dlt. Coming from more structured environments, it is easy to think that the only way to run dlt is to use a “real” scheduler. But not everyone is at that stage, and not everyone needs to be. And that is perfectly fine.
So, how can you actually run dlt? Here are some options.
Running dlt with a cron job
If you are reading this article, you probably know what the cron utility is and how to write a cron expression. Since dlt is a Python library, we can run a pipeline directly from the command line:
python3 run_pipeline.py
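For reference, here is a minimal sketch of what run_pipeline.py could contain (the API URL, the events resource, and the DuckDB destination are placeholders; swap in your own source and destination):

# run_pipeline.py – a minimal dlt pipeline sketch (placeholder source and destination)
import dlt
import requests

@dlt.resource(name="events", write_disposition="append")
def events():
    # Hypothetical endpoint; replace with your own source
    response = requests.get("https://api.example.com/events")
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",      # or any other supported destination
    dataset_name="raw_data",
)

if __name__ == "__main__":
    load_info = pipeline.run(events())
    print(load_info)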
So, we can easily schedule it with cron:
0 * * * * /usr/bin/python3 /path/to/run_pipeline.py >> /var/log/dlt.log 2>&1
This is not something I would recommend to anyone. First, the computer running cron needs to be up and running at all times. Second, you won’t receive any notifications if something goes wrong.
✅ Simple & lightweight – no extra infrastructure
✅ Easy setup – just add a cron job
❌ No monitoring – failures disappear like tears in the rain
❌ No retries – if it fails, it… fails 🤷♂️
❌ Runs only on the intern’s machine – they now live in the basement, whispering prayers to the uptime gods
Running dlt with GitHub Actions or GitLab CI
For teams already using GitHub or GitLab, running dlt through their built-in CI workflows provides a more structured way (compared to a cron job) to manage data ingestion and transformation. While the primary purpose of these tools is to build and deploy code changes, you can also use them to run arbitrary code on a schedule.
In GitHub Actions, it will look like this:
name: Run DLT Pipeline

on:
  schedule:
    - cron: "0 2 * * *"  # Run daily at 2 AM UTC

jobs:
  run-dlt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run DLT pipeline
        run: python run_pipeline.py
In GitLab CI, you first need to define a schedule, and then you can run dlt with the following configuration:
stages:
  - run_dlt

run-dlt:
  stage: run_dlt
  image: python:3.12
  script:
    - pip install -r requirements.txt
    - python run_pipeline.py
  only:
    - schedules
Better than a cron, but I am not a fan of this approach either. In my mind, a CI/CD tool should be used for CI/CD pipelines: while it can execute your code, it lacks the flexibility and features of a proper scheduler.
On the other hand, it is a quick and convenient way to set up a data loading process, and there are many teams out there doing just this. Who am I to disagree?
✅ Fully automated – runs on schedule or push, no intern sacrifice needed
✅ Version-controlled – so nobody blames you for breaking production (again)
❌ Building pipeline dependencies is complicated – feels like assembling IKEA furniture without instructions
❌ YAML overload – one misplaced space, and your pipeline just… doesn’t
❌ Not the right tool – sooner or later you will hit the ceiling
Running dlt with AWS Step Functions + AWS Lambda (or another serverless scheduler/function)
Before jumping into a fully-fledged scheduler, let’s also mention orchestrator tools like AWS Step Functions, GCP Workflows, and Azure Logic Apps. These tools can run multi-step processes (like an ETL job), but they lack a built-in scheduler. In fact, you have to nudge them to start: in AWS, a common pattern is to use AWS EventBridge to trigger a workflow.
In AWS the setup will look like this:
In a folder called my_function, have the run_pipeline.py and a lambda_function.py like this:
# lambda_function.py
from run_pipeline import pipeline, events  # the pipeline and resource defined in run_pipeline.py

def lambda_handler(event, context):
    try:
        pipeline.run(events())  # run the dlt pipeline
        return {"status": "success"}
    except Exception as e:
        return {"status": "failed", "error": str(e)}
In the my_function folder run:
mkdir package
pip install --target ./package dlt
cd package
zip -r ../my_deployment_package.zip .
cd ..
zip my_deployment_package.zip lambda_function.py run_pipeline.py
You can find more information about this process here.
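If you prefer scripting the Lambda creation over clicking through the console, here is a minimal boto3 sketch (the IAM role ARN is a placeholder you need to create yourself; the function name dlt-runner matches the ARN used in the Step Function definition below):

# Sketch: create the dlt-runner Lambda from the zip built above
import boto3

lambda_client = boto3.client("lambda")

with open("my_deployment_package.zip", "rb") as f:
    zip_bytes = f.read()

lambda_client.create_function(
    FunctionName="dlt-runner",
    Runtime="python3.12",
    Role="arn:aws:iam::ACCOUNT_ID:role/dlt-lambda-role",  # hypothetical execution role
    Handler="lambda_function.lambda_handler",
    Code={"ZipFile": zip_bytes},
    Timeout=900,       # Lambda's 15-minute maximum
    MemorySize=512,
)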
Then deploy a Step Function (dlt-workflow) that invokes the dlt-runner Lambda built from my_deployment_package.zip in the previous steps (replace REGION and ACCOUNT_ID with your AWS details):
{
  "Comment": "Step Function to Run DLT Pipeline",
  "StartAt": "RunDLTPipeline",
  "States": {
    "RunDLTPipeline": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:dlt-runner",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 60,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
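If you want to script this step too, a boto3 sketch could look like this (the role ARN is again a placeholder, and dlt_workflow.json is simply the definition above saved to a file):

# Sketch: create the dlt-workflow state machine from the definition above
import boto3

sfn = boto3.client("stepfunctions")

with open("dlt_workflow.json") as f:
    definition = f.read()

sfn.create_state_machine(
    name="dlt-workflow",
    definition=definition,
    roleArn="arn:aws:iam::ACCOUNT_ID:role/dlt-stepfunctions-role",  # hypothetical role
)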
Finally, schedule the execution with AWS EventBridge:
aws events put-rule \
--name "DailyDLTTrigger" \
--schedule-expression "cron(0 2 * * ? *)"
aws events put-targets \
--rule "DailyDLTTrigger" \
--targets "[{\"Id\":\"1\",\"Arn\":\"arn:aws:states:REGION:ACCOUNT_ID:stateMachine:dlt-workflow\",\"RoleArn\":\"arn:aws:iam::ACCOUNT_ID:role/EventBridgeExecutionRole\"}]"
In GCP, you can do the same with Cloud Functions + Workflow + Cloud Scheduler. Let me know if you want a detailed breakdown of that setup as well.
Knowing this kind of setup is very important for a consultant. While it doesn’t offer all the bells and whistles of a full-fledged scheduler like Airflow, it is very cost-effective, and it can be extended to provide monitoring and alerting in case something doesn’t go as expected.
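For example, a simple extension (a sketch, assuming you have created an SNS topic called dlt-alerts and subscribed your email to it) is to publish a notification from the Lambda whenever the pipeline fails:

# Sketch: alert an SNS topic when the pipeline fails (topic ARN is a placeholder)
import boto3

sns = boto3.client("sns")

def notify_failure(error: str) -> None:
    sns.publish(
        TopicArn="arn:aws:sns:REGION:ACCOUNT_ID:dlt-alerts",
        Subject="dlt pipeline failed",
        Message=error,
    )

Call it from the except branch of the lambda_handler shown earlier, or add a Catch state in the Step Function that routes failures to a notification task.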
Moreover, learning how to build and deploy such a pipeline is a great hands-on exercise for getting familiar with cloud tools that can be applied to many other scenarios.
✅ Scalable – in case you ever need more
✅ No infra to manage – servers are someone else’s problem
✅ Cost-efficient – pay only for what you use (looking at you, MWAA and Composer)
❌ Requires knowledge of multiple tools – congratulations, you are now the company authority on Step Functions, IAM, and Cloud Logging.
❌ Execution limits – AWS Lambda (15 min), GCP (9 min), or learn about Batch.
❌ Debugging can be an extreme sport – logs across three services (and a secret settings page) 🤠
Running dlt with Airflow
I was planning to continue this article with dlt and Airflow, but I am running out of Larry David GIFs (not true, not possible). That said, Airflow requires a dedicated article, because, off the top of my head, you can run dlt (or any other Python code) in at least three different ways.
Diving into these details would make this article even longer, so I hope you don’t mind if I stop here. Let me know if you want me to continue with Airflow or if I should look at something else.
Pretty.... pretty.... pretty good!