Databricks · 10 min read · April 16, 2026

Stop Checking Workflows Manually — Build a Centralized Databricks Monitoring Dashboard

If you manage 10–15 Databricks workspaces with hundreds of workflows each, clicking through every job to check status is killing your productivity. Here's the exact step-by-step process to build a fully automated monitoring dashboard using the Databricks Jobs API, Delta tables, and Databricks Dashboards.

Let me ask you an honest question. How do you currently monitor your Databricks workflows? Do you log into each workspace, click through the Jobs UI, check each workflow one by one, and manually note down what failed, what succeeded, what's still running? If you have two or three workflows, fine. But what if you're managing 10 to 15 Databricks workspaces, and each one is running anywhere from 10 to 500 workflows daily?

That is not monitoring. That is a full-time job just keeping the lights on. I have been there — opening 15 browser tabs, clicking through job run history, cross-referencing timestamps, trying to figure out if a failure in workspace 7 is related to something in workspace 12. It can easily consume your entire morning. And by the time you have finished checking everything, something new has already failed.

The Databricks Jobs REST API exposes everything you need — run status, start and end time, error messages, error codes, direct links to failed runs. Combine that with a scheduled notebook, a Delta table, and a Databricks Dashboard, and you can replace that entire manual process with a single URL that updates itself automatically. This post walks you through exactly how to build it.

💡 This entire setup lives in one central Databricks workspace. One notebook, one Delta table, one dashboard — monitoring every workspace from a single pane of glass.

What You Are Building

A scheduled notebook calls the Databricks Jobs API for every workspace you want to monitor. It pulls the latest run status for every workflow and upserts that data into a Delta table. A Databricks Dashboard reads from that Delta table and displays live KPIs, a status distribution chart, and a clickable failed workflows table — all filterable by business date and source system. Three components, zero manual effort once it is running.

What You Need Before Starting

  • Access to all Databricks workspaces you want to monitor — at minimum 'Can View' permission on Jobs
  • One central Databricks workspace where everything will live — the notebook, the Delta table, and the dashboard
  • Unity Catalog enabled on the central workspace — recommended for clean table access and governance
  • Admin access or the ability to create Secret Scopes for storing API tokens securely

Step 1 — Generate Access Tokens for Each Workspace

Your central monitoring notebook needs to authenticate against each workspace's REST API. There are two ways to do this — Personal Access Tokens for simplicity, or a Service Principal for production-grade security.

For a Personal Access Token, go to each workspace, click your profile icon in the top right, go to Settings, then Developer, then Access Tokens, and generate a new token. Give it a meaningful name like monitoring-service-token and set an expiry of 90 or 365 days. Copy it immediately — you cannot see it again after closing the dialog. Put a calendar reminder to rotate it before expiry.

For a Service Principal — the right choice for any production environment — create a new App Registration in Azure Active Directory, add that service principal to each Databricks workspace under Settings → Identity and Access → Service Principals, and grant it Can View permission on Jobs. A service principal uses short-lived OAuth tokens that are automatically rotated, which is far more secure than long-lived PATs.

⚠️ Never store tokens directly in your notebook code. Anyone with access to the notebook can see them, and they will appear in version history forever.

Once you have your tokens, store them in Databricks Secrets. In your central workspace, create a new Secret Scope — using the Databricks CLI (databricks secrets create-scope) or the Secrets REST API — and name it something like workspace-monitoring. Add each workspace token as a separate secret key using a consistent naming pattern, for example ws-production-eu, ws-production-us, ws-staging. Your notebook retrieves them at runtime using dbutils.secrets.get() and they never appear in logs or notebook output.
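As a rough sketch of the notebook side, assuming a scope named workspace-monitoring and illustrative workspace names and URLs:

```python
# Hypothetical workspace registry: display name, base URL, and the secret key holding its token.
# Adjust the scope name, keys, and URLs to match whatever you actually created above.
WORKSPACES = {
    "production-eu": {"host": "https://adb-1111111111111111.1.azuredatabricks.net", "secret_key": "ws-production-eu"},
    "production-us": {"host": "https://adb-2222222222222222.2.azuredatabricks.net", "secret_key": "ws-production-us"},
    "staging":       {"host": "https://adb-3333333333333333.3.azuredatabricks.net", "secret_key": "ws-staging"},
}

def get_token(secret_key: str) -> str:
    # dbutils is available implicitly in Databricks notebooks; the value is redacted in output and logs.
    return dbutils.secrets.get(scope="workspace-monitoring", key=secret_key)
```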

Step 2 — Tag Your Workflows Consistently

This step is easy to skip and you will regret it later. Workflow tags are what power the Source System filter in your dashboard. Without consistent tags, every workflow is just a name with no way to group or filter by team or system.

In each workspace, go to Workflows, click on a workflow, go to Edit, and add tags. The critical thing is to use the exact same tag key names across every workspace — if one workspace uses source_system and another uses SourceSystem, your dashboard filter will not work correctly.

  • source_system — the upstream system the workflow is ingesting from, e.g. SAP, Salesforce, Oracle, MySQL, Snowflake
  • team — which team owns this workflow, e.g. data-engineering, analytics, finance, marketing
  • environment — production, staging, or dev
  • pipeline_type — ingestion, transformation, reporting, or quality-check

If you have hundreds of workflows to tag, do not do it manually one by one. Use the Databricks Jobs API to bulk update tags programmatically — it has a jobs/update endpoint that accepts tag updates. Or coordinate with each team to tag their own workflows since they know best which source system each one belongs to.
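If you go the programmatic route, a per-job tag update could look roughly like the sketch below; the endpoint path follows Jobs API 2.1, but verify the partial-update behavior of new_settings in your workspace before running it across hundreds of jobs:

```python
import requests

def set_job_tags(host: str, token: str, job_id: int, tags: dict) -> None:
    """Sketch: partially update a job's settings so its tags match `tags`.
    Confirm that updating only `tags` via new_settings behaves as expected
    in your workspace before bulk-running this."""
    resp = requests.post(
        f"{host}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id, "new_settings": {"tags": tags}},
        timeout=30,
    )
    resp.raise_for_status()

# Example: tag one workflow; loop over a job_id -> tags mapping for bulk updates.
# set_job_tags(host, token, 123456, {"source_system": "SAP", "team": "data-engineering"})
```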

Step 3 — Create the Delta Table in Your Central Workspace

In your central Databricks workspace, go to the Data section and create a dedicated catalog for monitoring — something like monitoring or ops. Inside that catalog, create a schema called workflows. Your main table will be monitoring.workflows.run_history.

Decide between a managed table and an external table. Managed is simpler — Unity Catalog handles everything. External is better if you want the underlying Parquet and Delta log files accessible from outside Databricks, for example if you want another team to query this data from Azure Synapse or a BI tool.

Partition the table by business_date and workspace_name. This is important for dashboard performance — when someone filters to today's date, Spark only scans today's partition instead of the entire history. As this table grows over weeks and months, partitioning is what keeps queries fast.

The columns you need in this table are: workspace_name, job_id, run_id, job_name, life_cycle_state, result_state, status, error_message, start_time, end_time, business_date, run_url, tags, and ingested_at. The run_url column is particularly important — it stores the direct URL to each specific run in the Databricks UI, which is what makes your dashboard links clickable.
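A minimal DDL sketch for that table; the column types are assumptions, so adjust them to whatever your notebook actually produces:

```python
spark.sql("""
CREATE TABLE IF NOT EXISTS monitoring.workflows.run_history (
    workspace_name   STRING,
    job_id           BIGINT,
    run_id           BIGINT,
    job_name         STRING,
    life_cycle_state STRING,
    result_state     STRING,
    status           STRING,
    error_message    STRING,
    start_time       TIMESTAMP,
    end_time         TIMESTAMP,
    business_date    DATE,
    run_url          STRING,
    tags             MAP<STRING, STRING>,
    ingested_at      TIMESTAMP
)
USING DELTA
PARTITIONED BY (business_date, workspace_name)
""")
```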

Step 4 — Build the Monitoring Notebook

Create a new notebook in your central workspace. This notebook does three things: retrieves tokens from Secrets, calls the Jobs API for each workspace, and merges the results into your Delta table.

The Databricks Jobs API endpoint you need is /api/2.1/jobs/runs/list. You call it against each workspace's base URL with a bearer token, and it returns a paginated list of recent runs with all their metadata including state, timing, error details, tags, and the direct run URL. When deciding how many runs to fetch per workflow, think about your workflow schedules — some workflows may run every 15 minutes, and if your monitoring notebook runs every hour, a single hour window contains 4 runs for that workflow alone. If you only fetch the last 1 or 2 runs you will silently miss the ones in between. Fetching the last 10 to 15 runs per workflow is the safer approach — it ensures that even your most frequently scheduled workflows have all their runs captured between each monitoring execution, with enough buffer for any minor delays or retries.
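A sketch of the fetch side, assuming one runs/list call per job and a paginated jobs/list call to enumerate the jobs; double-check the pagination fields against the Jobs API 2.1 reference for your workspace:

```python
import requests

def list_jobs(host: str, token: str) -> list[dict]:
    """Page through /api/2.1/jobs/list to enumerate every job in a workspace."""
    jobs, page_token = [], None
    while True:
        params = {"limit": 100}
        if page_token:
            params["page_token"] = page_token
        resp = requests.get(f"{host}/api/2.1/jobs/list",
                            headers={"Authorization": f"Bearer {token}"},
                            params=params, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        jobs.extend(body.get("jobs", []))
        page_token = body.get("next_page_token")
        if not page_token:
            return jobs

def list_recent_runs(host: str, token: str, job_id: int, limit: int = 15) -> list[dict]:
    """Fetch the most recent runs for one job from /api/2.1/jobs/runs/list."""
    resp = requests.get(f"{host}/api/2.1/jobs/runs/list",
                        headers={"Authorization": f"Bearer {token}"},
                        params={"job_id": job_id, "limit": limit}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("runs", [])
```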

For each raw run returned by the API, extract the fields that matter: the job name, the run ID, the life cycle state (RUNNING, TERMINATED, SKIPPED), the result state (SUCCESS, FAILED, CANCELED, TIMEDOUT), the error message if any, start and end timestamps, the tags dictionary, and the run page URL. Derive the business_date from the start timestamp — this is the date field your dashboard date filter will use.
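One way to flatten a raw run into the row shape from Step 3; the key names follow the runs/list response, timestamps arrive as epoch milliseconds, and the tags here are passed in from the job definition rather than read off the run itself:

```python
from datetime import datetime, timezone

def flatten_run(workspace_name: str, job_name: str, tags: dict, run: dict) -> dict:
    """Map one raw run (plus the job's name and tags) onto the Delta table columns."""
    state = run.get("state", {})
    to_ts = lambda ms: datetime.fromtimestamp(ms / 1000, tz=timezone.utc) if ms else None
    start_ts = to_ts(run.get("start_time"))
    return {
        "workspace_name": workspace_name,
        "job_id": run.get("job_id"),
        "run_id": run.get("run_id"),
        "job_name": job_name,
        "life_cycle_state": state.get("life_cycle_state"),
        "result_state": state.get("result_state"),
        "status": state.get("result_state") or state.get("life_cycle_state"),
        "error_message": state.get("state_message"),
        "start_time": start_ts,
        "end_time": to_ts(run.get("end_time")),       # end_time is 0 while still running
        "business_date": start_ts.date() if start_ts else None,
        "run_url": run.get("run_page_url"),
        "tags": tags,
        "ingested_at": datetime.now(timezone.utc),
    }
```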

Use MERGE when writing to the Delta table, not overwrite. MERGE matches on run_id and workspace_name as the composite key and updates existing records while inserting new ones. This means a run that was RUNNING in the last fetch and is now FAILED gets its status updated correctly in the table without creating a duplicate row.

💡 Use MERGE not INSERT. A workflow that was RUNNING in the 9am fetch will be FAILED or SUCCESS by the 10am fetch. MERGE ensures that single run_id always has the latest and correct status.
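A sketch of the merge itself, assuming the flattened rows have already been loaded into a temporary view called new_runs:

```python
# Assumes the fetched rows were turned into a DataFrame and registered first, e.g.:
# spark.createDataFrame(rows, schema=run_history_schema).createOrReplaceTempView("new_runs")
spark.sql("""
MERGE INTO monitoring.workflows.run_history AS target
USING new_runs AS source
ON  target.run_id = source.run_id
AND target.workspace_name = source.workspace_name
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
```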

Step 5 — Schedule the Notebook as a Workflow

Go to Workflows in your central workspace and create a new job. Set the task to run your monitoring notebook on a cluster — a small single-node cluster is perfectly sufficient since this is just API calls and a Delta merge, not heavy computation.

Set the schedule based on how quickly you need to react to failures. Every hour is a good default for most environments. Every 30 minutes if your workflows are business-critical and failures need immediate attention. Every 15 minutes if you are running near-real-time pipelines. Avoid going below 15 minutes — the API has rate limits and you will start hitting them.

  • Set an email notification on failure for the monitoring workflow itself — if the monitor goes down you need to know
  • Tag this workflow with pipeline_type: monitoring so it is easily identifiable in the Jobs UI
  • Give the cluster a timeout of 15 minutes — if it runs longer than that something is wrong and you want it to fail cleanly rather than hang
  • Enable retries with a 5-minute delay — transient API errors should self-resolve on retry (a sketch collecting these settings into one job definition follows this list)
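Collected together, those settings might look roughly like this as a Jobs API payload; the notebook path and notification address are placeholders, and you can just as easily set the same values in the Workflows UI:

```python
monitoring_job = {
    "name": "workflow-monitoring",
    "tags": {"pipeline_type": "monitoring"},
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",     # top of every hour
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["data-platform-team@example.com"]},
    "timeout_seconds": 900,                           # 15-minute hard stop
    "tasks": [{
        "task_key": "poll_workspaces",
        "notebook_task": {"notebook_path": "/Monitoring/poll_workflow_runs"},
        "max_retries": 1,
        "min_retry_interval_millis": 5 * 60 * 1000,   # 5-minute delay between retries
    }],
}
# POST this to /api/2.1/jobs/create in the central workspace, or mirror it in the UI.
```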

Step 6 — Build the Dashboard

Go to Dashboards in your central workspace and create a new dashboard. Connect it to your monitoring.workflows.run_history Delta table. Everything you build here reads from that one table.

Global Filters — Add These First

Add global filters before adding any widgets. Global filters apply to every widget on the dashboard simultaneously — changing the date picker updates all KPI cards, the chart, and all tables at once.

  • Business Date — a date picker widget mapped to the business_date column. Set the default value to today's date so the dashboard always opens showing the current day without any manual selection
  • Source System — a dropdown widget populated from the source_system tag values in your data. Selecting SAP filters everything to only SAP-tagged workflows. Teams can own their view without affecting each other
  • Workspace — an optional dropdown to isolate a specific workspace. Useful during incidents when you want to focus on one environment

KPI Counter Cards — Top of the Dashboard

Place five KPI counter cards in a row at the very top of the dashboard. These are the first thing anyone looks at when they open the page — they should tell the complete health story in under five seconds. The query behind these counters is sketched after the list below.

  • Total Runs — total number of workflow executions for the selected business date. Gives context for all other numbers
  • Success — count of runs with result_state SUCCESS. Color this green. This should be the biggest number on a healthy day
  • Failed — count of runs with result_state FAILED. Color this red. This is the most important KPI — if it is non-zero, something needs attention
  • Running — count of runs with life_cycle_state RUNNING. Color this blue. Shows what is actively executing right now
  • Canceled — count of runs with result_state CANCELED. Color this amber. Canceled runs are often overlooked but can indicate upstream dependency failures
💡 A healthy dashboard looks like: Total Runs 450, Success 443, Failed 2, Running 5, Canceled 0. Any non-zero Failed count should immediately draw your eye.
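Each card is a single count over the filtered table. A sketch of the backing query, which you can prototype in a notebook cell before wiring it into the dashboard dataset (the hard-coded date predicate stands in for the global Business Date filter):

```python
kpi_sql = """
SELECT
    COUNT(*)                               AS total_runs,
    COUNT_IF(result_state = 'SUCCESS')     AS success,
    COUNT_IF(result_state = 'FAILED')      AS failed,
    COUNT_IF(life_cycle_state = 'RUNNING') AS running,
    COUNT_IF(result_state = 'CANCELED')    AS canceled
FROM monitoring.workflows.run_history
WHERE business_date = current_date()
"""
display(spark.sql(kpi_sql))
```

The same aggregation, grouped by status instead of pivoted into columns, also feeds the pie chart in the next section.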

Pie Chart — Status Distribution

Below the KPI cards, add a pie chart showing the proportion of each status for the selected business date. Use the same color coding — green for SUCCESS, red for FAILED, blue for RUNNING, amber for CANCELED. On a healthy day the chart should be almost entirely green. A large red slice means something systemic is wrong — possibly a shared dependency like a database connection or a storage account that multiple workflows depend on.

The pie chart is especially useful when presenting to non-technical stakeholders. Rather than explaining numbers, you show a visual where a red slice of 30 percent immediately communicates that something serious needs fixing today.

Failed Workflows Table — Make It Clickable

This is the most operationally valuable component of the entire dashboard. A table showing only FAILED workflows for the selected date with these columns: workflow name, workspace name, source system, error message, start time, duration in minutes, and a direct link to the failed run.

In Databricks Dashboards, set the run_url column to render as a hyperlink with display text like Open Failed Run. Now every row in this table is a one-click portal to the exact failing run in the Jobs UI — no navigating through workspaces, no searching for the job, no scrolling through run history. Click, land directly on the failure, read the full error and stack trace, fix it.

Sort this table by start_time descending so the most recent failures appear at the top. Add a duration column so you can immediately see if a job failed after 2 minutes — likely a configuration issue — versus failing after 3 hours, which suggests it ran successfully for most of its execution and failed on something late like a write or a data quality check.
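A sketch of the dataset behind this table, with the duration derived from the two timestamps and the same stand-in date predicate:

```python
failed_sql = """
SELECT
    job_name                              AS workflow_name,
    workspace_name,
    tags['source_system']                 AS source_system,
    error_message,
    start_time,
    ROUND((unix_timestamp(end_time) - unix_timestamp(start_time)) / 60, 1)
                                          AS duration_minutes,
    run_url
FROM monitoring.workflows.run_history
WHERE result_state = 'FAILED'
  AND business_date = current_date()
ORDER BY start_time DESC
"""
display(spark.sql(failed_sql))
```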

Currently Running Workflows Table

Add a second table below the failed workflows table showing all workflows currently in RUNNING state. Include the workflow name, workspace, source system, start time, and a calculated column showing how many minutes it has been running. Sort by running duration descending so the longest-running workflows appear at the top.

This table serves two purposes. First, it tells you what is actively executing right now so you do not accidentally trigger a re-run of something that is already in progress. Second, workflows that have been running for an unusually long time — say 3 or 4 times their normal duration — are probably stuck. Spotting them here lets you intervene before they time out and become a failure.
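A sketch of the running-workflows dataset, with the minutes-running column computed against the current time:

```python
running_sql = """
SELECT
    job_name                              AS workflow_name,
    workspace_name,
    tags['source_system']                 AS source_system,
    start_time,
    ROUND((unix_timestamp(current_timestamp()) - unix_timestamp(start_time)) / 60, 0)
                                          AS minutes_running
FROM monitoring.workflows.run_history
WHERE life_cycle_state = 'RUNNING'
ORDER BY minutes_running DESC
"""
display(spark.sql(running_sql))
```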

Optional — Runs Over Time Bar Chart

Consider adding a bar chart showing total runs grouped by hour for the selected business date, with bars stacked by status. This gives you a time-of-day view — when do most workflows run, when do failures cluster, are there peak hours where the platform is under heavy load. This is particularly useful for capacity planning and for identifying whether failures correlate with specific time windows like end-of-business batch runs.
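If you add it, the dataset is a simple group-by on the run hour and status; stack the bars by status in the chart settings:

```python
hourly_sql = """
SELECT
    date_trunc('HOUR', start_time) AS run_hour,
    status,
    COUNT(*)                       AS runs
FROM monitoring.workflows.run_history
WHERE business_date = current_date()
GROUP BY date_trunc('HOUR', start_time), status
ORDER BY run_hour
"""
display(spark.sql(hourly_sql))
```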

How to Use the Dashboard Day to Day

Your morning routine changes completely once this is live. Instead of opening 15 workspace tabs and checking jobs one by one, you open one URL. The dashboard defaults to today's date. You look at the five KPI cards — if Failed is zero, you are done in 10 seconds. If Failed is non-zero, you look at the failed workflows table, read the error messages to triage severity, click the direct links on anything critical, and go straight to the failing run. Total time: under five minutes regardless of how many workspaces you are monitoring.

For historical investigation — say a stakeholder asks why their report was wrong on May 10th — you change the business date picker to May 10th and instantly see every workflow run from that day, what failed, what the error was, and how long each run took. No digging through logs, no asking people what happened.

Things to Add Once the Base Dashboard Is Working

  • Slack or email alert from the notebook itself — when the notebook detects new failures during its run, post a message to a Slack channel with the workflow name, error summary, and direct link. This means failures are pushed to you instead of waiting for someone to check the dashboard (a minimal webhook sketch follows this list)
  • SLA tracking — add an expected maximum duration for each workflow and flag runs that exceed it as SLA breaches, separate from outright failures
  • 7-day failure trend — a line chart showing daily failure counts for the past week, useful for spotting if failure rates are increasing over time
  • Failure rate by workspace — a bar chart comparing failure percentages across workspaces, immediately shows if one workspace is disproportionately unreliable
  • Most frequently failing workflows — a ranked table of workflows by failure count over the past 30 days. If the same workflow appears at the top every week it needs a permanent fix, not just daily restarts
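For the Slack alert, an incoming-webhook POST at the end of the monitoring notebook is usually enough. A rough sketch, assuming the webhook URL lives in the same secret scope under a key like slack-webhook-url:

```python
import requests

def post_failure_alert(failed_rows: list[dict]) -> None:
    """Push newly detected failures to Slack via an incoming webhook."""
    if not failed_rows:
        return
    webhook_url = dbutils.secrets.get(scope="workspace-monitoring", key="slack-webhook-url")
    lines = [
        f"• {r['job_name']} ({r['workspace_name']}): {r['error_message']} | <{r['run_url']}|open run>"
        for r in failed_rows
    ]
    requests.post(
        webhook_url,
        json={"text": f":rotating_light: {len(failed_rows)} workflow failure(s) detected\n" + "\n".join(lines)},
        timeout=10,
    ).raise_for_status()
```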

What This Replaces

Before: 2 to 3 hours every morning opening workspace tabs, scrolling through job run histories, manually noting failures, copying error messages into a spreadsheet, trying to figure out which failures are related. After: open one URL, five minutes, done. The monitoring notebook runs automatically every hour without any human involvement. Every failure is captured, timestamped, tagged by source system, and directly linkable.

The initial setup takes around half a day to a full day depending on how many workspaces you have and how much tagging cleanup is needed. The time savings start from day one and compound every single day. If you are managing more than a handful of Databricks workflows and still checking status manually, this is the most impactful thing you can build this week.

✅ Summary — Databricks Jobs API → Secret Scope for tokens → tag your workflows → Delta table with MERGE → scheduled notebook every hour → Databricks Dashboard with date picker + source system filter + 5 KPI cards + pie chart + clickable failed runs table + running workflows table. One setup, fully automated, covers every workspace from a single URL.