SQL Server · SSMS · Azure Data Factory · Azure Databricks · PySpark

On-Premises to Cloud Data Migration

Built a full Medallion Architecture pipeline on Azure that migrated enterprise data from on-premises SQL Server into a scalable cloud data platform. Azure Data Factory handles ingestion into the Bronze layer, Azure Databricks + PySpark handles transformation into the Silver and Gold layers, and the final Gold tables power analytics dashboards and business reporting.

Role

Data Engineer

Duration

2 months

Code

🔒 Private Repo

The Problem

The organization had years of enterprise data locked inside on-premises SQL Server databases. Running analytics on it meant querying production systems directly, which slowed down live operations and limited what analysts could do. There was no scalable way to run historical analysis, no separation between operational and analytical workloads, and no cloud flexibility.

⚠️ Querying production SQL Server for analytics was impacting live application performance. The business needed a clean separation between operational and analytical workloads.

Architecture

Architecture Diagram

[Diagram: SQL Server (On-Prem) → Azure Data Factory → Bronze Layer → Azure Databricks → Silver Layer → Gold Layer → Dashboards, with SSMS managing the on-prem source]

Data Flow

🗄️ SQL Server (On-Prem) ── manage ──→ 🔧 SSMS
🗄️ SQL Server (On-Prem) ── extract ──→ 🔄 Azure Data Factory
🔄 Azure Data Factory ── raw load ──→ 🪣 Bronze Layer
🪣 Bronze Layer ── trigger ──→ Azure Databricks
Azure Databricks ── clean + validate ──→ 🧹 Silver Layer
🧹 Silver Layer ── aggregate + model ──→ 🏆 Gold Layer
🏆 Gold Layer ── serve ──→ 📊 Dashboards

Tech Stack

SQL Server · SSMS · Azure Data Factory · Azure Databricks · PySpark

Solution — Medallion Architecture

Bronze Layer — Raw Ingestion

Azure Data Factory pipelines extract data from every SQL Server table using a dynamic, parameterized approach — no hardcoded table names. The initial run pulls each table in full and lands it as-is into the Bronze layer on Azure Data Lake Storage; subsequent runs load incrementally using watermarks (see Key Challenges). This is the immutable raw copy: nothing is modified here.
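The dynamic pattern can be sketched outside ADF: a Lookup-style query lists the source tables, and each (schema, table) pair maps to a parameterized Bronze landing path. Everything below — the query, `bronze_path`, the sample table list — is an illustrative stand-in for what the real ADF Lookup and ForEach activities do, not code from the project.

```python
# Simplified illustration of ADF's dynamic ingestion pattern:
# a Lookup activity lists source tables, and a ForEach loop maps
# each one to a parameterized Bronze sink path. All names here
# are hypothetical -- the real pipeline does this with ADF activities.

# The kind of query an ADF Lookup activity would run against the source:
LOOKUP_QUERY = """
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
"""

def bronze_path(schema: str, table: str, container: str = "bronze") -> str:
    """Build the Data Lake sink path for one source table."""
    return f"{container}/{schema.lower()}/{table.lower()}/"

# Rows the Lookup might return (illustrative):
tables = [("dbo", "Customers"), ("dbo", "Orders"), ("sales", "Invoices")]

sink_paths = [bronze_path(s, t) for s, t in tables]
# -> ['bronze/dbo/customers/', 'bronze/dbo/orders/', 'bronze/sales/invoices/']
```

Because the table list is discovered at runtime, adding a new source table requires no pipeline change — it simply appears in the next Lookup result.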

Silver Layer — Cleaned and Validated

Azure Databricks notebooks process Bronze data using PySpark. At this layer: schema is enforced, data types are cast correctly, null values are handled, duplicates are removed, and basic business rules are applied. The output is a clean, trusted dataset ready for modeling.
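The Silver-layer rules can be shown on plain Python rows; the production notebooks express the same steps with PySpark operations (`withColumn`/`cast`, `dropna`, `dropDuplicates`). The target schema, column names, and the sample rows below are hypothetical.

```python
# Frame-free illustration of the Silver-layer rules: enforce schema,
# cast types, drop bad rows, deduplicate, apply a basic business rule.
# The real job does this in PySpark; names here are hypothetical.

TARGET_SCHEMA = {"order_id": int, "amount": float, "order_date": str}

def to_silver(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Schema enforcement: keep only expected columns, cast types.
        try:
            rec = {col: typ(row[col]) for col, typ in TARGET_SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue                      # drop rows that violate the schema
        # Basic business rule: no non-positive amounts.
        if rec["amount"] <= 0:
            continue
        # Deduplicate on the primary key.
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        cleaned.append(rec)
    return cleaned

bronze_rows = [
    {"order_id": "1", "amount": "19.99", "order_date": "2024-01-05"},
    {"order_id": "1", "amount": "19.99", "order_date": "2024-01-05"},  # duplicate
    {"order_id": "2", "amount": None,   "order_date": "2024-01-06"},   # bad value
    {"order_id": "3", "amount": "5.00", "order_date": "2024-01-07"},
]
silver_rows = to_silver(bronze_rows)   # 2 clean rows survive
```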

Gold Layer — Business Ready

Silver data is aggregated and modeled into business-oriented fact and dimension tables in the Gold layer. Views are created on top of these tables so analysts and BI tools always query the latest, correctly modeled data without touching any upstream layers.
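The Silver → Gold step is essentially a group-by aggregation into fact rows. A minimal sketch, assuming a hypothetical daily sales fact (in the project this would be a PySpark `groupBy`/`agg` writing a Delta table):

```python
# Sketch of Silver -> Gold modeling: roll clean order rows up into a
# daily sales fact table. Column and table names are illustrative.
from collections import defaultdict

silver_orders = [
    {"order_date": "2024-01-05", "amount": 19.99},
    {"order_date": "2024-01-05", "amount": 5.00},
    {"order_date": "2024-01-06", "amount": 12.50},
]

def build_fact_daily_sales(rows):
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for r in rows:
        day = totals[r["order_date"]]
        day["order_count"] += 1
        day["total_amount"] = round(day["total_amount"] + r["amount"], 2)
    # One fact row per date key, ready to back a dashboard view.
    return [{"order_date": d, **m} for d, m in sorted(totals.items())]

fact_daily_sales = build_fact_daily_sales(silver_orders)
# 2024-01-05 -> 2 orders, 24.99 total; 2024-01-06 -> 1 order, 12.50 total
```

Serving dashboards from views over such fact tables is what lets the upstream Bronze and Silver layers stay untouched by BI traffic.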

Key Challenges

  • Dynamic pipeline design — hardcoding table names would make the pipeline brittle. Built ADF pipelines with parameterized linked services and lookup activities to discover and ingest all tables automatically
  • Schema inconsistencies — on-premises SQL Server tables had mixed data types and inconsistent naming conventions. Built a PySpark schema normalization step at the Bronze → Silver boundary
  • Incremental loads — full extracts of large tables were slow and expensive. Implemented watermark-based incremental loading in ADF using high-watermark columns on source tables
  • Data validation across layers — silent data loss between layers was a risk. Added row count reconciliation checks between Bronze and Silver after every Databricks job
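Two of the patterns above — watermark-based incremental loading and row-count reconciliation — can be sketched in miniature. The control "table" is an in-memory dict and all names are hypothetical; in the project, ADF persists the watermark and the check runs after each Databricks job:

```python
# Sketch of watermark-based incremental loading: each run extracts only
# rows modified after the stored high watermark, then advances it.
# The in-memory dict stands in for a real watermark control table.

watermarks = {"dbo.Orders": "2024-01-05T00:00:00"}   # last successful load

def incremental_extract(source_rows, table):
    """Return rows newer than the stored watermark, then advance it."""
    wm = watermarks[table]
    delta = [r for r in source_rows if r["modified_at"] > wm]
    if delta:
        watermarks[table] = max(r["modified_at"] for r in delta)
    return delta

source = [
    {"id": 1, "modified_at": "2024-01-04T09:00:00"},  # already loaded
    {"id": 2, "modified_at": "2024-01-05T10:30:00"},
    {"id": 3, "modified_at": "2024-01-06T08:15:00"},
]

batch = incremental_extract(source, "dbo.Orders")     # rows 2 and 3 only

# Row-count reconciliation between layers: fail fast on silent loss.
def reconcile(bronze_count, silver_count, expected_rejects=0):
    assert bronze_count - silver_count == expected_rejects, (
        f"row mismatch: bronze={bronze_count}, silver={silver_count}"
    )
```

ISO-8601 timestamps compare correctly as strings, which keeps the watermark filter simple; a real implementation would use the source column's native datetime type.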

Results

  • Data migrated: 100% (all on-prem tables moved)
  • Production DB queries from analytics: 0 (analytics fully decoupled)
  • Data layers built: 3 (Bronze, Silver, Gold)
  • Pipeline design: dynamic (no hardcoded table names)
  • Storage format: Delta (time travel + history enabled)
  • Reporting speed: fast (queries hit Gold, not the production DB)

✅ The Medallion Architecture gave the business something they never had before — a clear, layered data platform where raw data, clean data, and business data are always separated, always traceable, and never mixed.

Built by

Jeevan Gaire

Data Engineer · 2 months
