On-Premises to Cloud Data Migration
Built a full Medallion Architecture pipeline on Azure that migrated enterprise data from on-premises SQL Server into a scalable cloud data platform. Azure Data Factory handles ingestion into the Bronze layer, Azure Databricks + PySpark handles transformation into Silver and Gold layers, and the final Gold tables power analytics dashboards and business reporting.
Role: Data Engineer
Duration: 2 months
Code: 🔒 Private Repo

The Problem
The organization had years of enterprise data locked inside on-premises SQL Server databases. Running analytics on it meant querying production systems directly — slowing down live operations and limiting what analysts could do. There was no scalable way to run historical analysis, no separation between operational and analytical workloads, and no cloud flexibility.
⚠️ Querying production SQL Server for analytics was impacting live application performance. The business needed a clean separation between operational and analytical workloads.
Architecture
Solution — Medallion Architecture
Bronze Layer — Raw Ingestion
Azure Data Factory pipelines extract data from every SQL Server table using a dynamic, parameterized approach — no hardcoded table names. The initial run pulls each table in full (later runs load incrementally) and lands the data as-is in the Bronze layer on Azure Data Lake Storage. This is the immutable raw copy — nothing is modified here.
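A sketch of what such a dynamic pipeline can look like in ADF's JSON definition — pipeline, dataset, and parameter names here are illustrative, and the Copy activity is abbreviated. A Lookup activity discovers the table list from the source's metadata, and a ForEach loop copies each table through a single parameterized dataset:

```json
{
  "name": "pl_ingest_all_tables",
  "activities": [
    {
      "name": "LookupTableList",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "SqlServerSource",
          "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
        },
        "firstRowOnly": false
      }
    },
    {
      "name": "ForEachTable",
      "type": "ForEach",
      "dependsOn": [{ "activity": "LookupTableList", "dependencyConditions": ["Succeeded"] }],
      "typeProperties": {
        "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
        "activities": [
          {
            "name": "CopyToBronze",
            "type": "Copy",
            "inputs": [{
              "referenceName": "ds_sqlserver_generic",
              "type": "DatasetReference",
              "parameters": {
                "schemaName": "@item().TABLE_SCHEMA",
                "tableName": "@item().TABLE_NAME"
              }
            }],
            "outputs": [{
              "referenceName": "ds_adls_bronze",
              "type": "DatasetReference",
              "parameters": { "folderPath": "@concat('bronze/', item().TABLE_NAME)" }
            }]
          }
        ]
      }
    }
  ]
}
```

Because the source dataset takes schema and table as parameters, adding a new table on the source side requires no pipeline change — the next Lookup simply picks it up.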
Silver Layer — Cleaned and Validated
Azure Databricks notebooks process Bronze data using PySpark. At this layer: schema is enforced, data types are cast correctly, null values are handled, duplicates are removed, and basic business rules are applied. The output is a clean, trusted dataset ready for modeling.
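The real jobs run as PySpark notebooks over DataFrames; as a minimal pure-Python sketch of the same Silver-layer rules (the column names, types, and business key below are hypothetical, chosen only to illustrate the pattern):

```python
from datetime import date


def clean_rows(rows):
    """Apply Silver-layer rules to raw Bronze rows (list of dicts).

    Mirrors the PySpark job's logic: normalize column names, cast types,
    drop rows with null/malformed key fields, de-duplicate on the business key.
    """
    seen = set()
    cleaned = []
    for raw in rows:
        # Normalize column names: trim, lowercase, spaces -> underscores
        row = {k.strip().lower().replace(" ", "_"): v for k, v in raw.items()}
        # Enforce schema: customer_id must cast to int, order_date to a date
        try:
            row["customer_id"] = int(row["customer_id"])
            row["order_date"] = date.fromisoformat(row["order_date"])
        except (KeyError, TypeError, ValueError):
            continue  # rows failing validation are dropped, not silently kept
        # De-duplicate on the business key
        key = (row["customer_id"], row["order_date"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

In PySpark the same steps become `withColumnRenamed`/`cast` calls, a `dropna` on key columns, and a `dropDuplicates` on the business key, but the contract is identical: whatever leaves Silver is typed, keyed, and unique.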
Gold Layer — Business Ready
Silver data is aggregated and modeled into business-oriented fact and dimension tables in the Gold layer. Views are created on top of these tables so analysts and BI tools always query the latest, correctly modeled data without touching any upstream layers.
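A sketch of what one such Gold view might look like in Spark SQL — the table, column, and view names are illustrative, not the project's actual schema:

```sql
-- Analysts and BI tools query the view, never the Gold tables directly
CREATE OR REPLACE VIEW gold.v_sales_summary AS
SELECT
    d.calendar_month,
    c.customer_segment,
    SUM(f.net_amount)          AS total_sales,
    COUNT(DISTINCT f.order_id) AS order_count
FROM gold.fact_sales f
JOIN gold.dim_date d     ON f.date_key = d.date_key
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.calendar_month, c.customer_segment;
```

Putting views between consumers and tables means the fact/dimension model can evolve (new columns, backfills, reprocessing) without breaking dashboards.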
Key Challenges
- Dynamic pipeline design — hardcoding table names would make the pipeline brittle. Built ADF pipelines with parameterized linked services and Lookup activities to discover and ingest all tables automatically.
- Schema inconsistencies — on-premises SQL Server tables had mixed data types and inconsistent naming conventions. Built a PySpark schema normalization step at the Bronze → Silver boundary.
- Incremental loads — full extracts of large tables were slow and expensive. Implemented watermark-based incremental loading in ADF using high-watermark columns on source tables.
- Data validation across layers — silent data loss between layers was a risk. Added row count reconciliation checks between Bronze and Silver after every Databricks job.
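The reconciliation check in the last point is simple in principle. A minimal sketch — the function name and tolerance are illustrative; in practice the counts come from the Bronze and Silver DataFrames inside the Databricks job:

```python
def reconcile_counts(bronze_count: int, silver_count: int,
                     max_drop_pct: float = 1.0) -> bool:
    """Return False (fail the run) if Silver lost more rows than expected.

    Silver legitimately drops some rows (duplicates, null keys), so a small
    tolerance is allowed; a drop beyond it — or a row count that *grew*,
    which would signal accidental duplication — is treated as data loss.
    """
    if bronze_count == 0:
        return silver_count == 0
    drop_pct = (bronze_count - silver_count) / bronze_count * 100
    return 0 <= drop_pct <= max_drop_pct


# Example: 1,000,000 Bronze rows, 999,950 after cleaning -> 0.005% drop, OK
assert reconcile_counts(1_000_000, 999_950)
# A 10% drop is far outside tolerance -> the job should alert and stop
assert not reconcile_counts(1_000_000, 900_000)
```

Wiring this into the job as a hard failure (rather than a logged warning) is what turns "silent data loss" into a loud, immediate one.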
Results
- 100% Data Migrated — all on-prem tables moved
- 0 Production DB Queries — analytics fully decoupled
- 3 Data Layers Built — Bronze, Silver, Gold
- Dynamic Pipeline Design — no hardcoded table names
- Delta Storage Format — time travel + history enabled
- Fast Reporting Speed — queries hit Gold, not prod DB
✅ The Medallion Architecture gave the business something they never had before — a clear, layered data platform where raw data, clean data, and business data are always separated, always traceable, and never mixed.
Built by Jeevan Gaire · Data Engineer · 2 months