Group 8: Big Data & Analytics

Analytical platforms spanning data ingestion, transformation, storage, and interactive exploration. The focus is on large-scale batch processing, lakehouse architectures, real-time analytic surfaces, and integrated governance. The primary Azure building blocks are Synapse Analytics, Databricks, Azure Data Lake Storage, Data Factory, Azure Data Explorer, HDInsight, and emerging unified experiences such as Microsoft Fabric.

Architectural Lens: Distinguish the compute engine from the storage substrate and from the orchestration pipeline. Avoid single-tool bias: compose the minimal set of services that meets the latency and cost targets.

Services & Roles

Synapse

Unified lake + DW workspace: SQL (serverless/dedicated), Spark pools, pipelines, integration with ADLS.

Polyglot (SQL, Spark)
On-demand SQL over lake
Integrated security

Databricks

Lakehouse runtime (optimized Spark) + notebooks + Delta Lake ACID tables, MLflow integration.

Delta Lake ACID
Photon acceleration
ML & streaming

ADLS Gen2

Hierarchical namespace object storage base for lakehouse / analytics workloads.

Cheap durable storage
POSIX-style ACLs
Separation of compute

Data Factory

Managed pipelines (ETL/ELT) & dataflows for batch ingestion & transformation scheduling.

Connector breadth
Mapping & wrangling flows
Trigger management

Data Explorer

Fast ad-hoc & time-series / log analytics with Kusto engine.

Sub-second queries
Ingestion batching
Materialized views

HDInsight

Managed OSS clusters (Hadoop/Spark/Kafka). Legacy & lift-and-shift scenarios.

Full cluster control
OSS parity emphasis
Higher ops overhead

Fabric

Unified SaaS analytics (lake-centric) combining ingestion, engineering, BI, real-time & ML surfaces.

OneLake storage
Low friction SaaS
Tightly integrated BI

Key Differentiators

Dimension	Synapse	Databricks	ADLS	Data Factory	Data Explorer	HDInsight	Fabric
Primary Role	Unified analytics workspace	Lakehouse engine	Storage layer	Orchestration	Time-series / logs	OSS clusters	End-to-end SaaS
Latency Focus	Batch / interactive SQL	Batch / streaming	N/A	Scheduling	Sub-second queries	Batch	Mixed (SaaS)
Compute Elasticity	Pool scaling	Workspace clusters	External	Hosted IR/SSIS	Cluster scale-out	Cluster resize	SaaS managed
ML Integration	Basic Spark ML	MLflow / notebooks	External	External	Limited	Via Spark	Unified (notebooks & models)
Governance Depth	Workspace RBAC	Unity Catalog (if enabled)	POSIX ACL	Pipeline lineage	RBAC / policies	Cluster-level	Central SaaS governance
Streaming Strength	Event-based via Spark	Structured streaming	N/A	Triggers (batch)	Native ingestion	Kafka add-on	Real-time hub
BI Integration	Synapse SQL / Power BI connectors	External (Power BI)	External	External	Dashboards (basic)	External	Tight Power BI

Selection Model

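Rate each criterion C_x from 0 to 10. The weights in each formula sum to 1, so every service score also falls in the 0–10 range. Choose the minimal set of services that covers ingestion → storage → processing → serving. Services that score high on overlapping criteria are candidates for consolidation.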

Score_Synapse      = 0.22*C_unified + 0.18*C_sql + 0.14*C_mix + 0.14*C_security + 0.12*C_dataVol + 0.10*C_dwh + 0.10*C_integr
Score_Databricks   = 0.24*C_lakehouse + 0.18*C_stream + 0.16*C_ml + 0.14*C_notebook + 0.12*C_dataVol + 0.10*C_opsFlex + 0.06*(10 - C_sql)
Score_ADLS         = 0.30*C_dataVol + 0.20*C_cost + 0.15*C_unified + 0.15*C_open + 0.10*(10 - C_latency) + 0.10*C_security
Score_DataFctry    = 0.30*C_ingest + 0.18*C_scheduling + 0.16*C_connect + 0.14*C_lowcode + 0.12*C_batch + 0.10*(10 - C_stream)
Score_DataExplorer = 0.28*C_timeseries + 0.20*C_adHoc + 0.16*C_latency + 0.14*C_stream + 0.12*C_logVol + 0.10*(10 - C_batch)
Score_HDInsight    = 0.28*C_ossParity + 0.20*C_clusterCtrl + 0.16*C_batch + 0.14*C_lift + 0.12*C_dataVol + 0.10*(10 - C_lowops)
Score_Fabric       = 0.26*C_unified + 0.20*C_lowops + 0.16*C_bi + 0.14*C_collab + 0.12*C_semantic + 0.12*C_realTime
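
The weighted sums above can be evaluated mechanically. The following is a minimal Python sketch, not the page's own implementation: the weights are copied from the formulas, while the names FORMULAS, score_services, and recommend_anchor are hypothetical. Two choices are assumptions rather than anything the source states: an unrated criterion counts as 0, and the primary anchor is simply the highest-scoring service.

```python
from typing import Dict, List, Tuple

# (criterion name, weight, inverted). "inverted" marks a (10 - C_x) term.
Term = Tuple[str, float, bool]

# Weights copied from the scoring formulas; the dictionary layout is illustrative.
FORMULAS: Dict[str, List[Term]] = {
    "Synapse": [
        ("unified", 0.22, False), ("sql", 0.18, False), ("mix", 0.14, False),
        ("security", 0.14, False), ("dataVol", 0.12, False),
        ("dwh", 0.10, False), ("integr", 0.10, False),
    ],
    "Databricks": [
        ("lakehouse", 0.24, False), ("stream", 0.18, False), ("ml", 0.16, False),
        ("notebook", 0.14, False), ("dataVol", 0.12, False),
        ("opsFlex", 0.10, False), ("sql", 0.06, True),
    ],
    "ADLS": [
        ("dataVol", 0.30, False), ("cost", 0.20, False), ("unified", 0.15, False),
        ("open", 0.15, False), ("latency", 0.10, True), ("security", 0.10, False),
    ],
    "Data Factory": [
        ("ingest", 0.30, False), ("scheduling", 0.18, False), ("connect", 0.16, False),
        ("lowcode", 0.14, False), ("batch", 0.12, False), ("stream", 0.10, True),
    ],
    "Data Explorer": [
        ("timeseries", 0.28, False), ("adHoc", 0.20, False), ("latency", 0.16, False),
        ("stream", 0.14, False), ("logVol", 0.12, False), ("batch", 0.10, True),
    ],
    "HDInsight": [
        ("ossParity", 0.28, False), ("clusterCtrl", 0.20, False), ("batch", 0.16, False),
        ("lift", 0.14, False), ("dataVol", 0.12, False), ("lowops", 0.10, True),
    ],
    "Fabric": [
        ("unified", 0.26, False), ("lowops", 0.20, False), ("bi", 0.16, False),
        ("collab", 0.14, False), ("semantic", 0.12, False), ("realTime", 0.12, False),
    ],
}


def score_services(criteria: Dict[str, float]) -> Dict[str, float]:
    """Return a 0-10 score per service from 0-10 criterion ratings."""
    scores: Dict[str, float] = {}
    for service, terms in FORMULAS.items():
        total = 0.0
        for name, weight, inverted in terms:
            value = criteria.get(name, 0.0)  # assumption: unrated criteria count as 0
            total += weight * ((10.0 - value) if inverted else value)
        scores[service] = round(total, 2)
    return scores


def recommend_anchor(scores: Dict[str, float]) -> str:
    """Assumption: the recommended anchor is the service with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    ratings = {"lakehouse": 9, "stream": 8, "ml": 8, "notebook": 7, "dataVol": 6}
    scores = score_services(ratings)
    print(scores, recommend_anchor(scores))
```

Keeping the weights in a table rather than in seven separate functions makes it easy to check that each row sums to 1 and to adjust a weight without touching the scoring logic.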

Heuristics

If C_unified & C_bi high ⇒ Fabric or Synapse (decide by need for SaaS vs deeper Spark / ML integration).
If C_lakehouse + C_stream + C_ml high ⇒ Databricks anchor.
If C_timeseries + C_latency high ⇒ Data Explorer as an adjunct, not a replacement for lake storage.
If C_ingest + C_connect (connector breadth) high but transformation complexity is minimal ⇒ Data Factory.
If C_ossParity + C_clusterCtrl (cluster control) high ⇒ HDInsight (evaluate modernization path).
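
The heuristics above can also be written as simple rules. This is a rough sketch under stated assumptions, not a definitive rule engine: "high" is taken to mean each named criterion is rated at least 7 out of 10, "minimal transformation complexity" is modelled with a hypothetical `transform` rating of at most 3 (the source defines no criterion for it), and the function name suggest is illustrative.

```python
from typing import Dict, List

HIGH = 7.0  # assumption: "high" means a rating of at least 7 out of 10
LOW = 3.0   # assumption: "minimal" means a rating of at most 3 out of 10


def suggest(criteria: Dict[str, float]) -> List[str]:
    """Return the heuristic suggestions triggered by the given 0-10 ratings."""

    def high(*names: str) -> bool:
        # Every named criterion must be rated high; missing ratings count as 0.
        return all(criteria.get(name, 0.0) >= HIGH for name in names)

    suggestions: List[str] = []
    if high("unified", "bi"):
        suggestions.append("Fabric or Synapse")
    if high("lakehouse", "stream", "ml"):
        suggestions.append("Databricks anchor")
    if high("timeseries", "latency"):
        suggestions.append("Data Explorer adjunct")
    # "transform" is a hypothetical criterion for transformation complexity.
    if high("ingest", "connect") and criteria.get("transform", 0.0) <= LOW:
        suggestions.append("Data Factory")
    if high("ossParity", "clusterCtrl"):
        suggestions.append("HDInsight (evaluate modernization path)")
    return suggestions


if __name__ == "__main__":
    print(suggest({"timeseries": 9, "latency": 8, "ingest": 8, "connect": 7}))
```

Several rules can fire at once, which matches the guidance to compose a minimal set of services rather than pick a single tool.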

Anti-Patterns

Duplicating Spark workloads across Synapse + Databricks without clear boundary.
Using HDInsight for greenfield when managed alternatives cover needs.
Relying solely on Data Explorer for long-term cold storage analytics (export to ADLS for tiering).

Summary

Establish a clear storage substrate (ADLS / OneLake), pick an engine anchor (Databricks, Synapse, or Fabric), add specialized surfaces (Data Explorer) only when query latency or time-series shape demands it, and orchestrate ingest with Data Factory unless streaming or code-first pipelines dominate.