Group 8: Big Data & Analytics
Analytical platforms spanning data ingestion, transformation, storage, and interactive exploration. The focus is on large-scale batch, lakehouse, and real-time analytic surfaces, plus integrated governance. Primary Azure building blocks: Synapse Analytics, Databricks, Azure Data Lake Storage (ADLS), Data Factory, Azure Data Explorer, HDInsight, and emerging unified experiences such as Microsoft Fabric.
Services & Roles
Synapse
- Polyglot (SQL, Spark)
- On-demand SQL over lake
- Integrated security
Databricks
- Delta Lake ACID
- Photon acceleration
- ML & streaming
ADLS Gen2
- Cheap durable storage
- POSIX-like ACLs (hierarchical namespace)
- Separation of compute
Data Factory
- Connector breadth
- Mapping & wrangling flows
- Trigger management
Data Explorer
- Sub-second queries
- Ingestion batching
- Materialized views
HDInsight
- Full cluster control
- OSS parity emphasis
- Higher ops overhead
Fabric
- OneLake storage
- Low friction SaaS
- Tightly integrated BI
Key Differentiators
Dimension | Synapse | Databricks | ADLS | Data Factory | Data Explorer | HDInsight | Fabric |
---|---|---|---|---|---|---|---|
Primary Role | Unified analytics workspace | Lakehouse engine | Storage layer | Orchestration | Time-series / logs | OSS clusters | End-to-end SaaS |
Latency Focus | Batch / interactive SQL | Batch / streaming | N/A | Scheduling | Sub-second queries | Batch | Mixed (SaaS) |
Compute Elasticity | Pool scaling | Workspace clusters | External | Hosted IR/SSIS | Cluster scale-out | Cluster resize | SaaS managed |
ML Integration | Basic Spark ML | MLflow / notebooks | External | External | Limited | Via Spark | Unified (notebooks & models) |
Governance Depth | Workspace RBAC | Unity Catalog (if enabled) | POSIX ACL | Pipeline lineage | RBAC / policies | Cluster-level | Central SaaS governance |
Streaming Strength | Event-based via Spark | Structured streaming | N/A | Triggers (batch) | Native ingestion | Kafka add-on | Real-time hub |
BI Integration | Synapse SQL / Power BI connectors | External (Power BI) | External | External | Dashboards (basic) | External | Tight Power BI |
Selection Model
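Score each criterion (the C_* values used in the heuristics that follow) from 0 to 10. Choose the minimal set of services that covers ingestion → storage → processing → serving. Overlap between the chosen services indicates potential consolidation.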
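The coverage and overlap check can be sketched in a few lines of Python. This is a minimal illustration, not a definitive model: the `COVERAGE` mapping is an assumption loosely read from the "Primary Role" row of the table above, and the names `STAGES`, `COVERAGE`, and `assess` are hypothetical.

```python
from collections import defaultdict

STAGES = ("ingestion", "storage", "processing", "serving")

# Assumed stage coverage per service (illustrative only); adjust it to
# reflect how each service is actually used in your platform.
COVERAGE = {
    "Data Factory": {"ingestion"},
    "ADLS": {"storage"},
    "Databricks": {"processing"},
    "Synapse": {"processing", "serving"},
    "Data Explorer": {"serving"},
    "HDInsight": {"processing"},
    "Fabric": {"ingestion", "storage", "processing", "serving"},
}


def assess(services):
    """Return (missing_stages, overlaps) for a candidate set of services."""
    by_stage = defaultdict(list)
    for name in services:
        for stage in COVERAGE[name]:
            by_stage[stage].append(name)
    # Stages that no selected service covers.
    missing = [s for s in STAGES if s not in by_stage]
    # Stages covered by more than one service: consolidation candidates.
    overlaps = {s: sorted(by_stage[s]) for s in STAGES if len(by_stage[s]) > 1}
    return missing, overlaps


missing, overlaps = assess(["Data Factory", "ADLS", "Databricks", "Synapse"])
print(missing)   # []
print(overlaps)  # {'processing': ['Databricks', 'Synapse']}
```

An empty `missing` list means every stage is covered. Each entry in `overlaps` is a prompt to justify keeping both services or to consolidate.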
Heuristics
- If C_unified & C_bi high ⇒ Fabric or Synapse (decide by need for SaaS vs deeper Spark / ML integration).
- If C_lakehouse + C_stream + C_ml high ⇒ Databricks anchor.
- If C_timeseries + C_latency high ⇒ Data Explorer as an adjunct, not a replacement for lake storage.
- If C_ingest + connector breadth high but transformation complexity is minimal ⇒ Data Factory.
- If C_ossParity + cluster control high ⇒ HDInsight (evaluate modernization path).
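The heuristics above can be expressed as a small rule function. This is a sketch under stated assumptions: the source does not define "high" or "minimal", so the thresholds `HIGH = 7` and `LOW = 3` are assumed, "A + B high" is read as every listed criterion being high, and the keys `C_connectors`, `C_transform`, and `C_clusterControl` are hypothetical names for connector breadth, transformation complexity, and cluster control.

```python
HIGH = 7  # assumed cut-off for "high" on the 0-10 scale
LOW = 3   # assumed cut-off for "minimal"


def all_high(scores, *keys):
    """True when every named criterion scores at or above HIGH (missing = 0)."""
    return all(scores.get(k, 0) >= HIGH for k in keys)


def recommend(scores):
    """Map criterion scores (0-10) to candidate services, one rule per heuristic."""
    picks = []
    if all_high(scores, "C_unified", "C_bi"):
        # SaaS preference vs deeper Spark / ML integration decides between the two.
        picks.append("Fabric or Synapse")
    if all_high(scores, "C_lakehouse", "C_stream", "C_ml"):
        picks.append("Databricks")
    if all_high(scores, "C_timeseries", "C_latency"):
        picks.append("Data Explorer (adjunct)")
    # C_connectors and C_transform are hypothetical criterion names.
    if all_high(scores, "C_ingest", "C_connectors") and scores.get("C_transform", 0) <= LOW:
        picks.append("Data Factory")
    # C_clusterControl is a hypothetical criterion name.
    if all_high(scores, "C_ossParity", "C_clusterControl"):
        picks.append("HDInsight")
    return picks


scores = {"C_lakehouse": 9, "C_stream": 8, "C_ml": 7,
          "C_timeseries": 8, "C_latency": 9, "C_bi": 4}
print(recommend(scores))  # ['Databricks', 'Data Explorer (adjunct)']
```

The rules are independent, so several services can be recommended at once. The result is a shortlist to review, not a final architecture.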
Anti-Patterns
- Duplicating Spark workloads across Synapse + Databricks without clear boundary.
- Using HDInsight for greenfield when managed alternatives cover needs.
- Relying solely on Data Explorer for long-term cold storage analytics (export to ADLS for tiering).
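These anti-patterns can also be checked mechanically against a proposed service list. The sketch below is illustrative: the function name `find_anti_patterns` and the flags `greenfield` and `spark_boundary_defined` are hypothetical, and Fabric is assumed to count as lake storage because it brings OneLake.

```python
# Services assumed to provide a lake tier (Fabric via OneLake).
LAKE_STORES = {"ADLS", "Fabric"}


def find_anti_patterns(services, greenfield=False, spark_boundary_defined=False):
    """Return a list of warnings for a proposed set of services."""
    services = set(services)
    findings = []
    if {"Synapse", "Databricks"} <= services and not spark_boundary_defined:
        findings.append("Spark workloads may be duplicated across Synapse and Databricks")
    if "HDInsight" in services and greenfield:
        findings.append("HDInsight on greenfield: check managed alternatives first")
    if "Data Explorer" in services and not (services & LAKE_STORES):
        findings.append("Data Explorer without a lake tier: export cold data to ADLS")
    return findings


for warning in find_anti_patterns(["Synapse", "Databricks", "Data Explorer"]):
    print(warning)
```

An empty result does not prove the design is sound. It only means none of the three listed anti-patterns was detected.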
Summary
Establish a clear storage substrate (ADLS / OneLake), pick an engine anchor (Databricks, Synapse, or Fabric), add specialized surfaces (Data Explorer) only when query latency or time-series shape demands it, and orchestrate ingest with Data Factory unless streaming or code-first pipelines dominate.