Group 8: Big Data & Analytics

Analytical platforms spanning data ingestion, transformation, storage, and interactive exploration. The focus is on large-scale batch processing, lakehouse architectures, real-time analytic surfaces, and integrated governance. The primary Azure building blocks are Synapse Analytics, Databricks, Azure Data Lake Storage (ADLS), Data Factory, Azure Data Explorer, HDInsight, and the newer unified experience, Microsoft Fabric.

Architectural Lens: Distinguish the compute engine from the storage substrate and from the orchestration pipeline. Avoid single-tool bias: compose the minimal set of services that meets the latency and cost targets.

Services & Roles

Synapse

Unified lake + DW workspace: SQL (serverless/dedicated), Spark pools, pipelines, integration with ADLS.
  • Polyglot (SQL, Spark)
  • On-demand SQL over lake
  • Integrated security

Databricks

Lakehouse runtime (optimized Spark) + notebooks + Delta Lake ACID tables, MLflow integration.
  • Delta Lake ACID
  • Photon acceleration
  • ML & streaming

ADLS Gen2

Hierarchical namespace object storage base for lakehouse / analytics workloads.
  • Cheap durable storage
  • POSIX-style ACLs and directory semantics
  • Separation of compute

Data Factory

Managed pipelines (ETL/ELT) & dataflows for batch ingestion & transformation scheduling.
  • Connector breadth
  • Mapping & wrangling flows
  • Trigger management

Data Explorer

Fast ad-hoc & time-series / log analytics with Kusto engine.
  • Sub-second queries
  • Ingestion batching
  • Materialized views

HDInsight

Managed OSS clusters (Hadoop/Spark/Kafka). Suited to legacy and lift-and-shift scenarios.
  • Full cluster control
  • OSS parity emphasis
  • Higher ops overhead

Fabric

Unified SaaS analytics (lake-centric) combining ingestion, engineering, BI, real-time & ML surfaces.
  • OneLake storage
  • Low friction SaaS
  • Tightly integrated BI

Key Differentiators

| Dimension | Synapse | Databricks | ADLS | Data Factory | Data Explorer | HDInsight | Fabric |
|---|---|---|---|---|---|---|---|
| Primary Role | Unified analytics workspace | Lakehouse engine | Storage layer | Orchestration | Time-series / logs | OSS clusters | End-to-end SaaS |
| Latency Focus | Batch / interactive SQL | Batch / streaming | N/A | Scheduling | Sub-second queries | Batch | Mixed (SaaS) |
| Compute Elasticity | Pool scaling | Workspace clusters | External | Hosted IR/SSIS | Cluster scale-out | Cluster resize | SaaS managed |
| ML Integration | Basic Spark ML | MLflow / notebooks | External | External | Limited | Via Spark | Unified (notebooks & models) |
| Governance Depth | Workspace RBAC | Unity Catalog (if enabled) | POSIX ACL | Pipeline lineage | RBAC / policies | Cluster-level | Central SaaS governance |
| Streaming Strength | Event-based via Spark | Structured streaming | N/A | Triggers (batch) | Native ingestion | Kafka add-on | Real-time hub |
| BI Integration | Synapse SQL / Power BI connectors | External (Power BI) | External | External | Dashboards (basic) | External | Tight Power BI |

Selection Model

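Score each criterion C_x from 0 to 10. Each service score is a weighted sum of criteria whose weights total 1.00, so service scores also fall in the 0–10 range; a term of the form (10 - C_x) lowers a service's score as criterion x grows. Choose the minimal set of services that delivers ingestion → storage → processing → serving. Overlap between chosen services indicates potential consolidation.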

Score_Synapse      = 0.22*C_unified + 0.18*C_sql + 0.14*C_mix + 0.14*C_security + 0.12*C_dataVol + 0.10*C_dwh + 0.10*C_integr
Score_Databricks   = 0.24*C_lakehouse + 0.18*C_stream + 0.16*C_ml + 0.14*C_notebook + 0.12*C_dataVol + 0.10*C_opsFlex + 0.06*(10 - C_sql)
Score_ADLS         = 0.30*C_dataVol + 0.20*C_cost + 0.15*C_unified + 0.15*C_open + 0.10*(10 - C_latency) + 0.10*C_security
Score_DataFctry    = 0.30*C_ingest + 0.18*C_scheduling + 0.16*C_connect + 0.14*C_lowcode + 0.12*C_batch + 0.10*(10 - C_stream)
Score_DataExplorer = 0.28*C_timeseries + 0.20*C_adHoc + 0.16*C_latency + 0.14*C_stream + 0.12*C_logVol + 0.10*(10 - C_batch)
Score_HDInsight    = 0.28*C_ossParity + 0.20*C_clusterCtrl + 0.16*C_batch + 0.14*C_lift + 0.12*C_dataVol + 0.10*(10 - C_lowops)
Score_Fabric       = 0.26*C_unified + 0.20*C_lowops + 0.16*C_bi + 0.14*C_collab + 0.12*C_semantic + 0.12*C_realTime
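The scoring formulas above can be evaluated with a few lines of Python. The following is a minimal sketch, not the page's own calculator: the weights are copied from the formulas, while the function names, the "!" prefix used to mark inverted (10 - C_x) terms, the default of 0 for any missing criterion, and the choice of the highest score as the recommended anchor are all assumptions of this sketch.

```python
from typing import Dict

# Weights copied from the formulas above. A key prefixed with "!" marks an
# inverted term, i.e. (10 - C_x); the prefix is a convention of this sketch.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "Synapse": {"unified": 0.22, "sql": 0.18, "mix": 0.14, "security": 0.14,
                "dataVol": 0.12, "dwh": 0.10, "integr": 0.10},
    "Databricks": {"lakehouse": 0.24, "stream": 0.18, "ml": 0.16,
                   "notebook": 0.14, "dataVol": 0.12, "opsFlex": 0.10,
                   "!sql": 0.06},
    "ADLS": {"dataVol": 0.30, "cost": 0.20, "unified": 0.15, "open": 0.15,
             "!latency": 0.10, "security": 0.10},
    "Data Factory": {"ingest": 0.30, "scheduling": 0.18, "connect": 0.16,
                     "lowcode": 0.14, "batch": 0.12, "!stream": 0.10},
    "Data Explorer": {"timeseries": 0.28, "adHoc": 0.20, "latency": 0.16,
                      "stream": 0.14, "logVol": 0.12, "!batch": 0.10},
    "HDInsight": {"ossParity": 0.28, "clusterCtrl": 0.20, "batch": 0.16,
                  "lift": 0.14, "dataVol": 0.12, "!lowops": 0.10},
    "Fabric": {"unified": 0.26, "lowops": 0.20, "bi": 0.16, "collab": 0.14,
               "semantic": 0.12, "realTime": 0.12},
}


def score_services(criteria: Dict[str, float]) -> Dict[str, float]:
    """Return the weighted score per service for criteria rated 0-10.

    Criteria missing from the input are treated as 0 (an assumption).
    """
    scores: Dict[str, float] = {}
    for service, weights in WEIGHTS.items():
        total = 0.0
        for key, weight in weights.items():
            inverted = key.startswith("!")
            value = criteria.get(key.lstrip("!"), 0.0)
            total += weight * ((10 - value) if inverted else value)
        scores[service] = round(total, 2)
    return scores


def recommend(scores: Dict[str, float]) -> str:
    """Pick the highest-scoring service (assumed rule for the anchor)."""
    return max(scores, key=scores.get)
```

Calling `score_services({"timeseries": 10, "adHoc": 10, "latency": 10, "stream": 10, "logVol": 10})` and passing the result to `recommend` selects Data Explorer, because every other formula depends mostly on criteria left at 0. Keeping the weights in one table makes it easy to check that each row still sums to 1.00 after tuning.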

Heuristics

  • If C_unified & C_bi high ⇒ Fabric or Synapse (decide by need for SaaS vs deeper Spark / ML integration).
  • If C_lakehouse + C_stream + C_ml high ⇒ Databricks anchor.
  • If C_timeseries + C_latency high ⇒ Data Explorer as an adjunct, not a replacement for lake storage.
  • If C_ingest + C_connect (connector breadth) high but transformation complexity is minimal ⇒ Data Factory.
  • If C_ossParity + cluster control high ⇒ HDInsight (evaluate modernization path).
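The heuristics above translate naturally into a small rule table. This is a hedged sketch: the text never defines "high", so the threshold of 7 is an assumption, as are the function name `suggest` and the mapping of "connector breadth" and "cluster control" to the C_connect and C_clusterCtrl criteria. Transformation complexity has no criterion in the scoring model, so the Data Factory rule does not check it here.

```python
from typing import Dict, List

HIGH = 7.0  # assumed cut-off for "high"; the heuristics do not define one


def suggest(criteria: Dict[str, float], high: float = HIGH) -> List[str]:
    """Return the heuristic suggestions triggered by 0-10 criterion ratings."""

    def is_high(*names: str) -> bool:
        # A rule fires only when every named criterion meets the threshold;
        # missing criteria count as 0.
        return all(criteria.get(name, 0.0) >= high for name in names)

    suggestions: List[str] = []
    if is_high("unified", "bi"):
        suggestions.append("Fabric or Synapse")
    if is_high("lakehouse", "stream", "ml"):
        suggestions.append("Databricks anchor")
    if is_high("timeseries", "latency"):
        suggestions.append("Data Explorer adjunct")
    if is_high("ingest", "connect"):
        suggestions.append("Data Factory")
    if is_high("ossParity", "clusterCtrl"):
        suggestions.append("HDInsight (evaluate modernization path)")
    return suggestions
```

Several rules can fire at once, which matches the intent of composing a small set of services rather than choosing a single winner; for instance, high C_unified, C_bi, C_timeseries, and C_latency together yield both an anchor suggestion and the Data Explorer adjunct.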

Anti-Patterns

  • Duplicating Spark workloads across Synapse + Databricks without clear boundary.
  • Using HDInsight for greenfield when managed alternatives cover needs.
  • Relying solely on Data Explorer for long-term cold storage analytics (export to ADLS for tiering).
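The first anti-pattern, and the selection model's note that overlap indicates potential consolidation, can be checked mechanically for a proposed stack. The sketch below assumes a hypothetical mapping from each service to the pipeline stages it covers; that mapping is illustrative only, loosely based on the Primary Role row of the comparison table, and should be adjusted to how each service is actually used. The function name `review_stack` is likewise invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

STAGES = ("ingestion", "storage", "processing", "serving")

# Hypothetical stage coverage per service (an assumption of this sketch).
SERVICE_STAGES: Dict[str, Set[str]] = {
    "Synapse": {"processing", "serving"},
    "Databricks": {"processing"},
    "ADLS": {"storage"},
    "Data Factory": {"ingestion"},
    "Data Explorer": {"serving"},
    "HDInsight": {"processing"},
    "Fabric": {"ingestion", "storage", "processing", "serving"},
}


def review_stack(
    services: Iterable[str],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (missing stages, stages covered by more than one service)."""
    by_stage: Dict[str, List[str]] = defaultdict(list)
    for service in services:
        for stage in SERVICE_STAGES[service]:
            by_stage[stage].append(service)
    missing = [stage for stage in STAGES if stage not in by_stage]
    overlaps = {
        stage: sorted(by_stage[stage])
        for stage in STAGES
        if len(by_stage[stage]) > 1
    }
    return missing, overlaps
```

For a stack of ADLS, Data Factory, Synapse, and Databricks, this reports no missing stage and a processing overlap between Databricks and Synapse. An overlap is not automatically wrong, but it is the place to define a clear workload boundary or consolidate.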

Summary

Establish a clear storage substrate (ADLS / OneLake) and pick an engine anchor (Databricks, Synapse, or Fabric). Add specialized surfaces such as Data Explorer only when query latency or time-series shape demands it, and orchestrate ingestion with Data Factory unless streaming or code-first pipelines dominate.