L6. Big Data and Analytics: BigQuery, Pub/Sub, Dataflow, and Looker

Google pioneered big data analytics at very large scale and now offers the resulting tools as managed services to all Google Cloud customers. The Digital Leader exam covers BigQuery, Pub/Sub, Dataflow, and Looker in the context of data-driven decision making.

BigQuery

BigQuery is Google Cloud's fully managed, serverless, petabyte-scale data warehouse and analytics engine. Key characteristics:

  • Serverless: no infrastructure to provision or manage
  • Separates storage and compute for independent scaling
  • Built-in ML: BigQuery ML lets you create and train ML models using SQL
  • Real-time analytics with streaming inserts
  • Query pricing: on-demand (billed by the amount of data each query scans) or capacity-based (reserved slots, previously sold as flat-rate)
  • Data sharing: BigQuery datasets can be shared across projects and organizations

Architecture: BigQuery's query engine, Dremel, combines columnar storage with a tree-structured execution model, which is what makes SQL queries across petabytes of data fast.
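To make the link between columnar storage and on-demand pricing concrete, here is a minimal sketch in plain Python. It is a toy model, not the BigQuery API: the column names, the column sizes, and the `price_per_tib` parameter are all hypothetical, and the real rate should be taken from the current pricing page.

```python
# Toy model of columnar scanning. This is NOT the BigQuery API.
# Column names and stored sizes below are hypothetical.
TABLE_COLUMN_BYTES = {
    "order_id": 8 * 10**9,
    "customer_id": 8 * 10**9,
    "amount": 8 * 10**9,
    "notes": 200 * 10**9,
}


def bytes_scanned(selected_columns):
    """Sum the stored size of only the columns a query references."""
    return sum(TABLE_COLUMN_BYTES[name] for name in selected_columns)


def on_demand_cost(scanned_bytes, price_per_tib):
    """Estimate query cost; price_per_tib is a caller-supplied assumption."""
    return scanned_bytes / 2**40 * price_per_tib


# A query that references two columns reads far less than one that reads all.
narrow = bytes_scanned(["customer_id", "amount"])
full = bytes_scanned(TABLE_COLUMN_BYTES)  # iterating a dict yields its keys
```

Under these assumptions, selecting only the columns a report needs reduces the data scanned, and with it the on-demand charge.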

Pub/Sub

Cloud Pub/Sub is a fully managed real-time messaging service for decoupling services that produce messages from services that process them. How it works:

  • Publishers send messages to a topic
  • Subscribers receive messages through subscriptions on that topic
  • Messages are durably stored until each subscription acknowledges them or the retention period expires

Use for: event ingestion (IoT sensor data, clickstreams), data pipeline triggers, asynchronous communication between microservices.
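The publish and subscribe flow can be sketched with a small in-memory model. This is only an illustration of the decoupling idea, not the Pub/Sub client library: `ToyTopic`, its methods, and the subscription names are hypothetical, and real Pub/Sub adds durability, retries, and retention limits that are not modeled here.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic; not the Pub/Sub client library."""

    def __init__(self):
        # subscription name -> {message_id: payload} awaiting acknowledgement
        self._pending = defaultdict(dict)
        self._next_id = 0

    def subscribe(self, subscription):
        """Register a subscription so it receives later messages."""
        self._pending[subscription]  # touching the key creates the backlog

    def publish(self, payload):
        """Store a copy of the message for every registered subscription."""
        message_id = self._next_id
        self._next_id += 1
        for backlog in self._pending.values():
            backlog[message_id] = payload
        return message_id

    def pull(self, subscription):
        """Return unacknowledged messages as (message_id, payload) pairs."""
        return list(self._pending[subscription].items())

    def ack(self, subscription, message_id):
        """Remove an acknowledged message so it is not delivered again."""
        self._pending[subscription].pop(message_id, None)


topic = ToyTopic()
topic.subscribe("billing")
topic.subscribe("analytics")
msg_id = topic.publish({"sensor": "t-1", "celsius": 21.5})

# Each subscription tracks its own backlog, so one consumer acknowledging
# a message does not affect the other.
topic.ack("billing", msg_id)
billing_backlog = topic.pull("billing")
analytics_backlog = topic.pull("analytics")
```

The publisher never references either consumer directly, which is the decoupling the service is designed to provide.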

Dataflow

Cloud Dataflow is a fully managed service for streaming and batch data processing pipelines, built on Apache Beam.

Use for: transforming, cleaning, and enriching data before loading it into BigQuery or other storage.

Common pattern: Pub/Sub (ingest) → Dataflow (transform) → BigQuery (analyze).
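As a rough illustration of the transform, clean, and enrich steps, the sketch below chains plain Python generators. It deliberately does not use the Apache Beam SDK, so it shows the shape of such a pipeline rather than how Dataflow runs one; the `sensor_id,celsius` record format and the function names are assumptions made for the example.

```python
def parse(lines):
    """Split 'sensor_id,celsius' text lines into raw string fields."""
    for line in lines:
        sensor_id, _, reading = line.partition(",")
        yield sensor_id.strip(), reading.strip()


def clean(records):
    """Drop records with a missing sensor id or a non-numeric reading."""
    for sensor_id, reading in records:
        try:
            value = float(reading)
        except ValueError:
            continue  # skip readings that cannot be parsed as numbers
        if sensor_id:
            yield {"sensor_id": sensor_id, "celsius": value}


def enrich(records):
    """Add a derived Fahrenheit field to each record."""
    for record in records:
        yield {**record, "fahrenheit": record["celsius"] * 9 / 5 + 32}


# Hypothetical input: two valid events and two that should be filtered out.
raw_events = ["t-1, 20.0", "t-2, not-a-number", ", 18.5", "t-3, 100"]
rows = list(enrich(clean(parse(raw_events))))
```

Because each stage consumes records one at a time, the same chain works whether the input is a finite batch or an ongoing stream, which mirrors the unified batch and streaming idea behind Apache Beam.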

Looker

Looker is Google Cloud's business intelligence (BI) and data visualization platform. Key capabilities:

  • Drag-and-drop data exploration and dashboards
  • Embedded analytics in applications
  • LookML: Looker's modeling language for defining metrics

Dataproc

Cloud Dataproc is a managed Spark and Hadoop service for running existing Apache workloads. Use when: migrating existing Hadoop/Spark workloads to the cloud.

Analytics Pipeline

A typical Google Cloud analytics architecture:

Data Sources → Pub/Sub (ingest) → Dataflow (transform) → BigQuery (warehouse) → Looker (visualize)

Service    Purpose
BigQuery   Serverless petabyte-scale data warehouse
Pub/Sub    Real-time message ingestion and decoupling
Dataflow   Stream and batch data processing (Apache Beam)
Looker     Business intelligence and dashboards
Dataproc   Managed Spark and Hadoop

Exam tip: BigQuery = the analytics layer. Pub/Sub = message queue for real-time ingestion. Dataflow = transformation pipeline. Looker = visualization.

Exam Focus Points
  • BigQuery is a serverless petabyte-scale data warehouse; storage and compute scale independently
  • BigQuery ML lets you train machine learning models using SQL directly in BigQuery
  • Pub/Sub is a fully managed messaging service for real-time event ingestion and decoupling
  • Dataflow runs Apache Beam pipelines for streaming and batch data transformation
  • Common pipeline: Pub/Sub (ingest) → Dataflow (transform) → BigQuery (analyze) → Looker (visualize)

Knowledge Check

1. A retail company wants to analyze 500 TB of historical sales data using SQL without managing any server infrastructure. Which Google Cloud service is most appropriate?

2. An IoT application generates millions of sensor events per second that must be ingested and processed in real time. Which services form the recommended Google Cloud pipeline?
