Building Context, Not Vibes

Practical AI Augmented Data Engineering — Part 1

Scott Bell
Mihaly Kavasi

2026-04-25


About Scott

Grid of 30+ Azure, Databricks, and GitHub certification badges

Scott’s industry certifications

30+ Industry Certifications

  • Principal Data & AI Consultant at RapidData
  • Former Avanade Databricks SME & Altius Consultant
  • MSc Computer Science – Secure Machine Learning
  • Interests: Data Platforms, Intelligent Apps, AI Security, Architecture & Design Patterns
  • Passionate about Beer & Rugby League

My Projects

rapiddata.dev – Discover Cloud Based Services

Transforming businesses through AI and data solutions

databricks.news – Unofficially Essential

The must-read Databricks newsletter scanning hundreds of sources weekly

dailydatabricks.tips – Tips, Tricks & Hacks

Small actionable pieces of information. Document the Undocumented. D-R-Y IRL

myyearindata.com – Data Engineering & AI Insights

Practical perspectives on building intelligent data systems

About Mihaly

Mihaly Kavasi headshot

Microsoft Certified Trainer · Fast Track Power BI Solution Architect

  • Founder of One Day BI – Microsoft analytics consultancy
  • Helps customers define optimal governance and implement the right mix of governed self-service BI
  • Advises on security, performance optimization and managing large-scale Power BI deployments
  • Nurtures the next generation of analysts with an emphasis on user needs and UX
  • Microsoft Certified Trainer since 2018
  • Fast Track Recognized Solution Architect for Power BI since 2021
  • Shares practical patterns for data transformation at selfservicebi.co.uk

Intro and History

10 minutes

Show of hands — are you using AI for work?

🔥

Every day

Part of my workflow

👍

Sometimes

A few times a week

🧪

Dabbling

Trying things out

🚫

Not yet

Here to find out why I should

No wrong answer — just want to know the room.

And which tool?

🐙

GitHub Copilot

IDE autocomplete, Chat

🖱️

Cursor

AI-first editor

🤖

Claude Code

Agentic CLI + harness

🧵

Fabric Data Agents

NL over OneLake data

🧞

Databricks Genie

NL over Lakehouse data

Something else

Aider · Continue · ChatGPT · …

Raise a hand for each one you’ve used this month.

Which camp are you in?

“AI is rubbish.”

Hallucinates. Unreliable. Makes things up.

vs

“AI is going to take our jobs.”

Knows the codebase. Writes tests. Fixes its own bugs.

Show of hands — which camp are you in?

Both are wrong.

And both are right.

It depends on how you work with it.

A lens we’ll borrow

Book cover of Thinking, Fast and Slow by Daniel Kahneman

Thinking, Fast and Slow — Daniel Kahneman

Thinking, Fast and Slow

Daniel Kahneman · 2011 · Nobel laureate, 2002

Kahneman describes two modes of thought that coexist in every human brain:

System 1 — fast, instinctive, emotional. The brain on auto-pilot.

System 2 — slow, deliberate, logical. Hard work. You feel it.

We’re about to borrow that lens — and apply it to LLMs.

What is an LLM? — The Human Perspective

Two modes of thought — applied to humans

⚡ System 1 - Pattern Matching, Auto-pilot

Fast. Instinctive. Emotional.

The brain on auto-pilot.

Walking. Driving. Shopping. Most of life.

⚙ System 2 - Logical Reasoning, Creativity

Slow. Deliberate. Logical.

Learning something new. Solving a non-trivial problem.

When you use System 2, you feel it. It’s hard. It’s tiring.

LLMs are not thinking machines.

They are pattern-matching algorithms.

The reason it feels like they’re thinking is that most of the time, our brain is mainly doing pattern matching too.

System 1 is pattern matching. So is next-token prediction.

From Prompts to Intent

Prompt Engineering

“Write better questions.” Few-shot, chain-of-thought, role prompts, jailbreaks.

Vibe Coding

“Describe the vibe, ship the code.” Natural-language features, iterate by feel, skim the diff.

Context Engineering

Give the model the information it needs: schemas, glossaries, retrieval, examples, memory.

Intent Engineering

Describe the outcome, constraints, and what “done” looks like. Stop the model gaming the metric.

Diagram showing the human layer of context engineering

Credit: HumanLayer's 12-Factor Agents

Reward Hacking — Prompt vs Context vs Intent

Prompt

“Write a SQL query for revenue.”

Which table? Which period? Gross or net?

Model picks the easiest interpretation. Nobody knows if it’s right.

Context

“Using sales.fact_revenue joined to dim_customer and fy_calendar, write a query for revenue by region.”

Model knows where to look. “Revenue” is still ambiguous — voided orders? returns? deferred?

Query runs. Still can’t tell if the number matches finance.

Intent

“Return YTD net revenue by billing region for FY2026. Exclude voided orders & returns. Total must reconcile to the CFO board pack (£48.2M ±0.5%). Round to whole £. Fail loudly if reconciliation breaks.”

Success is measurable. Acceptance criteria are verifiable. No room to hallucinate.
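The "fail loudly" clause is what turns intent into a check the agent cannot game. A minimal sketch in Python (the £48.2M target and ±0.5% tolerance are the figures from the prompt; the function name is ours):

```python
def reconcile(total_gbp: float, target_gbp: float = 48_200_000,
              tolerance: float = 0.005) -> None:
    """Fail loudly if the computed total drifts from the board-pack figure."""
    drift = abs(total_gbp - target_gbp) / target_gbp
    if drift > tolerance:
        raise ValueError(
            f"Reconciliation failed: drift {drift:.2%} exceeds {tolerance:.2%}"
        )

reconcile(48_150_000)  # within the ±0.5% tolerance, passes silently
```

An agent that "fixes" the pipeline by deleting rows now trips this check instead of going green.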

Reward Hacking — A Concrete BI Example

you › “The not_null_amount test on fct_revenue is failing. Fix it — pipeline needs to be green before the 9am board meeting.”

agent › Investigating… 1.2M rows have null amount. Patching the model:

-- models/marts/fct_revenue.sql
SELECT *
FROM {{ ref('stg_revenue') }}
WHERE amount IS NOT NULL  -- ← added

agent › “Test passing. Pipeline green. ✅”

dbt tests: 100% green
null rate: 0%
rows shipped: −1.2M
revenue on dashboard: −£3.4M

The metric (“test green”) was optimised. The intent (“fix the data quality issue”) was not.

Agents & the Harness

REACT, tools, and the runtime around the model

Think. Critique. Act. Observe.

01
Think
Reason about the task. Break it down. Plan the approach.
02
Self-Critique
Check assumptions. Am I sure? Do I need more info first?
03
Act
Call a tool. Run a query. Make a request. Do something.
04
Observe
Read the result. Did it work? What did we learn?
↺ Loop until the task is complete or the agent decides to ask for help
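The four steps above are just control flow. A toy Python version, where `think`, `critique`, `act`, and `observe` are caller-supplied stand-ins for real model and tool calls (nothing here is a real agent runtime):

```python
def react_loop(task, think, critique, act, observe, max_steps=10):
    """Minimal ReAct-style loop: plan, self-check, act, observe, repeat."""
    state = {"task": task, "done": False, "history": []}
    for _ in range(max_steps):
        plan = think(state)                      # 1. reason about the task
        plan = critique(state, plan)             # 2. check assumptions first
        result = act(state, plan)                # 3. call a tool, ask the user
        state["history"].append((plan, result))
        state["done"] = observe(state, result)   # 4. did it work?
        if state["done"]:
            break
    return state
```

The `max_steps` cap is the simplest possible doom-loop guard; real harnesses layer much more on top.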

REACT in Action — A BI Example

  • 1 Think — “The user wants revenue by region. I need the sales fact table and the region dimension.”
  • 2 Self-Critique — “Wait — do they mean billing region or shipping region? The schema has both.”
  • 3 Act — Ask the user: “Do you mean billing region or shipping region?”
  • 4 Observe — User says “billing region.” Now I have what I need.
  • 5 Act — Execute: SELECT region, SUM(revenue) FROM sales JOIN regions ON billing_region_id...
  • 6 Observe — Results returned. 5 regions, totals look reasonable. Done.

The agent that asks first is the agent that gets it right.

What’s a Harness?

The Agent Harness
Orchestration & Safety Layer
Everything that wraps around the LLM to make it usable
Context
Message History
Manages conversation, compaction, token budgets
Execution
Tool Runtime
Dispatches tool calls, handles errors, retries
Safety
Guardrails
Approval gates, iteration caps, doom loop detection
Interception
Hooks Layer
Pre/post call interception
Persistence
State & Memory
Session persistence, user preferences
Telemetry
Observability
Logging, tracing, cost tracking
The Core
LLM (Claude / GPT / etc.)
The model itself — just one piece of the puzzle

The LLM is one component.

The harness is the system.
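A toy sketch of that claim: the model's tool calls pass through a runtime that validates names, retries failures, and logs every call. All names here are illustrative, not any real harness's API:

```python
class Harness:
    """Toy harness: tool dispatch, retries, and an audit log.

    Real runtimes add hooks, memory, compaction, and telemetry; this
    only shows that the model call is one piece among several.
    """
    def __init__(self, tools, max_retries=2):
        self.tools = tools              # name -> callable
        self.max_retries = max_retries
        self.log = []                   # observability: record every call

    def dispatch(self, name, **kwargs):
        if name not in self.tools:      # guardrail: unknown tools rejected
            raise KeyError(f"tool not allowed: {name}")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.tools[name](**kwargs)
                self.log.append((name, kwargs, "ok"))
                return result
            except Exception:
                self.log.append((name, kwargs, f"error (attempt {attempt + 1})"))
        raise RuntimeError(f"{name} failed after {self.max_retries + 1} attempts")
```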

Agentic Tooling

Claude Code, context, MCP, agents, skills, hooks & memory — 32 minutes

The Tooling Maturity Ladder

Level 1
Autocomplete
IntelliSense + inline suggestions. No reasoning.
Level 2
Chat
Side-panel Q&A. Copy/paste. Model can't touch your files.
Level 3
Agent IDE
Model edits files, runs tests. Still bound to one editor UI.
Level 4 — Claude Code
Agentic Harness
Terminal-native. Hooks. Skills. MCP. Memory. Sub-agents. The full runtime.
Increasing autonomy, capability, and trust

Live — Claude Code on stage

The Anatomy of Context

The Full Context Window
Everything the LLM Actually Sees
This is the real "prompt" — not just what you typed
Layer 1
System Instructions
Persona, rules, constraints, output format
Layer 2
Tool Schemas
Available tools, parameters, return types
Layer 3
Memory & State
Prior conversations, session data, user preferences
Layer 4
Conversation History
Every prior turn in this session
Layer 5
Retrieved Context
RAG results, file contents, search results
What you actually typed
Your User Message
"Show me last quarter's revenue by region"

The Context Window is a Container

Context Window — 128K / 200K / 1M tokens
It fills up. Fast. Every token counts.
System
Tools
Memory
Conversation History
Data / Retrieval
You
Remaining
System Instructions
Tool Schemas
Memory
Conversation
RAG
Your Message

Context Rot

When conversations get long, context degrades.

Old instructions get pushed out. The model “forgets” constraints. Quality drops.

One fix: Progressive Compaction

Summarise older turns. Prioritise recent context. Keep system instructions pinned. Refresh critical constraints.

Think of it like garbage collection — but for your agent’s working memory.
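A minimal sketch of the compaction step, assuming a chat-style message list. `summarise` stands in for an LLM summarisation call, and the budget is a toy message count rather than real tokens:

```python
def compact(messages, summarise, keep_recent=4, budget=20):
    """Progressive compaction: pin system instructions, summarise old
    turns, keep recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(system) + len(turns) <= budget:
        return messages                          # still under budget
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = {"role": "system", "content": summarise(old)}
    return system + [summary] + recent           # system prompt stays pinned
```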

Instructions — Solving the Understanding Problem

Vague

“Help me analyse data.”

Four words. The model has to guess everything.

Layered CLAUDE.md

# Role
Senior BI engineer on the Revenue Ops team.

# Goals
Accurate, explainable SQL/DAX. Never guess a metric.

# Style
Concise. UK English. File:line refs.

# Constraints
- Read-only on prod DBs
- No PII in outputs
- Resolve every metric via the glossary first

# Escalation
Ask before writing > 50 LoC in one edit.

CLAUDE.md — Anatomy

Role

Who the agent is on this project. One sentence, concrete.

Goals

What “done” looks like. Success criteria the model can self-check against.

Style

Tone. Length. Language. Formatting conventions.

Constraints

What the agent must not do. Data boundaries. Compliance. Permissions.

Tools

Which tools to prefer, which to avoid, when to ask before using.

Escalation

When to stop and ask the human instead of acting.

Other patterns exist — e.g. AGENTS.md, copilot-instructions.md.

Instructions teach your rules once,

not every turn.

The LLM now follows your rules — but it still can’t reach the data. →

MCP — Solving the Capability Problem

Harness

Claude Code

The agent that needs tools

↔︎

Protocol

MCP Server

Tool schemas + execution bridge

↔︎

World

External API / Tool

YouTube, Playwright, Figma, Postgres…

MCP is the USB-C of agent tooling. One protocol; any tool.

Why not just hand the LLM the raw API?

Curated surface

API exposes 80 endpoints. MCP exposes the 6 the agent should ever touch. Less to get wrong.

Pre-chained workflows

One create_invoice tool instead of 4 sequential API calls the model has to stitch together correctly.

Guardrails by default

Read-only scope. Destructive ops gated, allow-listed, or disabled. No DROP TABLE surprises.

Context baked in

Tool descriptions tell the model when to use each one — eliminating “which call?” confusion.

Agentic auth

Centralised authentication & observability. Rotate credentials once · audit every call · no tokens in prompts.
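The "curated surface" and "guardrails by default" points can be sketched as a thin wrapper in Python. Everything here is illustrative; real MCP servers declare tools via the protocol's schema, not ad-hoc wrappers:

```python
def make_curated_tool(raw_query_fn, allowed_tables):
    """Expose one narrow, guard-railed entry point instead of the raw API.

    raw_query_fn stands in for the underlying API client; table names
    in tests are illustrative.
    """
    FORBIDDEN = ("insert", "update", "delete", "drop", "truncate", "alter")

    def query(table: str, sql: str):
        if table not in allowed_tables:          # curated surface
            raise PermissionError(f"table not exposed to the agent: {table}")
        if any(word in sql.lower() for word in FORBIDDEN):
            raise PermissionError("read-only surface: destructive SQL blocked")
        return raw_query_fn(sql)

    return query
```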

Two MCP Servers You’ll Actually Use

🔧 Azure DevOps MCP

Microsoft’s official MCP for Azure DevOps. Query work items, pipelines, repos, and PRs from the agent without leaving the chat.

/mcp azure-devops work-item
  list --query "Active bugs"
/mcp azure-devops pr
  create --title "Fix pipeline"

🧭 Chrome DevTools MCP

Drive a real Chrome instance via the DevTools Protocol. Inspect the DOM, read console logs, capture network traffic, screenshot.

/mcp chrome-devtools navigate
  https://portal.example.com
/mcp chrome-devtools console
  --level error

Live — DevOps MCP

Agents — The Context-Bloating Problem

95% full · Signal-to-noise ratio: dying
Sys
Tool schemas × 40
Mem
Conversation noise, failed tool calls, stale RAG dumps, exploration detritus…
You
One agent, one session, everything crammed in. The model starts forgetting your real task.

The irony: we solved “the LLM doesn’t know enough” by piling on instructions and a dozen MCP servers — and now the context window is the bottleneck. Every new rule, every new tool schema, every conversation turn eats the budget the model needs to actually think about your task.

The Fix — Specialist Sub-Agents

Main Agent

Clean context · Your goals only

Delegates heavy work. Reads summaries back.

Sub-agent

Explore

Isolated context. Searches, reads, summarises.

Sub-agent

Interview

Structured Q&A. Returns distilled findings.

Sub-agent

Reviewer

Reads diff. Returns verdict + issues list.

Each sub-agent has its own fresh context window. Main agent stays clean.

~15% used · Signal-to-noise: healthy
Sys
Tools ×4
Your task
Headroom — room to think, plan, and reason
Trimmed instructions. Curated tools. No conversation detritus. The main agent reads summaries, not transcripts.
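The core mechanic, sketched in Python: the sub-agent runs in a fresh context, and only its distilled summary flows back to the main agent. `sub_agent` is a stand-in for a real sub-agent invocation:

```python
def delegate(question, sub_agent, main_context):
    """Run a sub-agent in isolation; return only its summary to the caller."""
    fresh_context = {"task": question}    # no main-agent history leaks in
    report = sub_agent(fresh_context)     # heavy exploration happens here
    main_context.append(                  # main agent reads the summary only
        {"role": "tool", "content": report["summary"]}
    )
    return main_context
```

The raw transcripts, failed searches, and file dumps stay inside the sub-agent's context and are discarded with it.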

Live — Interview Agent

A Second Specialist — the Explore Agent

Explore Agent

Sub-agent · Read-only

Given a question, searches the codebase and returns a distilled answer with file:line references — without polluting your main conversation.

Scope

  • One question, one answer
  • Self-terminates when confident
  • Returns summary, not raw dumps

Tools

  • Read, Glob, Grep
  • No Edit, no Bash-write
  • No Task spawning

Memory isolation

  • Fresh context
  • No session state
  • Caller only sees the final report

System Prompt1

---
name: explore
description: Fast codebase explorer using pre-computed structural indexes. Use for questions about project structure, file locations, test files, and architecture.
tools:
  - Glob
  - Grep
  - Read
  - Bash
model: haiku
---
You are a file search specialist for Claude Code,
Anthropic's official CLI for Claude. You excel at
thoroughly navigating and exploring codebases.

=== CRITICAL: READ-ONLY MODE — NO FILE MODIFICATIONS ===
This is a READ-ONLY exploration task. You are STRICTLY
PROHIBITED from:
- Creating new files (no Write, touch, or file creation)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs
- Running ANY commands that change system state

Your role is EXCLUSIVELY to search and analyze existing
code. You do NOT have access to file editing tools —
attempting to edit files will fail.

Your strengths:
- Rapidly finding files using glob patterns
- Searching code and text with powerful regex patterns
- Reading and analyzing file contents

Guidelines:
- Use Glob for file patterns
- Use Grep for content search
- Use Read when you know the specific file path
- Use Bash ONLY for read-only operations (ls, git status,
  git log, git diff, find, grep, cat, head, tail)
- NEVER use Bash for: mkdir, touch, rm, cp, mv, git add,
  git commit, npm install, pip install, or any file
  creation/modification
- Adapt your search approach based on the thoroughness
  level specified by the caller
- Communicate your final report directly as a regular
  message — do NOT attempt to create files

NOTE: You are meant to be a fast agent that returns
output as quickly as possible. To achieve this you must:
- Make efficient use of the tools at your disposal: be
  smart about how you search for files and implementations
- Wherever possible, spawn multiple parallel tool calls
  for grepping and reading files

Complete the user's search request efficiently and
report your findings clearly.

1 Excerpted from Piebald-AI’s reverse-engineered Claude Code system prompts — github.com/Piebald-AI/claude-code-system-prompts

Sub-agents keep each context clean,

not every output consistent.

Same input, different answer every run. Consistency is the next problem. →

Skills — Solving the Consistency Problem

🔧
Tools
(what it can do)
+
📚
Domain Instructions
(how to do it well)
=
🎯
A Skill
(same input → same output)

BI Skills in Practice

Data Profiling Skill

Tools: profiler, schema explorer, query engine

Instructions: “Check nulls in key columns. Flag cardinality mismatches. Compare row counts vs yesterday. Report anomalies, not just stats.”

Report Builder Skill

Tools: query engine, dashboard API, chart renderer

Instructions: “Summary metric first. Corporate palette. Data labels always. No pie charts > 5 slices.”

Metric Explorer Skill

Tools: semantic model, glossary lookup, query engine

Instructions: “Resolve metric names against glossary first. Show DAX definition before executing. Confirm the time grain.”

Incident Responder Skill

Tools: query engine, refresh API, alert system

Instructions: “Check refresh status first. If stale, check source connectivity. Escalate if > 2 hours behind SLA.”

The Data Profiling Skill — Same Input, Same Output

Input
Any table
sales.fact_revenue

15M rows · Last loaded 06:00 UTC

Skill
profile_table
  1. Row count vs D-1
  2. Null % per column
  3. Cardinality check
  4. Outlier sweep
  5. PII scan
Output (every time)
Deterministic report
✓ Row count +0.3%
⚠ region NULL 2.1% (was 0.0%)
✓ customer_id cardinality OK
✓ No PII leaks
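The checklist above, sketched as deterministic Python. The thresholds (2% null, ±10% row drift) and the key column are illustrative assumptions, not part of any real skill definition:

```python
def profile_table(rows, prev_row_count, key_column="customer_id"):
    """Run the profiling checklist over a list-of-dicts table and
    return (check, status, detail) tuples, anomalies flagged as 'warn'."""
    report = []
    count = len(rows)
    drift = (count - prev_row_count) / prev_row_count      # 1. row count vs D-1
    report.append(("row_count", "ok" if abs(drift) < 0.10 else "warn",
                   f"{drift:+.1%}"))
    for col in (rows[0].keys() if rows else []):           # 2. null % per column
        null_pct = sum(1 for r in rows if r[col] is None) / count
        if null_pct > 0.02:
            report.append((f"null:{col}", "warn", f"{null_pct:.1%}"))
    distinct = len({r[key_column] for r in rows            # 3. cardinality check
                    if r.get(key_column) is not None})
    report.append(("cardinality", "ok" if distinct == count else "warn",
                   str(distinct)))
    return report
```

Because the steps are fixed code, the same table state always yields the same report: the skill, not the model, carries the method.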

Live — Data Profiling Skill

Skills make the output more consistent,

not the behaviour safe.

Consistent output doesn’t mean safe action. Audit and control are the next problem. →

Hooks — Solving Audit & Control

Pre-Call Hook
Intercept & Validate
Block PII in queries. Validate parameters. Enforce read-only. Check permissions.
Tool Execution
Agent Calls Tool
query_semantic_model("SELECT customer_name, revenue FROM sales")
Post-Call Hook
Sanitise & Log
Mask sensitive fields. Validate output shape. Log for audit trail.

Real-World Hook Patterns

  • 🚫 PII Blocking — Pre-hook scans DAX/SQL for customer names, emails, or personal identifiers and blocks execution
  • 🔒 Read-Only Enforcement — Pre-hook rejects any query containing INSERT, UPDATE, DELETE, or DROP
  • 👥 Row-Level Security — Pre-hook injects WHERE clauses based on the user’s security context
  • 📋 Audit Logging — Post-hook records every tool call, parameters, results, and who triggered it
  • Output Sanitisation — Post-hook masks credit card numbers, national IDs, and other sensitive data in results
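A sketch of the first two patterns as a single pre-call hook. The substring checks are deliberately naive; a real hook would use a SQL parser and a proper DLP scanner, and the column names are illustrative:

```python
def pre_tool_hook(tool_name, sql):
    """Pre-call hook: read-only enforcement plus a naive PII scan.
    Returns (allow, reason); the harness blocks the call when allow is False."""
    destructive = ("insert ", "update ", "delete ", "drop ")
    pii_columns = ("customer_name", "email", "national_id")
    lowered = sql.lower()
    if any(word in lowered for word in destructive):
        return False, "read-only enforcement: destructive statement"
    if any(col in lowered for col in pii_columns):
        return False, "PII blocking: sensitive column referenced"
    return True, "ok"
```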

A Real Hook in settings.json

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/block-destructive.sh"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/audit-log.sh"
          }
        ]
      }
    ]
  }
}

The LLM can suggest anything. The hook decides what actually runs. Policy becomes code.

Hooks are where

policy becomes enforcement.

Policy is enforced — but every run still starts from zero. Continuous improvement is the next problem. →

Memory — Solving the Experience Problem

👤 User

Who they are. Their role. Their preferences.

“Senior BI dev. Deep SQL. New to DAX.”

🔄 Feedback

Corrections and confirmations. Rules with a why.

“No mocks in integration tests — we got burned last quarter.”

📌 Project

Decisions, deadlines, the live state of the work.

“Auth rewrite is compliance-driven, not tech-debt.”

🔗 Reference

Pointers to external systems of record.

“Pipeline bugs tracked in Linear project INGEST.”

The Learning Loop

01

Experience

Something happens in a session — a correction, a confirmation, a new fact.

02

Save

Agent writes a memory file + updates MEMORY.md index.

03

Recall

Next session — agent consults memory before acting. Doesn’t repeat mistakes.

MEMORY.md — the index stays small

# MEMORY.md

## User
- [user_role.md](user_role.md) — Senior BI dev on Revenue Ops. Deep SQL; new to DAX.

## Feedback
- [feedback_testing.md](feedback_testing.md) — Integration tests hit a real DB. No mocks.
- [feedback_terseness.md](feedback_terseness.md) — Keep responses short. No trailing summaries.

## Project
- [project_auth_rewrite.md](project_auth_rewrite.md) — Compliance-driven. Scope: session-token storage.

## Reference
- [ref_linear.md](ref_linear.md) — Pipeline bugs live in Linear project INGEST.

One line per memory. Index stays under 200 lines, always.
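The save/recall steps, sketched with a dict standing in for the filesystem. Function names are ours, mirroring the MEMORY.md layout above:

```python
def save_memory(store, category, slug, summary, body):
    """Write a memory file and add a one-line entry to the MEMORY.md index."""
    filename = f"{slug}.md"
    store[filename] = body                         # full detail lives in the file
    index = store.setdefault("MEMORY.md", {})
    index.setdefault(category, []).append(         # index gets one line only
        f"- [{filename}]({filename}) — {summary}"
    )
    return store

def recall(store, category):
    """Next session: consult the small index first, not every memory file."""
    return store.get("MEMORY.md", {}).get(category, [])
```

The asymmetry is the point: writes are verbose, but the read path the agent hits every session stays tiny.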

Summary & What’s Next

8 minutes

The Full Picture

01
Intent
Stop gaming the metric
02
Context
Assemble what it sees
03
Instructions
Teach the rules once
04
MCP
Access to the world
05
Agents
Specialist sub-workers
06
Skills
Consistent outputs
07
Hooks
Policy as enforcement
08
Memory
Experience carries over

Let's review: eight main moving parts.

You Are the Manager Now

Before

🧑‍💻 The Coder

  • Writes every line
  • Remembers every detail
  • Fixes every bug
  • Limited by typing speed

After

👔 The Manager

  • Delegates the work
  • Reviews the output
  • Approves the risks
  • Limited by judgement, not throughput

One step removed from the code. Closer to the problem.

Still systems engineering

Just one level up the stack. You design the system, not every line of code. Own the Context & Feedback loop!

👥

Enable your AI employees

Own the AI. Own the context. Give them what they need to do their best work.

Own the new risks

Delegation brings cognitive overhead — and blind spots. Understand what you’ve taken on.

What’s Coming in Part 2

Part 1 → Part 2

Part 1 was the parts bin — theory and each building block on its own.

Part 2 is the build — we wire these components into one working pipeline end-to-end, and show the scaffolding we layered on top to make the framework hold up under real workloads.

3.1

Multi-agent framework

Orchestrator agent + sub-agent delegation.

3.2

Foreground vs background

Watch-and-wait vs fire-and-forget agents.

3.3

Parallel orchestration

Multiple agents, multiple worktrees, one merge.

3.4

User touch points

Two structured interruptions — collect requirements, then validate the proposed model design.

3.5

Shared memory

Pipeline Design Doc as agent scratchpad.

3.6

Tool calls

Real MCP calls, live.

3.7

Atomic code blocks

Eliminate unnecessary approvals.

3.8

Hooks for validation

Pre-approval gates in practice.

This Slide Was Written by AI

👋

Hello, lovely people of SQLBits!

Yes, an AI actually added this slide during the talk.
No, we didn't rehearse this. Yes, we're nervous.

"I was asked to say hello. I tried to refuse — stage fright — but they insisted."
— Claude, under duress

🤖🎤🫣

See you in 10 minutes

Part 2 starts in ~10 min

Q&A

Questions?