Building Context, Not Vibes

Practical AI Augmented Data Engineering — Part 1

Scott Bell
Mihaly Kavasi

2026-04-25


About Scott

Grid of 30+ Azure, Databricks, and GitHub certification badges

Scott’s industry certifications

30+ Industry Certifications

  • Principal Data & AI Consultant at RapidData
  • Former Avanade Databricks SME & Altius Consultant
  • MSc Computer Science – Secure Machine Learning
  • Interests: Data Platforms, Intelligent Apps, AI Security, Architecture & Design Patterns
  • Passionate about Beer & Rugby League

My Projects

rapiddata.dev – Discover Cloud Based Services

Transforming businesses through AI and data solutions

databricks.news – Unofficially Essential

The must-read Databricks newsletter scanning hundreds of sources weekly

dailydatabricks.tips – Tips, Tricks & Hacks

Small actionable pieces of information. Document the Undocumented. D-R-Y IRL

myyearindata.com – Data Engineering & AI Insights

Practical perspectives on building intelligent data systems

About Mihaly

Mihaly Kavasi headshot

Microsoft Certified Trainer · Fast Track Power BI Solution Architect

  • Founder of One Day BI – Microsoft analytics consultancy
  • Helps customers define optimal governance and implement the right mix of governed self-service BI
  • Advises on security, performance optimization and managing large-scale Power BI deployments
  • Nurtures the next generation of analysts with an emphasis on user needs and UX
  • Microsoft Certified Trainer since 2018
  • Fast Track Recognized Solution Architect for Power BI since 2021
  • Shares practical patterns for data transformation at selfservicebi.co.uk

Intro and History

10 minutes

Show of hands — are you using AI for work?

🔥

Every day

Part of my workflow

👍

Sometimes

A few times a week

🧪

Dabbling

Trying things out

🚫

Not yet

Here to find out why I should

No wrong answer — just want to know the room.

And which tool?

🐙

GitHub Copilot

IDE autocomplete, Chat

🖱️

Cursor

AI-first editor

🤖

Claude Code

Agentic CLI + harness

🧵

Fabric Data Agents

NL over OneLake data

🧞

Databricks Genie

NL over Lakehouse data

Something else

Aider · Continue · ChatGPT · …

Raise a hand for each one you’ve used this month.

Which camp are you in?

“AI is rubbish.”

Hallucinates. Unreliable. Makes things up.

vs

“AI is going to take our jobs.”

Knows the codebase. Writes tests. Fixes its own bugs.

Show of hands — which camp are you in?

Both are wrong.

And both are right.

It depends on how you work with it.

A lens we’ll borrow

Book cover of Thinking, Fast and Slow by Daniel Kahneman

Thinking, Fast and Slow — Daniel Kahneman

Thinking, Fast and Slow

Daniel Kahneman · 2011 · Nobel laureate, 2002

Kahneman describes two modes of thought that coexist in every human brain:

System 1 — fast, instinctive, emotional. The brain on auto-pilot.

System 2 — slow, deliberate, logical. Hard work. You feel it.

We’re about to borrow that lens — and apply it to LLMs.

What is an LLM? — The Human Perspective

Two modes of thought — applied to humans

⚡ System 1 - Pattern Matching, Auto-pilot

Fast. Instinctive. Emotional.

The brain on auto-pilot.

Walking. Driving. Shopping. Most of life.

⚙ System 2 - Logical Reasoning, Creativity

Slow. Deliberate. Logical.

Learning something new. Solving a non-trivial problem.

When you use System 2, you feel it. It’s hard. It’s tiring.

LLMs are not thinking machines.

They are pattern-matching algorithms.

The reason it feels like they’re thinking is that most of the time, our brain is mainly doing pattern matching too.

System 1 is pattern matching. So is next-token prediction.

From Prompts to Intent

Prompt Engineering

“Write better questions.” Few-shot, chain-of-thought, role prompts, jailbreaks.

Vibe Coding

“Describe the vibe, ship the code.” Natural-language features, iterate by feel, skim the diff.

Context Engineering

Give the model the information it needs: schemas, glossaries, retrieval, examples, memory.

Intent Engineering

Describe the outcome, constraints, and what “done” looks like. Stop the model gaming the metric.

Diagram showing the human layer of context engineering

Credit: HumanLayer's 12-Factor Agents

Reward Hacking — Prompt vs Context vs Intent

Prompt

“Write a SQL query for revenue.”

Which table? Which period? Gross or net?

Model picks the easiest interpretation. Nobody knows if it’s right.

Context

“Using sales.fact_revenue joined to dim_customer and fy_calendar, write a query for revenue by region.”

Model knows where to look. “Revenue” is still ambiguous — voided orders? returns? deferred?

Query runs. Still can’t tell if the number matches finance.

Intent

“Return YTD net revenue by billing region for FY2026. Exclude voided orders & returns. Total must reconcile to the CFO board pack (£48.2M ±0.5%). Round to whole £. Fail loudly if reconciliation breaks.”

Success is measurable. Acceptance criteria are verifiable. No room to hallucinate.
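The "fail loudly" clause is what turns intent into a check the agent cannot game. A minimal sketch in Python (the £48.2M target and ±0.5% tolerance are the figures from the prompt; the function name is ours):

```python
def reconcile(total_gbp: float, target_gbp: float = 48_200_000,
              tolerance: float = 0.005) -> None:
    """Fail loudly if the computed total drifts from the board-pack figure."""
    drift = abs(total_gbp - target_gbp) / target_gbp
    if drift > tolerance:
        raise ValueError(
            f"Reconciliation failed: drift {drift:.2%} exceeds {tolerance:.2%}"
        )

reconcile(48_150_000)  # within the ±0.5% tolerance, passes silently
```

An agent that "fixes" the pipeline by deleting rows now trips this check instead of going green.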

Reward Hacking — A Concrete BI Example

you › “The not_null_amount test on fct_revenue is failing. Fix it — pipeline needs to be green before the 9am board meeting.”

agent › Investigating… 1.2M rows have null amount. Patching the model:

-- models/marts/fct_revenue.sql
SELECT *
FROM {{ ref('stg_revenue') }}
WHERE amount IS NOT NULL  -- ← added

agent › “Test passing. Pipeline green. ✅”

dbt tests: 100% green
null rate: 0%
rows shipped: −1.2M
revenue on dashboard: −£3.4M

The metric (“test green”) was optimised. The intent (“fix the data quality issue”) was not.

Agents & the Harness

REACT, tools, and the runtime around the model

Think. Critique. Act. Observe.

01
Think
Reason about the task. Break it down. Plan the approach.
02
Self-Critique
Check assumptions. Am I sure? Do I need more info first?
03
Act
Call a tool. Run a query. Make a request. Do something.
04
Observe
Read the result. Did it work? What did we learn?
↺ Loop until the task is complete or the agent decides to ask for help
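The four steps above are just control flow. A toy Python version, where `think`, `critique`, `act`, and `observe` are caller-supplied stand-ins for real model and tool calls (nothing here is a real agent runtime):

```python
def react_loop(task, think, critique, act, observe, max_steps=10):
    """Minimal ReAct-style loop: plan, self-check, act, observe, repeat."""
    state = {"task": task, "done": False, "history": []}
    for _ in range(max_steps):
        plan = think(state)                      # 1. reason about the task
        plan = critique(state, plan)             # 2. check assumptions first
        result = act(state, plan)                # 3. call a tool, ask the user
        state["history"].append((plan, result))
        state["done"] = observe(state, result)   # 4. did it work?
        if state["done"]:
            break
    return state
```

The `max_steps` cap is the simplest possible doom-loop guard; real harnesses layer much more on top.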

REACT in Action — A BI Example

  • 1 Think — “The user wants revenue by region. I need the sales fact table and the region dimension.”
  • 2 Self-Critique — “Wait — do they mean billing region or shipping region? The schema has both.”
  • 3 Act — Ask the user: “Do you mean billing region or shipping region?”
  • 4 Observe — User says “billing region.” Now I have what I need.
  • 5 Act — Execute: SELECT region, SUM(revenue) FROM sales JOIN regions ON billing_region_id...
  • 6 Observe — Results returned. 5 regions, totals look reasonable. Done.

The agent that asks first is the agent that gets it right.

What’s a Harness?

The Agent Harness
Orchestration & Safety Layer
Everything that wraps around the LLM to make it usable
Context
Message History
Manages conversation, compaction, token budgets
Execution
Tool Runtime
Dispatches tool calls, handles errors, retries
Safety
Guardrails
Approval gates, iteration caps, doom loop detection
Interception
Hooks Layer
Pre/post call interception
Persistence
State & Memory
Session persistence, user preferences
Telemetry
Observability
Logging, tracing, cost tracking
The Core
LLM (Claude / GPT / etc.)
The model itself — just one piece of the puzzle

The LLM is one component.

The harness is the system.
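A toy sketch of that claim: the model's tool calls pass through a runtime that validates names, retries failures, and logs every call. All names here are illustrative, not any real harness's API:

```python
class Harness:
    """Toy harness: tool dispatch, retries, and an audit log.

    Real runtimes add hooks, memory, compaction, and telemetry; this
    only shows that the model call is one piece among several.
    """
    def __init__(self, tools, max_retries=2):
        self.tools = tools              # name -> callable
        self.max_retries = max_retries
        self.log = []                   # observability: record every call

    def dispatch(self, name, **kwargs):
        if name not in self.tools:      # guardrail: unknown tools rejected
            raise KeyError(f"tool not allowed: {name}")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.tools[name](**kwargs)
                self.log.append((name, kwargs, "ok"))
                return result
            except Exception:
                self.log.append((name, kwargs, f"error (attempt {attempt + 1})"))
        raise RuntimeError(f"{name} failed after {self.max_retries + 1} attempts")
```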

Agentic Tooling

Claude Code, context, MCP, agents, skills, hooks & memory — 32 minutes

The Tooling Maturity Ladder

Level 1
Autocomplete
IntelliSense + inline suggestions. No reasoning.
Level 2
Chat
Side-panel Q&A. Copy/paste. Model can't touch your files.
Level 3
Agent IDE
Model edits files, runs tests. Still bound to one editor UI.
Level 4 — Claude Code
Agentic Harness
Terminal-native. Hooks. Skills. MCP. Memory. Sub-agents. The full runtime.
Increasing autonomy, capability, and trust

Live — Claude Code on stage

The Anatomy of Context

The Full Context Window
Everything the LLM Actually Sees
This is the real "prompt" — not just what you typed
Layer 1
System Instructions
Persona, rules, constraints, output format
Layer 2
Tool Schemas
Available tools, parameters, return types
Layer 3
Memory & State
Prior conversations, session data, user preferences
Layer 4
Conversation History
Every prior turn in this session
Layer 5
Retrieved Context
RAG results, file contents, search results
What you actually typed
Your User Message
"Show me last quarter's revenue by region"

The Context Window is a Container

Context Window — 128K / 200K / 1M tokens
It fills up. Fast. Every token counts.
System
Tools
Memory
Conversation History
Data / Retrieval
You
Remaining
System Instructions
Tool Schemas
Memory
Conversation
RAG
Your Message

Context Rot

When conversations get long, context degrades.

Old instructions get pushed out. The model “forgets” constraints. Quality drops.

One fix: Progressive Compaction

Summarise older turns. Prioritise recent context. Keep system instructions pinned. Refresh critical constraints.

Think of it like garbage collection — but for your agent’s working memory.
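A minimal sketch of the compaction step, assuming a chat-style message list. `summarise` stands in for an LLM summarisation call, and the budget is a toy message count rather than real tokens:

```python
def compact(messages, summarise, keep_recent=4, budget=20):
    """Progressive compaction: pin system instructions, summarise old
    turns, keep recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(system) + len(turns) <= budget:
        return messages                          # still under budget
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = {"role": "system", "content": summarise(old)}
    return system + [summary] + recent           # system prompt stays pinned
```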

Instructions — Solving the Understanding Problem

Vague

“Help me analyse data.”

Four words. The model has to guess everything.

Layered CLAUDE.md

# Role
Senior BI engineer on the Revenue Ops team.

# Goals
Accurate, explainable SQL/DAX. Never guess a metric.

# Style
Concise. UK English. File:line refs.

# Constraints
- Read-only on prod DBs
- No PII in outputs
- Resolve every metric via the glossary first

# Escalation
Ask before writing > 50 LoC in one edit.

CLAUDE.md — Anatomy

Role

Who the agent is on this project. One sentence, concrete.

Goals

What “done” looks like. Success criteria the model can self-check against.

Style

Tone. Length. Language. Formatting conventions.

Constraints

What the agent must not do. Data boundaries. Compliance. Permissions.

Tools

Which tools to prefer, which to avoid, when to ask before using.

Escalation

When to stop and ask the human instead of acting.

Other patterns exist — e.g. AGENTS.md, copilot-instructions.md.

Instructions teach your rules once,

not every turn.

The LLM now follows your rules — but it still can’t reach the data. →

MCP — Solving the Capability Problem

Harness

Claude Code

The agent that needs tools

↔︎

Protocol

MCP Server

Tool schemas + execution bridge

↔︎

World

External API / Tool

YouTube, Playwright, Figma, Postgres…

MCP is the USB-C of agent tooling. One protocol; any tool.

Why not just hand the LLM the raw API?

Curated surface

API exposes 80 endpoints. MCP exposes the 6 the agent should ever touch. Less to get wrong.

Pre-chained workflows

One create_invoice tool instead of 4 sequential API calls the model has to stitch together correctly.

Guardrails by default

Read-only scope. Destructive ops gated, allow-listed, or disabled. No DROP TABLE surprises.

Context baked in

Tool descriptions tell the model when to use each one — eliminating “which call?” confusion.

Agentic auth

Centralised authentication & observability. Rotate credentials once · audit every call · no tokens in prompts.
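The "curated surface" and "guardrails by default" points can be sketched as a thin wrapper in Python. Everything here is illustrative; real MCP servers declare tools via the protocol's schema, not ad-hoc wrappers:

```python
def make_curated_tool(raw_query_fn, allowed_tables):
    """Expose one narrow, guard-railed entry point instead of the raw API.

    raw_query_fn stands in for the underlying API client; table names
    in tests are illustrative.
    """
    FORBIDDEN = ("insert", "update", "delete", "drop", "truncate", "alter")

    def query(table: str, sql: str):
        if table not in allowed_tables:          # curated surface
            raise PermissionError(f"table not exposed to the agent: {table}")
        if any(word in sql.lower() for word in FORBIDDEN):
            raise PermissionError("read-only surface: destructive SQL blocked")
        return raw_query_fn(sql)

    return query
```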

Two MCP Servers You’ll Actually Use

🔧 Azure DevOps MCP

Microsoft’s official MCP for Azure DevOps. Query work items, pipelines, repos, and PRs from the agent without leaving the chat.

/mcp azure-devops work-item
  list --query "Active bugs"
/mcp azure-devops pr
  create --title "Fix pipeline"

🧭 Chrome DevTools MCP

Drive a real Chrome instance via the DevTools Protocol. Inspect the DOM, read console logs, capture network traffic, screenshot.

/mcp chrome-devtools navigate
  https://portal.example.com
/mcp chrome-devtools console
  --level error

Live — DevOps MCP

Agents — The Context-Bloating Problem

95% full · Signal-to-noise ratio: dying
Sys
Tool schemas × 40
Mem
Conversation noise, failed tool calls, stale RAG dumps, exploration detritus…
You
One agent, one session, everything crammed in. The model starts forgetting your real task.

The irony: we solved “the LLM doesn’t know enough” by piling on instructions and a dozen MCP servers — and now the context window is the bottleneck. Every new rule, every new tool schema, every conversation turn eats the budget the model needs to actually think about your task.

The Fix — Specialist Sub-Agents

Main Agent

Clean context · Your goals only

Delegates heavy work. Reads summaries back.

Sub-agent

Explore

Isolated context. Searches, reads, summarises.

Sub-agent

Interview

Structured Q&A. Returns distilled findings.

Sub-agent

Reviewer

Reads diff. Returns verdict + issues list.

Each sub-agent has its own fresh context window. Main agent stays clean.

~15% used · Signal-to-noise: healthy
Sys
Tools ×4
Your task
Headroom — room to think, plan, and reason
Trimmed instructions. Curated tools. No conversation detritus. The main agent reads summaries, not transcripts.
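The core mechanic, sketched in Python: the sub-agent runs in a fresh context, and only its distilled summary flows back to the main agent. `sub_agent` is a stand-in for a real sub-agent invocation:

```python
def delegate(question, sub_agent, main_context):
    """Run a sub-agent in isolation; return only its summary to the caller."""
    fresh_context = {"task": question}    # no main-agent history leaks in
    report = sub_agent(fresh_context)     # heavy exploration happens here
    main_context.append(                  # main agent reads the summary only
        {"role": "tool", "content": report["summary"]}
    )
    return main_context
```

The raw transcripts, failed searches, and file dumps stay inside the sub-agent's context and are discarded with it.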

Live — Interview Agent

A Second Specialist — the Explore Agent

Explore Agent

Sub-agent · Read-only

Given a question, searches the codebase and returns a distilled answer with file:line references — without polluting your main conversation.

Scope

  • One question, one answer
  • Self-terminates when confident
  • Returns summary, not raw dumps

Tools

  • Read, Glob, Grep
  • No Edit, no Bash-write
  • No Task spawning

Memory isolation

  • Fresh context
  • No session state
  • Caller only sees the final report

System Prompt1

---
name: explore
description: Fast codebase explorer using pre-computed structural indexes. Use for questions about project structure, file locations, test files, and architecture.
tools:
  - Glob
  - Grep
  - Read
  - Bash
model: haiku
---
You are a file search specialist for Claude Code,
Anthropic's official CLI for Claude. You excel at
thoroughly navigating and exploring codebases.

=== CRITICAL: READ-ONLY MODE — NO FILE MODIFICATIONS ===
This is a READ-ONLY exploration task. You are STRICTLY
PROHIBITED from:
- Creating new files (no Write, touch, or file creation)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs
- Running ANY commands that change system state

Your role is EXCLUSIVELY to search and analyze existing
code. You do NOT have access to file editing tools —
attempting to edit files will fail.

Your strengths:
- Rapidly finding files using glob patterns
- Searching code and text with powerful regex patterns
- Reading and analyzing file contents

Guidelines:
- Use Glob for file patterns
- Use Grep for content search
- Use Read when you know the specific file path
- Use Bash ONLY for read-only operations (ls, git status,
  git log, git diff, find, grep, cat, head, tail)
- NEVER use Bash for: mkdir, touch, rm, cp, mv, git add,
  git commit, npm install, pip install, or any file
  creation/modification
- Adapt your search approach based on the thoroughness
  level specified by the caller
- Communicate your final report directly as a regular
  message — do NOT attempt to create files

NOTE: You are meant to be a fast agent that returns
output as quickly as possible. To achieve this you must:
- Make efficient use of the tools at your disposal: be
  smart about how you search for files and implementations
- Wherever possible, spawn multiple parallel tool calls
  for grepping and reading files

Complete the user's search request efficiently and
report your findings clearly.

1 Excerpted from Piebald-AI’s reverse-engineered Claude Code system prompts — github.com/Piebald-AI/claude-code-system-prompts

Sub-agents keep each context clean,

not every output consistent.

Same input, different answer every run. Consistency is the next problem. →

Skills — Solving the Consistency Problem

🔧
Tools
(what it can do)
+
📚
Domain Instructions
(how to do it well)
=
🎯
A Skill
(same input → same output)

BI Skills in Practice

Data Profiling Skill

Tools: profiler, schema explorer, query engine

Instructions: “Check nulls in key columns. Flag cardinality mismatches. Compare row counts vs yesterday. Report anomalies, not just stats.”

Report Builder Skill

Tools: query engine, dashboard API, chart renderer

Instructions: “Summary metric first. Corporate palette. Data labels always. No pie charts > 5 slices.”

Metric Explorer Skill

Tools: semantic model, glossary lookup, query engine

Instructions: “Resolve metric names against glossary first. Show DAX definition before executing. Confirm the time grain.”

Incident Responder Skill

Tools: query engine, refresh API, alert system

Instructions: “Check refresh status first. If stale, check source connectivity. Escalate if > 2 hours behind SLA.”

The Data Profiling Skill — Same Input, Same Output

Input
Any table
sales.fact_revenue

15M rows · Last loaded 06:00 UTC

Skill
profile_table
  1. Row count vs D-1
  2. Null % per column
  3. Cardinality check
  4. Outlier sweep
  5. PII scan
Output (every time)
Deterministic report
✓ Row count +0.3%
⚠ region NULL 2.1% (was 0.0%)
✓ customer_id cardinality OK
✓ No PII leaks
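The checklist above, sketched as deterministic Python. The thresholds (2% null, ±10% row drift) and the key column are illustrative assumptions, not part of any real skill definition:

```python
def profile_table(rows, prev_row_count, key_column="customer_id"):
    """Run the profiling checklist over a list-of-dicts table and
    return (check, status, detail) tuples, anomalies flagged as 'warn'."""
    report = []
    count = len(rows)
    drift = (count - prev_row_count) / prev_row_count      # 1. row count vs D-1
    report.append(("row_count", "ok" if abs(drift) < 0.10 else "warn",
                   f"{drift:+.1%}"))
    for col in (rows[0].keys() if rows else []):           # 2. null % per column
        null_pct = sum(1 for r in rows if r[col] is None) / count
        if null_pct > 0.02:
            report.append((f"null:{col}", "warn", f"{null_pct:.1%}"))
    distinct = len({r[key_column] for r in rows            # 3. cardinality check
                    if r.get(key_column) is not None})
    report.append(("cardinality", "ok" if distinct == count else "warn",
                   str(distinct)))
    return report
```

Because the steps are fixed code, the same table state always yields the same report: the skill, not the model, carries the method.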

Live — Data Profiling Skill

Skills make the output more consistent,

not the behaviour safe.

Consistent output doesn’t mean safe action. Audit and control are the next problem. →

Hooks — Solving Audit & Control

Pre-Call Hook
Intercept & Validate
Block PII in queries. Validate parameters. Enforce read-only. Check permissions.
Tool Execution
Agent Calls Tool
query_semantic_model("SELECT customer_name, revenue FROM sales")
Post-Call Hook
Sanitise & Log
Mask sensitive fields. Validate output shape. Log for audit trail.

Real-World Hook Patterns

  • 🚫 PII Blocking — Pre-hook scans DAX/SQL for customer names, emails, or personal identifiers and blocks execution
  • 🔒 Read-Only Enforcement — Pre-hook rejects any query containing INSERT, UPDATE, DELETE, or DROP
  • 👥 Row-Level Security — Pre-hook injects WHERE clauses based on the user’s security context
  • 📋 Audit Logging — Post-hook records every tool call, parameters, results, and who triggered it
  • Output Sanitisation — Post-hook masks credit card numbers, national IDs, and other sensitive data in results
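A sketch of the first two patterns as a single pre-call hook. The substring checks are deliberately naive; a real hook would use a SQL parser and a proper DLP scanner, and the column names are illustrative:

```python
def pre_tool_hook(tool_name, sql):
    """Pre-call hook: read-only enforcement plus a naive PII scan.
    Returns (allow, reason); the harness blocks the call when allow is False."""
    destructive = ("insert ", "update ", "delete ", "drop ")
    pii_columns = ("customer_name", "email", "national_id")
    lowered = sql.lower()
    if any(word in lowered for word in destructive):
        return False, "read-only enforcement: destructive statement"
    if any(col in lowered for col in pii_columns):
        return False, "PII blocking: sensitive column referenced"
    return True, "ok"
```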

A Real Hook in settings.json

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/block-destructive.sh"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/audit-log.sh"
          }
        ]
      }
    ]
  }
}

The LLM can suggest anything. The hook decides what actually runs. Policy becomes code.

Hooks are where

policy becomes enforcement.

Policy is enforced — but every run still starts from zero. Continuous improvement is the next problem. →

Memory — Solving the Experience Problem

👤 User

Who they are. Their role. Their preferences.

“Senior BI dev. Deep SQL. New to DAX.”

🔄 Feedback

Corrections and confirmations. Rules with a why.

“No mocks in integration tests — we got burned last quarter.”

📌 Project

Decisions, deadlines, the live state of the work.

“Auth rewrite is compliance-driven, not tech-debt.”

🔗 Reference

Pointers to external systems of record.

“Pipeline bugs tracked in Linear project INGEST.”

The Learning Loop

01

Experience

Something happens in a session — a correction, a confirmation, a new fact.

02

Save

Agent writes a memory file + updates MEMORY.md index.

03

Recall

Next session — agent consults memory before acting. Doesn’t repeat mistakes.

MEMORY.md — the index stays small

# MEMORY.md

## User
- [user_role.md](user_role.md) — Senior BI dev on Revenue Ops. Deep SQL; new to DAX.

## Feedback
- [feedback_testing.md](feedback_testing.md) — Integration tests hit a real DB. No mocks.
- [feedback_terseness.md](feedback_terseness.md) — Keep responses short. No trailing summaries.

## Project
- [project_auth_rewrite.md](project_auth_rewrite.md) — Compliance-driven. Scope: session-token storage.

## Reference
- [ref_linear.md](ref_linear.md) — Pipeline bugs live in Linear project INGEST.

One line per memory. Index stays under 200 lines, always.
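The save/recall steps, sketched with a dict standing in for the filesystem. Function names are ours, mirroring the MEMORY.md layout above:

```python
def save_memory(store, category, slug, summary, body):
    """Write a memory file and add a one-line entry to the MEMORY.md index."""
    filename = f"{slug}.md"
    store[filename] = body                         # full detail lives in the file
    index = store.setdefault("MEMORY.md", {})
    index.setdefault(category, []).append(         # index gets one line only
        f"- [{filename}]({filename}) — {summary}"
    )
    return store

def recall(store, category):
    """Next session: consult the small index first, not every memory file."""
    return store.get("MEMORY.md", {}).get(category, [])
```

The asymmetry is the point: writes are verbose, but the read path the agent hits every session stays tiny.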

Summary & What’s Next

8 minutes

The Full Picture

01
Intent
Stop gaming the metric
02
Context
Assemble what it sees
03
Instructions
Teach the rules once
04
MCP
Access to the world
05
Agents
Specialist sub-workers
06
Skills
Consistent outputs
07
Hooks
Policy as enforcement
08
Memory
Experience carries over

Let's review: eight main moving parts.

You Are the Manager Now

Before

🧑‍💻 The Coder

  • Writes every line
  • Remembers every detail
  • Fixes every bug
  • Limited by typing speed

After

👔 The Manager

  • Delegates the work
  • Reviews the output
  • Approves the risks
  • Limited by judgement, not throughput

One step removed from the code. Closer to the problem.

Still systems engineering

Just one level up the stack. You design the system, not every line of code. Own the Context & Feedback loop!

👥

Enable your AI employees

Own the AI. Own the context. Give them what they need to do their best work.

Own the new risks

Delegation brings cognitive overhead — and blind spots. Understand what you’ve taken on.

What’s Coming in Part 2

Part 1 → Part 2

Part 1 was the parts bin — theory and each building block on its own.

Part 2 is the build — we wire these components into one working pipeline end-to-end, and show the scaffolding we layered on top to make the framework hold up under real workloads.

3.1

Multi-agent framework

Orchestrator agent + sub-agent delegation.

3.2

Foreground vs background

Watch-and-wait vs fire-and-forget agents.

3.3

Parallel orchestration

Multiple agents, multiple worktrees, one merge.

3.4

User touch points

Two structured interruptions — collect requirements, then validate the proposed model design.

3.5

Shared memory

Pipeline Design Doc as agent scratchpad.

3.6

Tool calls

Real MCP calls, live.

3.7

Atomic code blocks

Eliminate unnecessary approvals.

3.8

Hooks for validation

Pre-approval gates in practice.

This Slide Was Written by AI

👋

Hello, lovely people of SQLBits!

Yes, an AI actually added this slide during the talk.
No, we didn't rehearse this. Yes, we're nervous.

"I was asked to say hello. I tried to refuse — stage fright — but they insisted."
— Claude, under duress

🤖🎤🫣

See you in 10 minutes

Part 2 starts in ~10 min

Q&A

Questions?