v0.2.1 — AI Migration Agent | Cerebras & Groq cloud

taldbt
AI Migration Agent

Convert legacy Talend ETL to modern dbt SQL in minutes. Semantic AI agent that reads your Talend jobs, understands the logic, and writes production dbt SQL. Full workflow orchestration with Temporal.

🐍 pip install taldbt v0.2.1
🦆 DuckDB + Flock
🔄 dbt-core
Temporal.io
🧠 Qwen3-Coder AI
549 — Component KB entries
96% — Deterministic
100% — dbt Pass Rate
~35s — Full Pipeline
Why taldbt

Not another regex mapper

taldbt is a Gen 3 AI agent. It parses Talend XML into an AST, infers intent, and generates modular CTE-based dbt SQL autonomously.

🧠
Semantic AST
XML → Abstract Syntax Tree → intent inference. Not string replacement.
🤖
AI Translation
Qwen3-Coder translates Java expressions. Knowledge base handles 96% deterministically.
🔍
Smart Validation
Row counts, NULL analysis, key uniqueness, Java artifact detection. Per-model.
Temporal Orchestration
DAG-aware workflow execution. Parallel jobs, retry logic, full history.
🧪
Synthetic Test Data
Faker generates realistic data. FK patching. 60+ column patterns recognized.
🦆
DuckDB + Flock
In-process analytics with LLM-in-SQL via the flock extension. Zero infrastructure.
How It Works

Four steps to modern dbt

From legacy Talend XML to production-ready dbt project with full validation.

01
Load Your Talend Project
Point to your exported Talend project folder or upload a ZIP. taldbt scans every .item file, extracts components, schemas, connections, and context parameters.
  • Folder path or ZIP upload
  • Auto-detects project structure
  • Classifies jobs, joblets, contexts
  • Works with any Talend version
Step 1: Load Talend project
02
Review & Inspect
Full X-ray of your project: jobs with confidence scores, source schemas with column types and keys, orchestration chains, dependency DAG, and dead job detection.
  • 24 jobs, 53 sources, 114 components scanned
  • Talend screenshots with tMap visualization
  • Orchestration chain: Master_Job → SubJobs
  • DAG build order with dead job warnings
03
Configure & Migrate
Choose Quick (dbt models only) or AutoPilot (full pipeline: models + test data + validation + Temporal). Toggle AI assist, skip dead jobs, configure test data rows.
  • AI assist powered by local Ollama or cloud Cerebras
  • Smart dead job detection and skip
  • Synthetic test data with Faker (60+ patterns)
  • AutoPilot runs the entire pipeline hands-free
04
Results & Temporal
20/20 models pass, 100% validation, 55 test tables generated. Download the dbt project ZIP or launch Temporal to orchestrate the workflow with full DAG-aware execution and dashboard.
  • Per-model validation: schema, NULL, key uniqueness, Java artifacts
  • One-click Temporal launch with Master_Job workflow
  • 35-second full pipeline execution
  • Download production-ready dbt project
Technology

Built with the modern data stack

Every component chosen for a reason. No bloat, no unnecessary abstractions.

Pipeline Architecture
lxml Parser
XML → AST
networkx DAG
Dependency Graph
Code Generator
CTE-based SQL
LLM Review
Ollama / Cerebras
DuckDB + dbt
Validate + Run
Temporal.io
DAG Orchestration
Production dbt
Ready to deploy
🦆
DuckDB + Flock Extension
In-process OLAP engine with the flock extension for LLM-in-SQL queries. Enables semantic validation via llm_complete() inside SQL. Auto-registers Ollama or cloud models.
Data Engine
🔄
dbt-core + dbt-duckdb
Industry-standard transformation framework. Generates {{ config() }}, {{ source() }}, {{ ref() }} Jinja tags. Full project scaffolding with profiles.yml.
Transform
Temporal.io
Workflow orchestration that mirrors Talend's job chains. Parent-child workflows, parallel execution, retry policies, full event history dashboard.
Orchestration
🧠
Ollama / Cerebras / Groq
Local Qwen3-Coder:30B via Ollama, or cloud Qwen3-235B via Cerebras (free). Auto-fallback chain with health checks. Only ~5% of expressions need AI.
AI Engine
🔬
sqlglot
Multi-dialect SQL parser and transpiler. Validates generated SQL, detects syntax errors, handles MSSQL/Teradata/MySQL → DuckDB dialect translation.
Validation
🎲
Faker
Realistic synthetic data generation. 60+ column name patterns mapped to contextual generators. FK patching ensures referential integrity across tables.
Test Data
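The column-to-generator mapping can be sketched in a few lines. The patterns below are a small illustrative subset of the 60+ taldbt recognizes, and the function name is hypothetical, not taldbt's actual API:

```python
import re

# Illustrative subset: column-name regex -> Faker method name.
COLUMN_PATTERNS = [
    (re.compile(r"email", re.I), "email"),
    (re.compile(r"(first|last)_?name", re.I), "name"),
    (re.compile(r"phone", re.I), "phone_number"),
    (re.compile(r"(city|town)", re.I), "city"),
    (re.compile(r"_(id|key)$", re.I), "random_int"),
]

def generator_for(column: str) -> str:
    """Pick a contextual Faker generator for a column name."""
    for pattern, faker_method in COLUMN_PATTERNS:
        if pattern.search(column):
            return faker_method
    return "word"  # generic fallback for unrecognized columns

print(generator_for("customer_email"))  # email
print(generator_for("ship_city"))       # city
```

In the real pipeline, the chosen Faker method generates values per row, and FK columns are patched afterward to reference existing parent keys.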
📊
networkx
Directed Acyclic Graph construction from Talend's SUBJOB_OK, COMPONENT_OK, and RUN_IF triggers. Topological sort drives dbt build order.
Graphing
📐
Pydantic + lxml
Type-safe AST models with Pydantic. High-performance XML parsing with lxml. Every Talend component mapped to a structured Python object.
Parsing
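As a rough sketch of the XML-to-object step, here is a minimal parser using the stdlib xml.etree and dataclasses as stand-ins for lxml and Pydantic; the .item snippet is heavily simplified compared to real Talend files:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str                           # uniqueName, e.g. tDBInput_1
    kind: str                           # componentName, e.g. tMap
    params: dict = field(default_factory=dict)

# Heavily simplified stand-in for a Talend .item file.
ITEM_XML = """
<talendfile>
  <node componentName="tDBInput" uniqueName="tDBInput_1">
    <elementParameter name="TABLE" value="products"/>
  </node>
  <node componentName="tMap" uniqueName="tMap_1"/>
</talendfile>
"""

def parse_components(xml_text: str) -> list[Component]:
    """Map each <node> element to a structured Python object."""
    root = ET.fromstring(xml_text)
    out = []
    for node in root.iter("node"):
        params = {p.get("name"): p.get("value")
                  for p in node.iter("elementParameter")}
        out.append(Component(name=node.get("uniqueName"),
                             kind=node.get("componentName"),
                             params=params))
    return out

comps = parse_components(ITEM_XML)
print([(c.kind, c.name) for c in comps])
```

The real parser validates types via Pydantic and keeps schemas, connections, and context parameters alongside the components.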
Get Started

Four ways to run taldbt

Choose what works for your environment. All free, all production-ready.

🐳

Docker Image

Full stack: Streamlit + Ollama + Temporal + DuckDB. One command, GPU-accelerated AI.

docker pull souravetl/taldbt:latest
Pull from Docker Hub
🐍

pip install

Lightweight Python package. CLI + AI migration agent. Add [ui] for Streamlit, [all] for everything.

pip install taldbt==0.2.1
View on PyPI
🌐

Live Web App

No install needed. Upload your Talend ZIP, get a dbt project back. Powered by Cerebras AI cloud.

taldbt.streamlit.app
Open Live App
📦

Sample Project

AdventureWorks — 24 jobs, 53 sources, 114 components. Oracle + MySQL + MSSQL. Perfect for testing.

Download ZIP → Upload to taldbt
Download Sample Project
Documentation

User Manual

Everything you need to get started and go to production.

Quick Start
pip install
Docker Setup
Cloud Deploy
LLM Providers
Temporal
FAQ

Quick Start (Docker — Recommended)

  1. Install Docker Desktop
  2. Download docker-compose.yml from the Docker Hub description, then pull and run:
    docker pull souravetl/taldbt:latest
    docker pull ollama/ollama:latest
    docker compose up -d
  3. Pull the AI model:
    docker exec taldbt-ollama ollama pull qwen3-coder:30b
  4. Open http://localhost:8501
  5. Download the sample Talend project to test
  6. Upload ZIP → Review → AutoPilot → Download dbt project

pip install

Install taldbt as a Python package. Lightweight, no Docker needed.

# Core AI migration agent + CLI
pip install taldbt==0.2.1

# With Streamlit web UI
pip install taldbt[ui]==0.2.1

# With Temporal orchestration
pip install taldbt[temporal]==0.2.1

# Everything
pip install taldbt[all]==0.2.1

CLI usage:

# Launch the web UI
taldbt ui

# Discover and analyze a Talend project
taldbt discover ./my_talend_project

# Full migration to dbt
taldbt migrate ./my_talend_project ./dbt_output

# Check version
taldbt version

Requirements: Python 3.10+. Ollama is optional for local AI; without it, taldbt falls back to free cloud AI.

Docker Setup

The Docker stack includes Streamlit, Ollama (GPU), Temporal, and DuckDB — all pre-configured.

docker pull souravetl/taldbt:latest
docker pull ollama/ollama:latest

# Download docker-compose.yml from Docker Hub description
docker compose up -d

# Pull the AI model
docker exec taldbt-ollama ollama pull qwen3-coder:30b

# Open
# App:      http://localhost:8501
# Temporal: http://localhost:8233
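For reference, a compose file along these lines wires the stack together. This is a hypothetical sketch, not the official docker-compose.yml (download that from the Docker Hub description); the service names and app/dashboard ports follow the commands above, and the rest is assumed:

```yaml
# Hypothetical sketch only — use the official docker-compose.yml from
# the Docker Hub description. Names/ports follow the commands above.
services:
  taldbt:
    image: souravetl/taldbt:latest
    ports:
      - "8501:8501"    # Streamlit app
      - "8233:8233"    # Temporal dashboard
    depends_on:
      - taldbt-ollama
  taldbt-ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"  # Ollama API (its default port)
    volumes:
      - ollama-models:/root/.ollama
volumes:
  ollama-models:
```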

No GPU? Use CPU override:

docker compose -f docker-compose.yml -f docker-compose.cpu.yml up -d

Air-gapped? Use build-dist.bat to create an offline package with tar files.

Streamlit Cloud Deployment

The live app at taldbt.streamlit.app runs on Streamlit Cloud with Cerebras AI.

How it works on cloud:

  • Upload your Talend project as a ZIP file (no folder path access)
  • AI translation uses Cerebras Qwen3-235B (free tier, 1M tokens/day)
  • Groq Qwen3-32B auto-fallback if Cerebras rate-limits
  • Temporal runs inline (dbt models in DAG order) — no CLI needed
  • Download the generated dbt project as ZIP

To deploy your own instance: Contact [email protected] for enterprise licensing and private deployment.

LLM Provider Chain

taldbt auto-detects and chains LLM providers with intelligent fallback:

Priority: Ollama (local) → Cerebras → Groq → OpenRouter

How it works:

  • 549 component KB entries handle ~96% of translations deterministically
  • Only ~5% of Java expressions route to the LLM
  • If the primary provider returns 429/503, taldbt automatically falls back to the next one
  • Health checks cached for 5 minutes to avoid probing on every call
  • <think> blocks from reasoning models are stripped automatically
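The fallback-with-cached-health-checks behavior described above can be sketched as follows; the class and method names are illustrative, not taldbt's actual internals:

```python
import time

class ProviderUnavailable(Exception):
    """Raised when a provider answers 429/503 or is unreachable."""

class Provider:
    # Illustrative stand-in for an LLM provider client.
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def translate(self, java_expr):
        if not self.healthy:
            raise ProviderUnavailable(self.name)
        return f"-- SQL from {self.name} for: {java_expr}"

class ProviderChain:
    HEALTH_TTL = 300  # cache health checks for 5 minutes

    def __init__(self, providers):
        self.providers = providers
        self._health_cache = {}  # name -> (timestamp, ok)

    def _is_healthy(self, p):
        cached = self._health_cache.get(p.name)
        if cached and time.monotonic() - cached[0] < self.HEALTH_TTL:
            return cached[1]
        ok = p.healthy  # real code would probe the endpoint here
        self._health_cache[p.name] = (time.monotonic(), ok)
        return ok

    def translate(self, java_expr):
        # Walk the priority chain, skipping unhealthy or failing providers.
        for p in self.providers:
            if not self._is_healthy(p):
                continue
            try:
                return p.translate(java_expr)
            except ProviderUnavailable:
                continue
        raise RuntimeError("all providers unavailable")

chain = ProviderChain([
    Provider("ollama", healthy=False),  # local Ollama not running
    Provider("cerebras"),
    Provider("groq"),
])
print(chain.translate("row1.qty == null ? 0 : row1.qty"))
```

Here the dead local Ollama is skipped and Cerebras answers; a 429 from Cerebras would push the same call on to Groq.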

DuckDB Flock Extension:

The flock extension connects DuckDB to the active LLM provider. This enables llm_complete() inside SQL queries for semantic validation — checking if generated SQL matches the original Talend intent. Auto-registers the active model (local or cloud) on every DuckDB connection.

Free providers:

  • Cerebras — Qwen3-235B, 1,400 tok/s, 1M tokens/day free. Sign up
  • Groq — Qwen3-32B, 535 tok/s, 1,000 req/day free. Sign up

Temporal Workflow Orchestration

taldbt translates Talend's orchestration (tRunJob, tParallelize) into Temporal workflows.

What gets generated:

  • workflows.py — Parent/child workflows mirroring Talend job chains
  • activities.py — run_dbt_model activity for each data job
  • worker.py — Registers workflows + activities on task queue
  • run_workflow.py — Triggers the root workflow

Execution flow:

Master_Job (parent)
  ├─ ParallelJobWorkflow (child)
  │   ├─ run_dbt_model("dimproducts_copy")
  │   └─ run_dbt_model("productvendor_copy")
  ├─ ProductSubJobsWorkflow (child)
  │   ├─ run_dbt_model("dimproductcosthistory_copy1")
  │   └─ run_dbt_model("load_dimprodinventory_copy")
  └─ run_dbt_model("shipmethodmysql_op")

Dashboard: Available at http://localhost:8233 when running Docker or local Temporal CLI. Shows workflow status, history, timing, and state transitions.
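The DAG-aware build order can be sketched with the stdlib graphlib module (standing in here for the networkx graph taldbt builds); the job names come from the flow above, and the exact edge list is illustrative:

```python
from graphlib import TopologicalSorter

# node -> set of predecessors, mirroring SUBJOB_OK-style triggers.
# Job names follow the execution flow above; edges are illustrative.
dag = {
    "ParallelJobWorkflow": {"Master_Job"},
    "ProductSubJobsWorkflow": {"Master_Job"},
    "dimproducts_copy": {"ParallelJobWorkflow"},
    "productvendor_copy": {"ParallelJobWorkflow"},
    "dimproductcosthistory_copy1": {"ProductSubJobsWorkflow"},
    "load_dimprodinventory_copy": {"ProductSubJobsWorkflow"},
    "shipmethodmysql_op": {"Master_Job"},
}

# Topological sort: parents first, leaf dbt models last.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Sibling children with no edge between them (the two parallel workflows, or the two models inside each) can run concurrently, which is exactly what the parent/child Temporal workflows exploit.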

Frequently Asked Questions

What Talend components are supported?

549 component knowledge base entries covering tMap, tDBInput/Output (all databases), tFilterRow, tAggregateRow, tSortRow, tUniqRow, tJavaRow, tReplicate, tRunJob, tParallelize, tFlowToIterate, and more. Custom Java goes through the AI translation pipeline.

What databases does it handle?

48% MSSQL, 37% Teradata, and 14% MySQL, based on our analysis of 1,595 .item files across 8 GitHub repos. All dialect translation to DuckDB goes through sqlglot.

Does it work without a GPU?

Yes. Use cloud AI (Cerebras/Groq — free) or Ollama CPU mode. The knowledge base handles 96% deterministically without any LLM.

Can I use my own LLM?

Yes. Any OpenAI-compatible endpoint works. Set LLM_PROVIDER=custom with your base_url and API key.
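A setup along these lines should work; only LLM_PROVIDER=custom is documented above — the base-URL and API-key variable names below are assumptions, so check the docs for the exact names:

```shell
# LLM_PROVIDER=custom is documented; the other variable names are
# illustrative — confirm the exact names in taldbt's docs.
export LLM_PROVIDER=custom
export LLM_BASE_URL=https://llm.example.com/v1   # any OpenAI-compatible endpoint
export LLM_API_KEY=your-key-here
taldbt migrate ./my_talend_project ./dbt_output
```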

Is my data secure?

All processing happens locally (Docker/desktop) or in your Streamlit Cloud instance. No data leaves your environment. API keys are stored in encrypted Streamlit secrets.