v0.2.1 — AI Migration Agent | Cerebras & Groq cloud

taldbt
AI Migration Agent

Convert legacy Talend ETL to modern dbt SQL in minutes. Semantic AI agent that reads your Talend jobs, understands the logic, and writes production dbt SQL. Full workflow orchestration with Temporal.

🐍 pip install taldbt v0.2.1
🦆 DuckDB + Flock
🔄 dbt-core
Temporal.io
🧠 Qwen3-Coder AI
549 — Component KB entries
96% — Deterministic
100% — dbt Pass Rate
~35s — Full Pipeline
Why taldbt

Not another regex mapper

taldbt is a Gen 3 AI agent. It parses Talend XML into an AST, infers intent, and generates modular CTE-based dbt SQL autonomously.

🧠
Semantic AST
XML → Abstract Syntax Tree → intent inference. Not string replacement.
🤖
AI Translation
Qwen3-Coder translates Java expressions. Knowledge base handles 96% deterministically.
🔍
Smart Validation
Row counts, NULL analysis, key uniqueness, Java artifact detection. Per-model.
Temporal Orchestration
DAG-aware workflow execution. Parallel jobs, retry logic, full history.
🧪
Synthetic Test Data
Faker generates realistic data. FK patching. 60+ column patterns recognized.
🦆
DuckDB + Flock
In-process analytics with LLM-in-SQL via the flock extension. Zero infrastructure.
How It Works

Four steps to modern dbt

From legacy Talend XML to production-ready dbt project with full validation.

01
Load Your Talend Project
Point to your exported Talend project folder or upload a ZIP. taldbt scans every .item file, extracts components, schemas, connections, and context parameters.
  • Folder path or ZIP upload
  • Auto-detects project structure
  • Classifies jobs, joblets, contexts
  • Works with any Talend version
Step 1: Load Talend project
02
Review & Inspect
Full X-ray of your project: jobs with confidence scores, source schemas with column types and keys, orchestration chains, dependency DAG, and dead job detection.
  • 24 jobs, 53 sources, 114 components scanned
  • Talend screenshots with tMap visualization
  • Orchestration chain: Master_Job → SubJobs
  • DAG build order with dead job warnings
03
Configure & Migrate
Choose Quick (dbt models only) or AutoPilot (full pipeline: models + test data + validation + Temporal). Toggle AI assist, skip dead jobs, configure test data rows.
  • AI assist powered by local Ollama or cloud Cerebras
  • Smart dead job detection and skip
  • Synthetic test data with Faker (60+ patterns)
  • AutoPilot runs the entire pipeline hands-free
04
Results & Temporal
20/20 models pass, 100% validation, 55 test tables generated. Download the dbt project ZIP or launch Temporal to orchestrate the workflow with full DAG-aware execution and dashboard.
  • Per-model validation: schema, NULL, key uniqueness, Java artifacts
  • One-click Temporal launch with Master_Job workflow
  • 35-second full pipeline execution
  • Download production-ready dbt project
Technology

Built with the modern data stack

Every component chosen for a reason. No bloat, no unnecessary abstractions.

Pipeline Architecture
lxml Parser
XML → AST
networkx DAG
Dependency Graph
Code Generator
CTE-based SQL
LLM Review
Ollama / Cerebras
DuckDB + dbt
Validate + Run
Temporal.io
DAG Orchestration
Production dbt
Ready to deploy
🦆
DuckDB + Flock Extension
In-process OLAP engine with the flock extension for LLM-in-SQL queries. Enables semantic validation via llm_complete() inside SQL. Auto-registers Ollama or cloud models.
Data Engine
🔄
dbt-core + dbt-duckdb
Industry-standard transformation framework. Generates {{ config() }}, {{ source() }}, {{ ref() }} Jinja tags. Full project scaffolding with profiles.yml.
Transform
Temporal.io
Workflow orchestration that mirrors Talend's job chains. Parent-child workflows, parallel execution, retry policies, full event history dashboard.
Orchestration
🧠
Ollama / Cerebras / Groq
Local Qwen3-Coder:30B via Ollama, or cloud Qwen3-235B via Cerebras (free). Auto-fallback chain with health checks. Only ~5% of expressions need AI.
AI Engine
🔬
sqlglot
Multi-dialect SQL parser and transpiler. Validates generated SQL, detects syntax errors, handles MSSQL/Teradata/MySQL → DuckDB dialect translation.
Validation
🎲
Faker
Realistic synthetic data generation. 60+ column name patterns mapped to contextual generators. FK patching ensures referential integrity across tables.
Test Data
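The column-to-generator mapping can be sketched in a few lines. The patterns below are a small illustrative subset of the 60+ taldbt recognizes, and the function name is hypothetical, not taldbt's actual API:

```python
import re

# Illustrative subset: column-name regex -> Faker method name.
COLUMN_PATTERNS = [
    (re.compile(r"email", re.I), "email"),
    (re.compile(r"(first|last)_?name", re.I), "name"),
    (re.compile(r"phone", re.I), "phone_number"),
    (re.compile(r"(city|town)", re.I), "city"),
    (re.compile(r"_(id|key)$", re.I), "random_int"),
]

def generator_for(column: str) -> str:
    """Pick a contextual Faker generator for a column name."""
    for pattern, faker_method in COLUMN_PATTERNS:
        if pattern.search(column):
            return faker_method
    return "word"  # generic fallback for unrecognized columns

print(generator_for("customer_email"))  # email
print(generator_for("ship_city"))       # city
```

In the real pipeline, the chosen Faker method generates values per row, and FK columns are patched afterward to reference existing parent keys.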
📊
networkx
Directed Acyclic Graph construction from Talend's SUBJOB_OK, COMPONENT_OK, and RUN_IF triggers. Topological sort drives dbt build order.
Graphing
📐
Pydantic + lxml
Type-safe AST models with Pydantic. High-performance XML parsing with lxml. Every Talend component mapped to a structured Python object.
Parsing
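As a rough sketch of the XML-to-object step, here is a minimal parser using the stdlib xml.etree and dataclasses as stand-ins for lxml and Pydantic; the .item snippet is heavily simplified compared to real Talend files:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str                           # uniqueName, e.g. tDBInput_1
    kind: str                           # componentName, e.g. tMap
    params: dict = field(default_factory=dict)

# Heavily simplified stand-in for a Talend .item file.
ITEM_XML = """
<talendfile>
  <node componentName="tDBInput" uniqueName="tDBInput_1">
    <elementParameter name="TABLE" value="products"/>
  </node>
  <node componentName="tMap" uniqueName="tMap_1"/>
</talendfile>
"""

def parse_components(xml_text: str) -> list[Component]:
    """Map each <node> element to a structured Python object."""
    root = ET.fromstring(xml_text)
    out = []
    for node in root.iter("node"):
        params = {p.get("name"): p.get("value")
                  for p in node.iter("elementParameter")}
        out.append(Component(name=node.get("uniqueName"),
                             kind=node.get("componentName"),
                             params=params))
    return out

comps = parse_components(ITEM_XML)
print([(c.kind, c.name) for c in comps])
```

The real parser validates types via Pydantic and keeps schemas, connections, and context parameters alongside the components.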
Get Started

Four ways to run taldbt

Choose what works for your environment. All free, all production-ready.

🐳

Docker Image

Full stack: Streamlit + Ollama + Temporal + DuckDB. One command, GPU-accelerated AI.

docker pull souravetl/taldbt:latest
Pull from Docker Hub
🐍

pip install

Lightweight Python package. CLI + AI migration agent. Add [ui] for Streamlit, [all] for everything.

pip install taldbt==0.2.1
View on PyPI
🌐

Live Web App

No install needed. Upload your Talend ZIP, get a dbt project back. Powered by Cerebras AI cloud.

taldbt.streamlit.app
Open Live App
📦

Sample Project

AdventureWorks — 24 jobs, 53 sources, 114 components. Oracle + MySQL + MSSQL. Perfect for testing.

Download ZIP → Upload to taldbt
Download Sample Project
Documentation

User Manual

Everything you need to get started and go to production.

Quick Start
pip install
Docker Setup
Cloud Deploy
LLM Providers
Temporal
FAQ

Quick Start (Docker — Recommended)

  1. Install Docker Desktop
  2. Download docker-compose.yml from the Docker Hub description, then pull and run:
    docker pull souravetl/taldbt:latest
    docker pull ollama/ollama:latest
    docker compose up -d
  3. Pull the AI model:
    docker exec taldbt-ollama ollama pull qwen3-coder:30b
  4. Open http://localhost:8501
  5. Download the sample Talend project to test
  6. Upload ZIP → Review → AutoPilot → Download dbt project

pip install

Install taldbt as a Python package. Lightweight, no Docker needed.

# Core AI migration agent + CLI
pip install taldbt==0.2.1

# With Streamlit web UI
pip install taldbt[ui]==0.2.1

# With Temporal orchestration
pip install taldbt[temporal]==0.2.1

# Everything
pip install taldbt[all]==0.2.1

CLI usage:

# Launch the web UI
taldbt ui

# Discover and analyze a Talend project
taldbt discover ./my_talend_project

# Full migration to dbt
taldbt migrate ./my_talend_project ./dbt_output

# Check version
taldbt version

Requirements: Python 3.10+. Ollama is optional for local AI; without it, taldbt falls back to free cloud AI.

Docker Setup

The Docker stack includes Streamlit, Ollama (GPU), Temporal, and DuckDB — all pre-configured.

docker pull souravetl/taldbt:latest
docker pull ollama/ollama:latest

# Download docker-compose.yml from Docker Hub description
docker compose up -d

# Pull the AI model
docker exec taldbt-ollama ollama pull qwen3-coder:30b

# Open
# App:      http://localhost:8501
# Temporal: http://localhost:8233
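For reference, a compose file along these lines wires the stack together. This is a hypothetical sketch, not the official docker-compose.yml (download that from the Docker Hub description); the service names and app/dashboard ports follow the commands above, and the rest is assumed:

```yaml
# Hypothetical sketch only — use the official docker-compose.yml from
# the Docker Hub description. Names/ports follow the commands above.
services:
  taldbt:
    image: souravetl/taldbt:latest
    ports:
      - "8501:8501"    # Streamlit app
      - "8233:8233"    # Temporal dashboard
    depends_on:
      - taldbt-ollama
  taldbt-ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"  # Ollama API (its default port)
    volumes:
      - ollama-models:/root/.ollama
volumes:
  ollama-models:
```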

No GPU? Use CPU override:

docker compose -f docker-compose.yml -f docker-compose.cpu.yml up -d

Air-gapped? Use build-dist.bat to create an offline package with tar files.

Streamlit Cloud Deployment

The live app at taldbt.streamlit.app runs on Streamlit Cloud with Cerebras AI.

How it works on cloud:

  • Upload your Talend project as a ZIP file (no folder path access)
  • AI translation uses Cerebras Qwen3-235B (free tier, 1M tokens/day)
  • Groq Qwen3-32B auto-fallback if Cerebras rate-limits
  • Temporal runs inline (dbt models in DAG order) — no CLI needed
  • Download the generated dbt project as ZIP

To deploy your own instance: Contact [email protected] for enterprise licensing and private deployment.

LLM Provider Chain

taldbt auto-detects and chains LLM providers with intelligent fallback:

Priority: Ollama (local) → Cerebras → Groq → OpenRouter

How it works:

  • 549 component KB entries handle ~96% of translations deterministically
  • Only ~5% of Java expressions route to the LLM
  • If the primary provider returns 429/503, taldbt automatically falls back to the next one
  • Health checks cached for 5 minutes to avoid probing on every call
  • <think> blocks from reasoning models are stripped automatically
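The fallback-with-cached-health-checks behavior described above can be sketched as follows; the class and method names are illustrative, not taldbt's actual internals:

```python
import time

class ProviderUnavailable(Exception):
    """Raised when a provider answers 429/503 or is unreachable."""

class Provider:
    # Illustrative stand-in for an LLM provider client.
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def translate(self, java_expr):
        if not self.healthy:
            raise ProviderUnavailable(self.name)
        return f"-- SQL from {self.name} for: {java_expr}"

class ProviderChain:
    HEALTH_TTL = 300  # cache health checks for 5 minutes

    def __init__(self, providers):
        self.providers = providers
        self._health_cache = {}  # name -> (timestamp, ok)

    def _is_healthy(self, p):
        cached = self._health_cache.get(p.name)
        if cached and time.monotonic() - cached[0] < self.HEALTH_TTL:
            return cached[1]
        ok = p.healthy  # real code would probe the endpoint here
        self._health_cache[p.name] = (time.monotonic(), ok)
        return ok

    def translate(self, java_expr):
        # Walk the priority chain, skipping unhealthy or failing providers.
        for p in self.providers:
            if not self._is_healthy(p):
                continue
            try:
                return p.translate(java_expr)
            except ProviderUnavailable:
                continue
        raise RuntimeError("all providers unavailable")

chain = ProviderChain([
    Provider("ollama", healthy=False),  # local Ollama not running
    Provider("cerebras"),
    Provider("groq"),
])
print(chain.translate("row1.qty == null ? 0 : row1.qty"))
```

Here the dead local Ollama is skipped and Cerebras answers; a 429 from Cerebras would push the same call on to Groq.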

DuckDB Flock Extension:

The flock extension connects DuckDB to the active LLM provider. This enables llm_complete() inside SQL queries for semantic validation — checking if generated SQL matches the original Talend intent. Auto-registers the active model (local or cloud) on every DuckDB connection.

Free providers:

  • Cerebras — Qwen3-235B, 1,400 tok/s, 1M tokens/day free. Sign up
  • Groq — Qwen3-32B, 535 tok/s, 1,000 req/day free. Sign up

Temporal Workflow Orchestration

taldbt translates Talend's orchestration (tRunJob, tParallelize) into Temporal workflows.

What gets generated:

  • workflows.py — Parent/child workflows mirroring Talend job chains
  • activities.py — run_dbt_model activity for each data job
  • worker.py — Registers workflows + activities on task queue
  • run_workflow.py — Triggers the root workflow

Execution flow:

Master_Job (parent)
  ├─ ParallelJobWorkflow (child)
  │   ├─ run_dbt_model("dimproducts_copy")
  │   └─ run_dbt_model("productvendor_copy")
  ├─ ProductSubJobsWorkflow (child)
  │   ├─ run_dbt_model("dimproductcosthistory_copy1")
  │   └─ run_dbt_model("load_dimprodinventory_copy")
  └─ run_dbt_model("shipmethodmysql_op")

Dashboard: Available at http://localhost:8233 when running Docker or local Temporal CLI. Shows workflow status, history, timing, and state transitions.
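The DAG-aware build order can be sketched with the stdlib graphlib module (standing in here for the networkx graph taldbt builds); the job names come from the flow above, and the exact edge list is illustrative:

```python
from graphlib import TopologicalSorter

# node -> set of predecessors, mirroring SUBJOB_OK-style triggers.
# Job names follow the execution flow above; edges are illustrative.
dag = {
    "ParallelJobWorkflow": {"Master_Job"},
    "ProductSubJobsWorkflow": {"Master_Job"},
    "dimproducts_copy": {"ParallelJobWorkflow"},
    "productvendor_copy": {"ParallelJobWorkflow"},
    "dimproductcosthistory_copy1": {"ProductSubJobsWorkflow"},
    "load_dimprodinventory_copy": {"ProductSubJobsWorkflow"},
    "shipmethodmysql_op": {"Master_Job"},
}

# Topological sort: parents first, leaf dbt models last.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Sibling children with no edge between them (the two parallel workflows, or the two models inside each) can run concurrently, which is exactly what the parent/child Temporal workflows exploit.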

Frequently Asked Questions

What Talend components are supported?

549 component knowledge base entries covering tMap, tDBInput/Output (all databases), tFilterRow, tAggregateRow, tSortRow, tUniqRow, tJavaRow, tReplicate, tRunJob, tParallelize, tFlowToIterate, and more. Custom Java goes through the AI translation pipeline.

What databases does it handle?

48% MSSQL, 37% Teradata, and 14% MySQL, based on our analysis of 1,595 .item files across 8 GitHub repos. All dialect translation to DuckDB goes through sqlglot.

Does it work without a GPU?

Yes. Use cloud AI (Cerebras/Groq — free) or Ollama CPU mode. The knowledge base handles 96% deterministically without any LLM.

Can I use my own LLM?

Yes. Any OpenAI-compatible endpoint works. Set LLM_PROVIDER=custom with your base_url and API key.
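A setup along these lines should work; only LLM_PROVIDER=custom is documented above — the base-URL and API-key variable names below are assumptions, so check the docs for the exact names:

```shell
# LLM_PROVIDER=custom is documented; the other variable names are
# illustrative — confirm the exact names in taldbt's docs.
export LLM_PROVIDER=custom
export LLM_BASE_URL=https://llm.example.com/v1   # any OpenAI-compatible endpoint
export LLM_API_KEY=your-key-here
taldbt migrate ./my_talend_project ./dbt_output
```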

Is my data secure?

All processing happens locally (Docker/desktop) or in your Streamlit Cloud instance. No data leaves your environment. API keys are stored in encrypted Streamlit secrets.