Enterprise-grade ETL pipeline orchestration for Kubernetes with seamless deduplication and joins
Quick Start • Documentation • Architecture • Contributing
The GlassFlow ETL Kubernetes Operator is a production-ready Kubernetes operator that enables scalable, cloud-native data pipeline deployments. Built as a companion to the GlassFlow ClickHouse ETL project, it provides enterprise-grade data processing capabilities with advanced features like deduplication, temporal joins, and seamless pause/resume functionality.
```mermaid
graph LR
    KAFKA[Kafka Cluster]

    subgraph "Kubernetes Cluster"
        subgraph "GlassFlow ETL"
            subgraph "Operator"
                OP[Operator Controller]
                CRD[Pipeline CRD]
            end
            subgraph "Data Pipeline"
                ING[Ingestor Pods]
                JOIN[Join Pod]
                SINK[Sink Pod]
            end
            subgraph NATS_JETSTREAM["NATS JetStream"]
                NATS[NATS]
                DLQ[DLQ]
            end
        end
    end

    CH[ClickHouse]

    subgraph "External"
        API[GlassFlow API]
        UI[Web UI]
    end

    API --> CRD
    CRD --> OP
    OP --> ING
    OP --> JOIN
    OP --> SINK
    ING <--> NATS_JETSTREAM
    JOIN <--> NATS_JETSTREAM
    SINK <--> NATS_JETSTREAM
    KAFKA --> ING
    SINK --> CH
    UI --> API
```
Deploy using the complete GlassFlow ETL stack from the GlassFlow Charts repository:
```bash
# Add the GlassFlow Helm repository
helm repo add glassflow https://glassflow.github.io/charts
helm repo update

# Install the complete GlassFlow ETL stack
helm install glassflow-etl glassflow/glassflow-etl
```
Deploy just the operator as a dependency:
```bash
# Install the operator chart
helm install glassflow-operator glassflow/glassflow-operator
```
The operator includes automatic cleanup functionality that ensures all pipelines are immediately terminated when uninstalling:
```bash
helm uninstall glassflow-operator
```
This terminates all running pipelines and removes their Pipeline custom resources before the operator itself is deleted, so no orphaned workloads are left behind.
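One common way to implement this kind of uninstall-time cleanup is a Helm pre-delete hook Job. The manifest below is a hypothetical sketch of that pattern, not the chart's actual hook; the Job name, service account, and image are assumptions:

```yaml
# Hypothetical sketch of a Helm pre-delete cleanup hook; the actual
# manifest shipped in the glassflow-operator chart may differ.
apiVersion: batch/v1
kind: Job
metadata:
  name: pipeline-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      # Assumed service account; it needs delete permissions on Pipeline resources.
      serviceAccountName: glassflow-operator
      restartPolicy: Never
      containers:
        - name: cleanup
          image: bitnami/kubectl:latest
          # Delete all Pipeline custom resources so the operator can
          # finalize them before it is removed.
          command: ["kubectl", "delete", "pipelines.etl.glassflow.io", "--all", "--all-namespaces"]
```

Because the hook runs before Helm deletes the operator's Deployment, the controller is still alive to process the pipeline deletions and run its finalizers.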
Alternatively, deploy the operator from source:

```bash
# Clone the repository
git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
cd glassflow-etl-k8s-operator

# Install the CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/glassflow/glassflow-etl-k8s-operator:latest
```
Create pipelines using the GlassFlow ClickHouse ETL backend API. The operator will automatically create the corresponding Pipeline CRDs. Here's an example of what the generated CRD will look like:
```yaml
apiVersion: etl.glassflow.io/v1alpha1
kind: Pipeline
metadata:
  name: user-events-pipeline
spec:
  pipeline_id: "user-events-v1"
  config: "pipeline-config"
  dlq: "dead-letter-queue"
  sources:
    type: kafka
    topics:
      - topic_name: "user-events"
        stream: "users"
        dedup_window: 60000000000 # 1 minute in nanoseconds
  join:
    type: "temporal"
    stream: "joined-users"
    enabled: true
  sink: "clickhouse"
```
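The `dedup_window` field is expressed in nanoseconds, which is easy to get wrong by a factor of 1000. A small helper (an illustrative sketch, not part of the operator) makes the conversion explicit:

```python
from datetime import timedelta

def to_nanoseconds(window: timedelta) -> int:
    """Convert a human-readable duration to the integer nanoseconds
    expected by the dedup_window field."""
    return int(window.total_seconds() * 1_000_000_000)

# 1 minute -> 60000000000 ns, matching the example CRD above
print(to_nanoseconds(timedelta(minutes=1)))  # 60000000000
```

Generating the value this way avoids hand-counting zeros when tuning the deduplication window.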
| Feature | Status | Description |
|---|---|---|
| Pipeline Creation | ✅ | Deploy new ETL pipelines via CRD |
| Pipeline Termination | ✅ | Graceful shutdown and cleanup |
| Pipeline Pausing | ✅ | Temporarily halt data processing |
| Pipeline Resuming | ✅ | Resume paused pipelines |
| Deduplication | ✅ | Configurable time-window deduplication |
| Stream Joins | ✅ | Multi-stream data joining |
| Auto-scaling | ✅ | Horizontal pod autoscaling / ingestor replicas support |
| Monitoring | ✅ | Prometheus metrics integration |
| Helm Uninstall Cleanup | ✅ | Automatic pipeline termination and CRD cleanup on uninstall |
The operator manages pipeline lifecycle through a comprehensive state machine that ensures reliable and predictable pipeline operations.
| Status | Type | Description |
|---|---|---|
| Created | Core | Pipeline created and ready to start |
| Running | Core | Pipeline is actively processing data |
| Stopped | Core | Pipeline has been stopped |
| Resuming | Transition | Pipeline is being resumed (temporary) |
| Stopping | Transition | Pipeline is being stopped (temporary) |
| Terminating | Transition | Pipeline is being terminated (temporary) |
```mermaid
stateDiagram
    direction LR
    [*] --> Created
    Created --> Running
    Running --> Stopping
    Stopping --> Stopped
    Stopped --> Resuming
    Terminating --> Stopped
    Resuming --> Running
    Any --> Terminating
```
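The transitions in the diagram can be summarized as a small lookup table. The following is an illustrative sketch of that table (the operator's real reconciliation logic lives in `internal/controller`); `Terminating` is reachable from every state:

```python
# Sketch of the pipeline state machine described above; not the
# operator's actual implementation, which lives in internal/controller.
VALID_TRANSITIONS = {
    "Created":     {"Running", "Terminating"},
    "Running":     {"Stopping", "Terminating"},
    "Stopping":    {"Stopped", "Terminating"},
    "Stopped":     {"Resuming", "Terminating"},
    "Resuming":    {"Running", "Terminating"},
    "Terminating": {"Stopped"},  # termination settles in Stopped
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the state machine allows current -> target."""
    return target in VALID_TRANSITIONS.get(current, set())

assert can_transition("Running", "Stopping")
assert not can_transition("Stopped", "Running")  # must pass through Resuming
```

Encoding the diagram as data like this makes it straightforward to reject invalid status updates before the controller acts on them.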
```bash
git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
cd glassflow-etl-k8s-operator

make help  # See all available targets

# Install development tools
make controller-gen
make kustomize
make golangci-lint
```
```bash
# Start NATS with JetStream (must run inside the cluster)
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm install nats nats/nats --set nats.jetstream.enabled=true

# Start Kafka (using Helm)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kafka bitnami/kafka

# Start ClickHouse (using Helm)
helm install clickhouse bitnami/clickhouse
# Or use external Kafka/ClickHouse for development

# Run the operator locally (requires NATS running inside the cluster)
make run
```
This project was built using Kubebuilder v4 and follows Kubernetes operator best practices:
```
├── api/v1alpha1/          # CRD definitions
├── internal/controller/   # Operator controller logic
├── internal/nats/         # NATS client integration
├── charts/                # Helm charts
├── config/                # Kustomize configurations
└── test/                  # Unit and e2e tests
```
```bash
# Run e2e tests (requires a Kind cluster) - primary testing method
make test-e2e

# Run unit tests (coverage being improved)
make test

# Run the linter
make lint
```
| Chart | Purpose | Components | Use Case |
|---|---|---|---|
| glassflow-etl | Complete ETL Platform | UI, API, Operator, NATS | Full-featured deployment |
| glassflow-operator | Operator Only | Operator, CRDs | Dependency for custom setups |
The glassflow-etl chart includes the complete platform with web UI, backend API, NATS, and the operator as dependencies. The glassflow-operator chart is designed as a dependency for the main chart or custom deployments.
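To pull the operator chart into a custom deployment, declare it as a dependency in your umbrella chart's `Chart.yaml`. A minimal sketch, where the chart name and version constraint are placeholders you should replace:

```yaml
# Chart.yaml of a hypothetical custom umbrella chart
apiVersion: v2
name: my-etl-platform        # placeholder name
version: 0.1.0
dependencies:
  - name: glassflow-operator
    version: ">=0.0.0"       # placeholder; pin the release you have tested
    repository: https://glassflow.github.io/charts
```

Run `helm dependency update` after editing the file so Helm fetches the operator chart into `charts/`.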
We welcome contributions!
Please run `make test` and `make lint` before submitting changes.

This project is licensed under the Apache License 2.0. See the clickhouse-etl LICENSE file for details.
Built by GlassFlow Team
Website • Documentation • GitHub