WaferGuard ML - Project Blueprint

Team: Milos, Hernan, Rolando, Oliver
Duration: 10-12 weeks (5-6 sprints)
Sprint Length: 2 weeks


System Architecture

flowchart TB
    subgraph Data["Data Ingestion Layer"]
        SECOM[(SECOM Sensor Data)]
        WM811K[(WM811K Wafer Images)]
        VAL[Data Validation]
    end

    subgraph ML["Machine Learning Pipeline"]
        PRE[Preprocessing]
        FE[Feature Engineering]
        CNN[CNN Model]
        AE[Autoencoder]
        ENS[Ensemble Model]
    end

    subgraph INF["Inference & Alert System"]
        API[FastAPI Endpoint]
        CONF[Confidence Scoring]
        ALERT[Alert Generator]
    end

    subgraph VIZ["Visualization Dashboard"]
        STREAM[Streamlit Dashboard]
        KPI[KPI Metrics]
        GCAM[GradCAM Views]
        HIST[Historical Trends]
    end

    subgraph MES["MES Integration"]
        SIM[MES Simulator]
        HOOK[Integration Hooks]
    end

    SECOM --> VAL
    WM811K --> VAL
    VAL --> PRE
    PRE --> FE
    FE --> CNN
    FE --> AE
    CNN --> ENS
    AE --> ENS
    ENS --> API
    API --> CONF
    CONF --> ALERT
    API --> STREAM
    ALERT --> STREAM
    STREAM --> KPI
    STREAM --> GCAM
    STREAM --> HIST
    API --> HOOK
    SIM --> HOOK


Data Flow

flowchart LR
    A[Raw Data] --> B[Preprocessing]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Validation]
    E --> F[Deployment]
    F --> G[Real-time Inference]
    G --> H[Dashboard]
    H --> I[Alerts]

    style A fill:#e1f5fe
    style D fill:#fff3e0
    style H fill:#e8f5e9
    style I fill:#ffebee


Epics Overview

Why This Structure?

Your original epics were solid. I recommend:
- Split KAN-1 into Data Pipeline (new KAN-1) and ML Model (KAN-2)
- Renumber KAN-2 (MES Integration) to KAN-3
- Renumber KAN-3 (Analytics + Dashboard) to KAN-4
- Renumber KAN-4 (Governance) to KAN-5

This gives you a clear dependency chain: Data → Model → Integration → Dashboard → Documentation

Epic Dependency Flow

flowchart LR
    KAN1[KAN-1<br/>Data Pipeline]
    KAN2[KAN-2<br/>ML Model]
    KAN3[KAN-3<br/>MES Integration]
    KAN4[KAN-4<br/>Dashboard]
    KAN5[KAN-5<br/>Governance]

    KAN1 --> KAN2
    KAN2 --> KAN3
    KAN2 --> KAN4
    KAN3 --> KAN4
    KAN4 --> KAN5
    KAN3 --> KAN5

    style KAN1 fill:#bbdefb,color:#000000
    style KAN2 fill:#c8e6c9,color:#000000
    style KAN3 fill:#fff9c4,color:#000000
    style KAN4 fill:#ffccbc,color:#000000
    style KAN5 fill:#e1bee7,color:#000000


Sprint Timeline

gantt
    title WaferGuard ML - Sprint Timeline
    dateFormat  YYYY-MM-DD
    section KAN-1 Data Pipeline
        Sprint 1 - Foundation     :s1, 2026-01-27, 14d
        Sprint 2 - Data Pipeline  :s2, after s1, 14d
    section KAN-2 ML Model
        Sprint 3 - Model Dev      :s3, after s2, 14d
        Sprint 4 - Refinement     :s4, after s3, 14d
    section KAN-3 & KAN-4
        Sprint 4 - API            :s4b, after s3, 14d
        Sprint 5 - Dashboard      :s5, after s4, 14d
    section KAN-5 Governance
        Sprint 6 - Documentation  :s6, after s5, 14d
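
The calendar implied by the chart can be worked out with a short script. This is a minimal sketch that assumes sprints run back to back on calendar days from the 2026-01-27 start date, with no holidays or gaps; the function name `sprint_windows` is illustrative.

```python
from datetime import date, timedelta

SPRINT_LENGTH_DAYS = 14                 # each sprint in the chart is 14d
FIRST_SPRINT_START = date(2026, 1, 27)  # Sprint 1 start date in the chart


def sprint_windows(n_sprints: int) -> list[tuple[int, date, date]]:
    """Return (sprint number, first day, last day) for back-to-back sprints."""
    windows = []
    start = FIRST_SPRINT_START
    for number in range(1, n_sprints + 1):
        # The last day is inclusive, so a 14-day sprint ends 13 days after it starts.
        end = start + timedelta(days=SPRINT_LENGTH_DAYS - 1)
        windows.append((number, start, end))
        start = end + timedelta(days=1)  # the next sprint starts the following day
    return windows


for number, start, end in sprint_windows(6):
    print(f"Sprint {number}: {start.isoformat()} to {end.isoformat()}")
```

Under these assumptions Sprint 1 ends on 2026-02-09 and Sprint 6 ends on 2026-04-20. The two Sprint 4 bars (Refinement and API) share the same window.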


Sprint Breakdown

Sprint 1: Foundation (Week 1-2)

Epic: KAN-1 (Data Pipeline & Foundation) - 80% Done

| Task ID | Task | Owner | Status |
|---|---|---|---|
| KAN-8 | Set up Python Environment | Oliver | DONE ✅ |
| KAN-9 | Initialize Git Repo With Branching Strategy + Rules | Oliver | DONE ✅ |
| KAN-10 | Download SECOM Dataset + Verify Integrity | Oliver | DONE ✅ |
| KAN-11 | Download WM811K Image Data + Verify | Oliver | DONE ✅ |
| KAN-12 | Run EDA on SECOM Sensor Data | Rolando | IN REVIEW 🔄 |
| KAN-13 | Run EDA on WM811K Image Data | Oliver | DONE ✅ |
| KAN-14 | Data Quality Assessment Report | Unassigned | TO DO 📋 |
| KAN-15 | Set Up Jira Board with epics/stories | Rolando | DONE ✅ |
| KAN-46 | Research how the pixels are generated from an actual image | Milos | DONE ✅ |
| KAN-47 | Create slides for initial image EDA findings | Oliver | DONE ✅ |

Sprint 1 Goal: Development environment ready, datasets loaded, initial EDA complete


Sprint 2: Data Pipeline (Week 3-4)

Epic: KAN-1 (Data Pipeline & Foundation)

| Task ID | Task | Owner | Story Points | Status |
|---|---|---|---|---|
| KAN-1-010 | SECOM preprocessing pipeline (missing values, outliers) | TBD | 5 | TODO |
| KAN-1-011 | Feature engineering for sensor data | TBD | 5 | TODO |
| KAN-1-012 | WM811K image preprocessing (resize, normalize) | TBD | 3 | TODO |
| KAN-1-013 | Image augmentation strategy (rotation, flip) | TBD | 3 | TODO |
| KAN-1-014 | Train/validation/test split implementation | TBD | 2 | TODO |
| KAN-1-015 | Data loader classes (PyTorch/TensorFlow) | TBD | 5 | TODO |
| KAN-1-016 | Handle class imbalance (SMOTE/class weights) | TBD | 3 | TODO |

Sprint 2 Goal: Clean, processed datasets ready for model training
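
For the class-weights option in KAN-1-016, one common heuristic weights each class by `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for its "balanced" class weights. The sketch below is one possible approach, not a decided design; the label vector is made up and the function name is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 0 = pass, 1 = fail (not the real SECOM distribution).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)  # the rare class receives the larger weight
```

The resulting dictionary could then be passed to a weighted loss function, so that errors on the rare failure class cost more during training.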


Sprint 3: ML Model Development (Week 5-6)

Epic: KAN-2 (Anomaly Detection Model)

| Task ID | Task | Owner | Story Points | Status |
|---|---|---|---|---|
| KAN-2-001 | CNN architecture design for wafer classification | TBD | 5 | TODO |
| KAN-2-002 | Implement CNN model (PyTorch/TensorFlow) | TBD | 5 | TODO |
| KAN-2-003 | Autoencoder architecture for sensor anomaly | TBD | 5 | TODO |
| KAN-2-004 | Implement Autoencoder model | TBD | 5 | TODO |
| KAN-2-005 | Training pipeline with early stopping | TBD | 3 | TODO |
| KAN-2-006 | Hyperparameter tuning setup | TBD | 3 | TODO |
| KAN-2-007 | Baseline model benchmarking | TBD | 3 | TODO |

Sprint 3 Goal: Working CNN and Autoencoder models with baseline metrics
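
The early-stopping rule in KAN-2-005 can be sketched independently of the deep learning framework. This is an illustrative helper, assuming the criterion is "stop when validation loss has not improved for `patience` consecutive epochs"; the class name, parameters, and loss values are all hypothetical.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training loop.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real pipeline the model weights from `best_epoch` would be checkpointed and restored after stopping.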


Sprint 4: Model Refinement + API (Week 7-8)

Epics: KAN-2 (Model), KAN-3 (MES Integration)

| Task ID | Task | Owner | Story Points | Status |
|---|---|---|---|---|
| KAN-2-008 | Model validation and cross-validation | TBD | 3 | TODO |
| KAN-2-009 | Ensemble model (CNN + Autoencoder) | TBD | 5 | TODO |
| KAN-2-010 | GradCAM implementation for explainability | TBD | 5 | TODO |
| KAN-3-001 | FastAPI inference endpoint design | TBD | 3 | TODO |
| KAN-3-002 | Implement /predict endpoint | TBD | 5 | TODO |
| KAN-3-003 | Confidence scoring and threshold logic | TBD | 3 | TODO |
| KAN-3-004 | MES simulation mock data generator | TBD | 3 | TODO |
| KAN-3-005 | Alert generation logic | TBD | 3 | TODO |

Sprint 4 Goal: Optimized models, working inference API


Sprint 5: Dashboard Development (Week 9-10)

Epic: KAN-4 (Analytics + Dashboard)

| Task ID | Task | Owner | Story Points | Status |
|---|---|---|---|---|
| KAN-4-001 | Dashboard wireframes in Figma | TBD | 3 | TODO |
| KAN-4-002 | Streamlit app skeleton | TBD | 2 | TODO |
| KAN-4-003 | Real-time anomaly detection view | TBD | 5 | TODO |
| KAN-4-004 | Historical trend analysis charts | TBD | 5 | TODO |
| KAN-4-005 | Model confidence visualization | TBD | 3 | TODO |
| KAN-4-006 | GradCAM attention map display | TBD | 5 | TODO |
| KAN-4-007 | Production KPIs (OEE, defect rate) | TBD | 3 | TODO |
| KAN-4-008 | Alert notification panel | TBD | 3 | TODO |
| KAN-4-009 | Interactive filtering and drill-down | TBD | 5 | TODO |

Sprint 5 Goal: Functional dashboard with all visualizations
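
The two production KPIs named in KAN-4-007 follow standard definitions: OEE is the product of availability, performance, and quality, and defect rate is defective units divided by units inspected. The sketch below only illustrates the arithmetic; the function names and input values are hypothetical, and how each factor would be derived from MES data is left open.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """Overall Equipment Effectiveness: product of three factors, each in [0, 1]."""
    factors = {"availability": availability, "performance": performance, "quality": quality}
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * performance * quality


def defect_rate(defective: int, inspected: int) -> float:
    """Fraction of inspected wafers flagged as defective."""
    if inspected <= 0:
        raise ValueError("inspected must be positive")
    return defective / inspected


# Made-up numbers for illustration only.
print(f"OEE: {oee(0.90, 0.95, 0.98):.1%}")
print(f"Defect rate: {defect_rate(12, 400):.1%}")
```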


Sprint 6: Evaluation & Documentation (Week 11-12)

Epic: KAN-5 (Governance & Auditability)

| Task ID | Task | Owner | Story Points | Status |
|---|---|---|---|---|
| KAN-5-001 | Comprehensive model evaluation report | TBD | 5 | TODO |
| KAN-5-002 | Comparison vs baseline (SPC, manual) | TBD | 5 | TODO |
| KAN-5-003 | Model versioning documentation | TBD | 3 | TODO |
| KAN-5-004 | API documentation (OpenAPI/Swagger) | TBD | 3 | TODO |
| KAN-5-005 | User guide for dashboard | TBD | 3 | TODO |
| KAN-5-006 | Technical architecture documentation | TBD | 3 | TODO |
| KAN-5-007 | Final presentation slides | TBD | 5 | TODO |
| KAN-5-008 | Live demo preparation | TBD | 3 | TODO |
| KAN-5-009 | Code cleanup and refactoring | TBD | 3 | TODO |

Sprint 6 Goal: Complete documentation, presentation ready


Work Stream Assignments

| Work Stream | Specialist | Assistant | Primary Epics |
|---|---|---|---|
| Data Engineering & ML Pipeline | TBD | TBD | KAN-1 |
| Model Development & Training | TBD | TBD | KAN-2 |
| Dashboard & Integration | TBD | TBD | KAN-3, KAN-4 |
| Evaluation & Documentation | TBD | TBD | KAN-5 |

Team Structure

flowchart TB
    subgraph Team["WaferGuard ML Team"]
        subgraph WS1["Data Engineering"]
            DE1[Specialist: TBD]
            DE2[Assistant: TBD]
        end
        subgraph WS2["Model Development"]
            MD1[Specialist: TBD]
            MD2[Assistant: TBD]
        end
        subgraph WS3["Dashboard & Integration"]
            DI1[Specialist: TBD]
            DI2[Assistant: TBD]
        end
        subgraph WS4["Evaluation & Docs"]
            ED1[Specialist: TBD]
            ED2[Assistant: TBD]
        end
    end

    WS1 -->|Data| WS2
    WS2 -->|Models| WS3
    WS3 -->|System| WS4

    style WS1 fill:#bbdefb
    style WS2 fill:#c8e6c9
    style WS3 fill:#fff9c4
    style WS4 fill:#e1bee7


ML Model Architecture

flowchart TB
    subgraph Input["Input Data"]
        IMG[Wafer Images<br/>WM811K]
        SEN[Sensor Data<br/>SECOM]
    end

    subgraph CNN["CNN Pipeline"]
        C1[Conv2D + ReLU]
        C2[MaxPool]
        C3[Conv2D + ReLU]
        C4[MaxPool]
        C5[Flatten]
        C6[Dense + Softmax]
    end

    subgraph AE["Autoencoder Pipeline"]
        E1[Encoder]
        E2[Latent Space]
        E3[Decoder]
        E4[Reconstruction Error]
    end

    subgraph Ensemble["Ensemble Decision"]
        COMB[Score Combination]
        THRESH[Threshold Logic]
        OUT[Final Prediction]
    end

    IMG --> C1 --> C2 --> C3 --> C4 --> C5 --> C6
    SEN --> E1 --> E2 --> E3 --> E4
    C6 --> COMB
    E4 --> COMB
    COMB --> THRESH --> OUT

    style CNN fill:#e3f2fd
    style AE fill:#f3e5f5
    style Ensemble fill:#e8f5e9
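
The "Score Combination" and "Threshold Logic" boxes are not specified further in this blueprint. One possible reading is sketched below: squash the autoencoder's reconstruction error into a 0-1 score, average it with the CNN's defect probability, and compare against a threshold. The weighting, the squashing function, the threshold, and all names are assumptions. The sketch also assumes each wafer has both an image and a sensor record, which the two datasets do not guarantee.

```python
def combine_scores(cnn_defect_prob: float, recon_error: float,
                   recon_error_scale: float = 1.0, cnn_weight: float = 0.5) -> float:
    """Weighted average of CNN defect probability and a squashed reconstruction error."""
    if recon_error < 0 or recon_error_scale <= 0:
        raise ValueError("recon_error must be >= 0 and recon_error_scale > 0")
    # Map the unbounded error into [0, 1): 0 error -> 0, error == scale -> 0.5.
    ae_score = recon_error / (recon_error + recon_error_scale)
    return cnn_weight * cnn_defect_prob + (1.0 - cnn_weight) * ae_score


def final_prediction(score: float, threshold: float = 0.5) -> str:
    """Threshold logic: flag the wafer when the combined score reaches the threshold."""
    return "anomaly" if score >= threshold else "normal"


# Made-up model outputs for two wafers.
for cnn_prob, err in [(0.80, 3.0), (0.10, 0.25)]:
    score = combine_scores(cnn_prob, err)
    print(f"cnn={cnn_prob:.2f} error={err:.2f} -> score={score:.3f} {final_prediction(score)}")
```

In practice the weight and threshold would likely be tuned on the validation set against the false positive target.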


Key Datasets

| Dataset | Description | Size | Source |
|---|---|---|---|
| SECOM | Sensor data (590 features) | ~1,500 samples | UCI ML Repository |
| WM811K | Wafer map images | ~38,000 images | Kaggle |

Data Pipeline Overview

flowchart LR
    subgraph Sources["Data Sources"]
        S1[(SECOM<br/>UCI Repository)]
        S2[(WM811K<br/>Kaggle)]
    end

    subgraph Processing["Processing"]
        P1[Missing Value<br/>Imputation]
        P2[Outlier<br/>Detection]
        P3[Normalization]
        P4[Image Resize<br/>& Augment]
    end

    subgraph Output["Processed Data"]
        O1[Train Set<br/>70%]
        O2[Val Set<br/>15%]
        O3[Test Set<br/>15%]
    end

    S1 --> P1 --> P2 --> P3 --> O1 & O2 & O3
    S2 --> P4 --> O1 & O2 & O3

    style Sources fill:#e3f2fd
    style Processing fill:#fff8e1
    style Output fill:#e8f5e9
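
The 70/15/15 split at the end of the pipeline can be sketched with the standard library. This is a plain shuffled split under a fixed seed; the function name and seed are illustrative. Given the class imbalance noted in Sprint 2, a stratified split may be the better choice in practice.

```python
import random


def split_indices(n_samples: int, train_frac: float = 0.70,
                  val_frac: float = 0.15, seed: int = 42):
    """Shuffle sample indices and cut them into train/val/test partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Splitting indices rather than the data itself keeps the same partition usable for both the sensor table and the image files.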


Success Criteria

| Metric | Target |
|---|---|
| CNN Accuracy | >90% |
| False Positive Rate | <5% |
| Inference Latency | <500ms |
| Dashboard Refresh | <2s |
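
These targets could be encoded as an automated check for the Sprint 6 evaluation report. The sketch below is one way to do that; the metric keys, units, and measured values are hypothetical.

```python
# Targets from the table: (comparison, threshold). Rates are expressed as fractions.
TARGETS = {
    "cnn_accuracy": (">", 0.90),
    "false_positive_rate": ("<", 0.05),
    "inference_latency_ms": ("<", 500),
    "dashboard_refresh_s": ("<", 2),
}


def check_targets(measured: dict) -> dict:
    """Return {metric: True/False} indicating whether each target is met."""
    results = {}
    for name, (op, target) in TARGETS.items():
        value = measured[name]
        results[name] = value > target if op == ">" else value < target
    return results


# Made-up measurements for illustration only.
measured = {
    "cnn_accuracy": 0.93,
    "false_positive_rate": 0.07,
    "inference_latency_ms": 320,
    "dashboard_refresh_s": 1.4,
}
for name, passed in check_targets(measured).items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```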

Story Points by Epic

pie showData
    title Story Points Distribution
    "KAN-1 Data Pipeline" : 47
    "KAN-2 ML Model" : 42
    "KAN-3 MES Integration" : 17
    "KAN-4 Dashboard" : 34
    "KAN-5 Governance" : 33


Inference Flow

sequenceDiagram
    participant MES as MES System
    participant API as FastAPI
    participant Model as ML Model
    participant DB as MLflow
    participant Dash as Dashboard
    participant Eng as Process Engineer

    MES->>API: POST /predict (wafer data)
    API->>Model: Load model from registry
    Model->>DB: Get latest model version
    DB-->>Model: Return model weights
    Model->>Model: Run inference
    Model-->>API: Prediction + confidence
    API->>API: Apply threshold logic

    alt Anomaly Detected
        API->>Dash: Send alert
        Dash->>Eng: Display notification
        API-->>MES: Response (anomaly: true)
    else Normal
        API-->>MES: Response (anomaly: false)
    end

    API->>Dash: Update real-time view
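
The request path in the sequence diagram can be traced with plain Python stubs before any FastAPI code exists. This is a rough sketch only: the model is a placeholder callable, the MLflow registry lookup is omitted, and `DashboardStub`, `handle_predict`, the payload fields, and the 0.5 threshold are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DashboardStub:
    """Stand-in for the dashboard; records what the API sends to it."""
    alerts: list = field(default_factory=list)
    updates: list = field(default_factory=list)


def handle_predict(wafer_data: dict, model, dashboard: DashboardStub,
                   threshold: float = 0.5) -> dict:
    """Mirror the sequence diagram for one POST /predict request."""
    score = model(wafer_data)                 # Model: run inference
    is_anomaly = score >= threshold           # API: apply threshold logic
    if is_anomaly:                            # alt branch: send alert to the dashboard
        dashboard.alerts.append({"wafer_id": wafer_data["wafer_id"], "score": score})
    dashboard.updates.append(wafer_data["wafer_id"])     # update real-time view
    return {"anomaly": is_anomaly, "confidence": score}  # response back to the MES


def fake_model(wafer_data: dict) -> float:
    """Placeholder model: returns a score carried in the request itself."""
    return wafer_data["score"]


dash = DashboardStub()
print(handle_predict({"wafer_id": "W-001", "score": 0.91}, fake_model, dash))
print(handle_predict({"wafer_id": "W-002", "score": 0.12}, fake_model, dash))
print(len(dash.alerts), len(dash.updates))
```

Both requests update the real-time view, but only the first one raises an alert.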


Sprint Workflow

stateDiagram-v2
    [*] --> Planning: Sprint Start
    Planning --> InProgress: Tasks Assigned
    InProgress --> CodeReview: PR Created
    CodeReview --> Testing: Approved
    Testing --> Done: Tests Pass
    CodeReview --> InProgress: Changes Requested
    Testing --> InProgress: Tests Fail
    Done --> [*]: Sprint End

    note right of Planning: 2 hour session
    note right of InProgress: Daily standups
    note right of Done: Sprint review
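
The task workflow above is a small state machine and can be written as a transition table, for example to validate status changes in a script. The state and event names are copied from the diagram; the function name is hypothetical.

```python
# (current state, event) -> next state, as drawn in the state diagram.
TRANSITIONS = {
    ("Planning", "Tasks Assigned"): "InProgress",
    ("InProgress", "PR Created"): "CodeReview",
    ("CodeReview", "Approved"): "Testing",
    ("CodeReview", "Changes Requested"): "InProgress",
    ("Testing", "Tests Pass"): "Done",
    ("Testing", "Tests Fail"): "InProgress",
}


def advance(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


# A task that fails testing once before it is done.
state = "Planning"
for event in ["Tasks Assigned", "PR Created", "Approved", "Tests Fail",
              "PR Created", "Approved", "Tests Pass"]:
    state = advance(state, event)
print(state)
```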


Current Sprint Status

Active Sprint: Sprint 1 - Foundation
Sprint Start: TBD
Sprint End: TBD

Sprint 1 Progress


Meeting Schedule

| Meeting | Frequency | Duration | Day/Time |
|---|---|---|---|
| Sprint Planning | Bi-weekly | 2 hours | Sprint Day 1 |
| Daily Standup | Daily | 15 min | TBD |
| Sprint Review | Bi-weekly | 1.5 hours | Sprint Last Day |
| Sprint Retro | Bi-weekly | 1 hour | Sprint Last Day |