Train AI Models

What is Training?

Model training is the process by which an AI algorithm learns patterns from labeled images so that it can make accurate predictions on new, previously unseen images.

Prerequisites

📁 Prepared Dataset

  • Uploaded images: At least 50-100 images per class as a general baseline
  • Complete labeling: All images must be annotated
  • Class balance: Similar number of examples per category
  • Validation performed: No errors in annotations

Recommended dataset size by model type:

  • Classification: 100-500 images per class minimum
  • Detection: 50-200 examples per object type
  • Segmentation: 20-100 images with precise masks
  • Anomalies: 200+ "normal" images
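
The size and balance requirements above can be checked with a short script before training starts. This is a minimal sketch, not a platform feature: the function name `check_class_counts`, the default minimum of 50 images, and the 2x imbalance ratio are assumptions chosen for illustration.

```python
from collections import Counter

def check_class_counts(labels, min_per_class=50, max_imbalance=2.0):
    """Return a list of warnings for a list of per-image class labels.

    min_per_class and max_imbalance are illustrative defaults, not platform rules.
    """
    counts = Counter(labels)
    warnings = []
    # Flag every class that has fewer images than the chosen minimum.
    for name, n in sorted(counts.items()):
        if n < min_per_class:
            warnings.append(f"{name}: only {n} images (minimum {min_per_class})")
    # Flag the dataset if the largest class dwarfs the smallest one.
    if counts:
        largest, smallest = max(counts.values()), min(counts.values())
        if largest / smallest > max_imbalance:
            warnings.append(
                f"imbalance: largest class has {largest} images, smallest has {smallest}"
            )
    return warnings

# Hypothetical dataset: 120 "ok" images and 40 "scratch" images.
for warning in check_class_counts(["ok"] * 120 + ["scratch"] * 40):
    print(warning)
```

With the hypothetical labels above, both the minimum-size check and the balance check report a problem for the "scratch" class.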

Start Training

🚀 From the Models Tab

  1. Open the dataset you want to train on
  2. Select the "Models" tab in the dataset view
  3. Click the "Train Model" button (or its equivalent)
  4. Configure the training parameters

⚙️ Basic Configuration

Main Parameters

  • Model name: Unique identifier
  • Description: Purpose and specific characteristics
  • Model type: According to your dataset (classification, detection, etc.)
  • Version: For version control

Data Split

  • Training (70%): Data for learning patterns
  • Validation (20%): Data for tuning hyperparameters and checking for overfitting during training
  • Test (10%): Data for final evaluation
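
The 70/20/10 split can be reproduced outside the platform with a seeded shuffle. The sketch below is an assumption about how such a split could be done, not a description of the platform's internal logic; `split_dataset` and its parameters are hypothetical names.

```python
import random

def split_dataset(items, train=0.7, val=0.2, seed=42):
    """Shuffle items and split them into train/validation/test lists.

    The test share is whatever remains after the train and validation shares.
    """
    items = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 700 200 100
```

A plain random split does not guarantee that each class appears in the same proportion in all three parts; for small or imbalanced datasets, splitting each class separately may be preferable.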

🔧 Advanced Configurations

Hyperparameters

  • Learning Rate: Step size of each weight update (0.001 is a typical starting value)
  • Batch Size: Number of images processed per training step
  • Epochs: Number of complete passes through the data
  • Architecture: Type of neural network (ResNet, EfficientNet, etc.)
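
Batch size and epochs together determine how many weight updates a run performs. The arithmetic can be sketched as follows, assuming the last partial batch of each epoch is kept; `training_steps` is a hypothetical helper name.

```python
import math

def training_steps(num_images, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for a training configuration."""
    # One step processes one batch; a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# 700 training images, batch size 32, 50 epochs.
print(training_steps(num_images=700, batch_size=32, epochs=50))  # (22, 1100)
```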

Data Augmentation

  • Rotation: Rotate images for greater variability
  • Zoom: Zoom in/out for different scales
  • Flip: Horizontal/vertical flip
  • Noise: Add variations for robustness
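
To make the augmentations above concrete, here is a minimal sketch that applies flips, a 90-degree rotation, and noise to a grayscale image stored as a nested list of 0-255 values. The function names and the nested-list representation are assumptions for illustration; a real pipeline would normally rely on an image library and would support arbitrary rotation angles and zoom.

```python
import random

def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Reverse the order of the rows (top to bottom)."""
    return [row[:] for row in image[::-1]]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def add_noise(image, amplitude=10, seed=None):
    """Add uniform integer noise to each pixel, clamped to the 0-255 range."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, px + rng.randint(-amplitude, amplitude))) for px in row]
        for row in image
    ]

image = [[1, 2], [3, 4]]
print(flip_horizontal(image))  # [[2, 1], [4, 3]]
print(rotate_90(image))        # [[3, 1], [4, 2]]
```

Each function returns a new image and leaves the input untouched, so several augmented variants can be generated from one original.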

Training Process

📈 Training States

🟡 Starting

  • Data preparation: Loading and preprocessing
  • Model initialization: Architecture configuration
  • Parameter validation: Configuration verification

🔵 Training

  • Visible progress: Progress bar with percentage
  • Real-time metrics: Loss, accuracy per epoch
  • Estimated time: Approximate remaining duration
  • Cancellation possible: Option to stop if necessary

✅ Completed

  • Final metrics: Precision, recall, F1-score
  • Convergence graphs: Evolution during training
  • Saved model: Available for predictions
  • Detailed report: Complete analysis of results

❌ Failed

  • Error message: Problem description
  • Detailed logs: For technical diagnosis
  • Suggestions: Possible solutions
  • Retry: Option to correct and retrain

📊 Monitoring During Training

Visible Metrics

  • Loss: Should decrease over time
  • Accuracy: Should increase progressively
  • Validation Loss: To detect overfitting
  • Learning Curves: Evolution graphs

Health Indicators

  • ✅ Normal convergence: Loss decreases smoothly
  • ⚠️ Overfitting: Validation loss worsens while training loss keeps improving
  • ❌ Underfitting: Both training and validation metrics stagnate at poor values
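
The health indicators above can be approximated from the per-epoch loss values. The following is a rough heuristic sketch, not the platform's own logic: `diagnose`, the 3-epoch window, and the 0.01 tolerance are all assumptions.

```python
def diagnose(train_loss, val_loss, window=3, tol=0.01):
    """Classify the recent trend of two per-epoch loss curves.

    Compares the latest loss with the loss `window` epochs earlier.
    """
    if len(train_loss) <= window or len(val_loss) <= window:
        return "not enough data"
    train_delta = train_loss[-1] - train_loss[-1 - window]
    val_delta = val_loss[-1] - val_loss[-1 - window]
    if train_delta < -tol and val_delta > tol:
        return "overfitting"   # training improves, validation worsens
    if abs(train_delta) <= tol and abs(val_delta) <= tol:
        return "stagnating"    # neither curve is moving
    if train_delta < -tol and val_delta < -tol:
        return "converging"    # both curves still improve
    return "inconclusive"

train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val = [1.1, 0.9, 0.7, 0.75, 0.8, 0.9]
print(diagnose(train, val))  # overfitting
```

A "stagnating" result is ambiguous on its own: it indicates underfitting only if the losses have flattened at a poor level, and a finished, well-converged run looks the same by trend alone.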

Configurations by Model Type

Where a value is given as a range (for example 50-100 epochs), treat it as a typical starting range and pick a single value within it for an actual run.

🏷️ Classification

json
{
  "epochs": "50-100",
  "batch_size": "16-32",
  "learning_rate": 0.001,
  "optimizer": "Adam",
  "augmentation": true
}

🎯 Object Detection

json
{
  "epochs": "100-200",
  "batch_size": "8-16",
  "learning_rate": 0.0001,
  "backbone": "ResNet50",
  "anchor_sizes": [32, 64, 128]
}

🎨 Segmentation

json
{
  "epochs": "150-300",
  "batch_size": "4-8",
  "learning_rate": 0.0001,
  "architecture": "U-Net",
  "loss_function": "Dice + CrossEntropy"
}

🚨 Anomaly Detection

json
{
  "epochs": "100-200",
  "batch_size": "32-64",
  "learning_rate": 0.001,
  "latent_dim": 128,
  "reconstruction_loss": "MSE"
}
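
Assuming the epoch and batch-size ranges above are recommendations rather than hard limits, a concrete configuration can be checked against them before a run. The table below copies the ranges from this section; the dictionary keys and the `out_of_range` helper are hypothetical names.

```python
# Recommended (low, high) ranges per model type, copied from the configurations above.
RECOMMENDED = {
    "classification": {"epochs": (50, 100), "batch_size": (16, 32)},
    "detection": {"epochs": (100, 200), "batch_size": (8, 16)},
    "segmentation": {"epochs": (150, 300), "batch_size": (4, 8)},
    "anomaly": {"epochs": (100, 200), "batch_size": (32, 64)},
}

def out_of_range(model_type, config):
    """Return the config keys whose values fall outside the recommended range."""
    ranges = RECOMMENDED[model_type]
    return [
        key
        for key, (low, high) in ranges.items()
        if key in config and not low <= config[key] <= high
    ]

print(out_of_range("detection", {"epochs": 150, "batch_size": 32}))  # ['batch_size']
```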

Results Evaluation

📊 Main Metrics

For Classification

  • Global Accuracy: Total percentage of correct predictions
  • Precision per Class: Share of the predictions for a class that are correct
  • Recall per Class: Share of the actual examples of a class that are found
  • Confusion Matrix: Table of actual versus predicted classes
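
The classification metrics above all derive from counts of (actual, predicted) pairs, which is exactly what a confusion matrix holds. Here is a minimal sketch using only the standard library; `per_class_metrics` is a hypothetical name, and the cat/dog labels are made-up example data.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return (accuracy, {class: {"precision", "recall", "f1"}})."""
    if not y_true:
        return 0.0, {}
    # pairs[(actual, predicted)] is one cell of the confusion matrix.
    pairs = Counter(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(pairs[(t, c)] for t in classes if t != c)  # predicted c, was not c
        fn = sum(pairs[(c, p)] for p in classes if p != c)  # was c, predicted not c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(c, c)] for c in classes) / len(y_true)
    return accuracy, metrics

y_true = ["cat", "cat", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "cat"]
accuracy, metrics = per_class_metrics(y_true, y_pred)
print(accuracy)                     # 0.6
print(metrics["dog"]["precision"])  # 0.5
```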

For Detection

  • mAP@0.5: Mean average precision at an IoU threshold of 0.5
  • mAP@0.5:0.95: Mean average precision averaged over IoU thresholds from 0.5 to 0.95
  • Detections per Image: Average number of objects found
  • False Positives/Negatives: Detection errors
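
The detection metrics above rest on intersection over union (IoU), which decides whether a predicted box counts as a match for a labeled box. A minimal sketch follows, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; `box_iou` is a hypothetical name.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region, zero if the boxes do not overlap.
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 10x10 boxes that overlap by half their width.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(iou, 3))  # 0.333
```

In this example the IoU is below 0.5, so the prediction would not count as a correct detection under mAP@0.5.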

For Segmentation

  • IoU per Class: Intersection over union per category
  • Dice Score: Similarity measure between masks
  • Pixel Accuracy: Percentage of correct pixels
  • Boundary F1: F1 score measured along mask edges
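
IoU and Dice for a single class can be computed directly from two binary masks. The sketch below assumes masks are nested lists of 0/1 values of equal shape and treats two empty masks as a perfect match; `mask_scores` is a hypothetical name.

```python
def mask_scores(pred, target):
    """Return (iou, dice) for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    pred_area = sum(1 for a in p if a)
    target_area = sum(1 for b in t if b)
    union = pred_area + target_area - intersection
    # Two empty masks agree completely, so both scores are defined as 1.0 here.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / (pred_area + target_area) if pred_area + target_area else 1.0
    return iou, dice

pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
iou, dice = mask_scores(pred, target)
print(round(iou, 3), dice)  # 0.333 0.5
```

The two scores are related by Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU for the same pair of masks.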

📈 Convergence Analysis

Important Graphs

  • Training vs Validation Loss: To detect overfitting
  • Accuracy Curves: Accuracy evolution over epochs
  • Learning Rate Schedule: How the learning rate changes during training
  • Gradient Flow: Gradient magnitudes across layers

Interpretation

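  • Parallel curves: Training and validation curves move together, indicating healthy training
  • Divergence: Validation curve pulls away from the training curve, possible overfitting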
  • Stagnation: Possible underfitting or inadequate learning rate

Model Optimization

🎯 Performance Improvement

If Model Doesn't Converge

  • Increase epochs: More training time
  • Reduce learning rate: More gradual learning
  • Change architecture: More appropriate model
  • Review data: Dataset quality and balance
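
Why reducing the learning rate can help a model converge is easiest to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is an illustrative stand-in for a real loss function, not how the platform trains models; `gradient_descent` is a hypothetical name.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 from x0 and return the final distance from the minimum."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # the gradient of x**2 is 2x
    return abs(x)

# A small learning rate shrinks x toward the minimum at 0.
print(gradient_descent(lr=0.1) < 0.02)  # True
# A learning rate that is too large overshoots further on every step.
print(gradient_descent(lr=1.1) > 1.0)   # True
```

Each update multiplies x by (1 - 2 * lr), so the iteration converges only when that factor has magnitude below 1.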

If There's Overfitting

  • Data Augmentation: More data variability
  • Dropout: Regularization during training
  • Early Stopping: Stop when validation worsens
  • More data: Increase dataset size
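
Early stopping is commonly implemented with a patience counter: training stops once the validation loss has not improved for a set number of epochs. Here is a minimal sketch; the class name, `patience`, and `min_delta` are assumptions, and the loss values are made-up example data.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # New best validation loss: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], start=1):
    if stopper.should_stop(loss):
        print(f"stopping after epoch {epoch}")  # stopping after epoch 5
        break
```

In practice the weights from the best epoch (epoch 3 in this example) are usually the ones kept, which is where training checkpoints become useful.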

If There's Underfitting

  • More complex model: More layers or parameters
  • Increase learning rate: Faster learning
  • Less regularization: Reduce dropout
  • More epochs: More training time

⚡ Speed Optimization

During Training

  • Larger batch size: More parallelization (if memory allows)
  • Mixed precision: Use of float16 + float32
  • More powerful GPU: Specialized hardware
  • Optimized preprocessing: More efficient data loading

For Inference

  • Model pruning: Remove unnecessary connections
  • Quantization: Reduce weight precision
  • TensorRT: NVIDIA GPU-specific optimization
  • ONNX: Portable model format supported by optimized inference runtimes
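
To illustrate what reducing weight precision means, the sketch below applies symmetric 8-bit quantization to a short, non-empty list of weights. It shows the general idea only; production toolchains use more elaborate schemes, and `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the quantized integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# The rounding error per weight is at most half of one quantization step.
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Each weight is then stored as a small integer instead of a 32-bit float, at the cost of a bounded rounding error.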

Versioning and Management

📦 Version Control

  • Descriptive names: v1.0_high_precision, v2.0_fast
  • Metadata: Dates, parameters, dataset used
  • Comparison: Side-by-side metrics between versions
  • Rollback: Ability to return to previous version
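
A side-by-side comparison between versions only needs the metadata listed above to be stored in a structured form. The sketch below shows one possible record layout; the `ModelVersion` fields, the `compare` helper, and all names and metric values in the example are made up for illustration and do not reflect the platform's storage format.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    name: str
    dataset: str
    params: dict
    metrics: dict = field(default_factory=dict)

def compare(old, new):
    """Return metric deltas (new minus old) for metrics present in both versions."""
    return {
        key: new.metrics[key] - old.metrics[key]
        for key in old.metrics
        if key in new.metrics
    }

# Hypothetical versions with invented metric values.
v1 = ModelVersion("v1.0_high_precision", "parts-2024", {"epochs": 50}, {"accuracy": 0.88})
v2 = ModelVersion("v2.0_fast", "parts-2024", {"epochs": 30}, {"accuracy": 0.91})
for metric, delta in compare(v1, v2).items():
    print(f"{metric}: {delta:+.2f}")  # accuracy: +0.03
```

A negative delta on a key metric is a natural trigger for rolling back to the previous version.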

💾 Storage

  • Cloud models: Automatic backup
  • Compression: Reduce file sizes
  • Checkpoints: Save points during training
  • Export formats: TensorFlow, PyTorch, ONNX

🔄 Lifecycle

  1. Training: Create new version
  2. Evaluation: Compare with previous versions
  3. Testing: Test on real data
  4. Deploy: Put into production
  5. Monitor: Track performance in use
  6. Retrain: Update with new data

Best Practices

📋 Before Training

  • Data cleaning: Review image and label quality
  • Exploratory analysis: Understand data distribution
  • Baseline: Establish minimum acceptable metrics
  • Strategy: Plan for different scenarios

🎯 During Training

  • Active monitoring: Watch real-time metrics
  • Regular checkpoints: Save progress
  • Detailed logging: Record parameters and results
  • Experimentation: Try different configurations

✅ After Training

  • Cross validation: Confirm results on held-out data the model did not see during training
  • Error analysis: Analyze cases where model fails
  • Documentation: Record configuration and results
  • Prepare deployment: Optimize for production