ZETIC.MLange

LLM Inference Modes

Choose the optimal inference mode for your LLM based on speed and accuracy requirements.

Available Modes

Default (Auto)

Intelligently balances speed and accuracy for optimal performance. This mode evaluates quantized models across multiple benchmark datasets (MMLU, TruthfulQA, CNN/DailyMail, GSM8K) and selects the fastest model that maintains accuracy within acceptable thresholds:

  • Ensures the original model's scores are meaningful (> 0.2) before comparison
  • Allows a maximum 15% accuracy drop relative to the original model
  • Requires the absolute score difference to be < 0.05 to be considered acceptable
  • Prioritizes speed among models that meet the accuracy requirements

This mode is ideal for most production use cases where both performance and quality matter.
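The selection rule above can be sketched as follows. This is an illustrative sketch only: the `Candidate` type, field names, and the exact way the two accuracy criteria combine are assumptions, not part of the MLange API; the threshold values mirror the documented criteria.

```kotlin
// Hypothetical sketch of the Default (Auto) selection rule.
// Assumes both documented accuracy criteria must hold together.

data class Candidate(
    val name: String,
    val score: Double,          // mean benchmark score across datasets
    val tokensPerSecond: Double // measured speed on the target device
)

fun selectAuto(original: Candidate, quantized: List<Candidate>): Candidate {
    // The original model's score must be meaningful before any comparison.
    if (original.score <= 0.2) return original

    val acceptable = quantized.filter { q ->
        val relativeDrop = (original.score - q.score) / original.score
        val absoluteDiff = original.score - q.score
        relativeDrop <= 0.15 && absoluteDiff < 0.05
    }
    // Fastest model among those meeting the accuracy requirements.
    return acceptable.maxByOrNull { it.tokensPerSecond } ?: original
}

fun main() {
    val original = Candidate("fp16", 0.60, 20.0)
    val q4 = Candidate("q4", 0.58, 90.0)   // 3.3% drop, diff 0.02: acceptable
    val q2 =Andidate@0.let { Candidate("q2", 0.40, 150.0) } // 33% drop: rejected
    println(selectAuto(original, listOf(q4, q2)).name)
}
```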

Speed First

Maximizes inference speed with minimum latency by selecting the most aggressive quantization. Recommended for real-time applications where response time is the top priority and slight accuracy trade-offs are acceptable.

Accuracy First

Delivers the highest precision by intelligently selecting the optimal quantization:

Without Dataset Specification (LLMDataSetType = null):

  • Prioritizes less quantized models (higher bit-width)
  • Ensures maximum fidelity to the original model

With Dataset Specification (LLMDataSetType provided):

  • Selects the quantization type with the highest score on the specified benchmark dataset
  • Available datasets:
    • MMLU: Massive Multitask Language Understanding
    • TRUTHFULQA: TruthfulQA benchmark
    • CNN_DAILYMAIL: CNN/DailyMail summarization
    • GSM8K: Grade School Math 8K

Use this mode when accuracy is paramount and you have specific evaluation criteria.
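The two branches above can be sketched as follows. The `Quantized` type and its fields are assumptions for illustration, not MLange API types.

```kotlin
// Hypothetical sketch of the Accuracy First selection rule.

data class Quantized(
    val name: String,
    val bitWidth: Int,              // e.g. 4, 8, 16
    val scores: Map<String, Double> // per-dataset benchmark scores
)

fun selectAccuracy(candidates: List<Quantized>, dataset: String?): Quantized? =
    if (dataset == null) {
        // No dataset given: prefer the least quantized model (highest bit-width).
        candidates.maxByOrNull { it.bitWidth }
    } else {
        // Dataset given: pick the highest score on that benchmark.
        candidates.maxByOrNull { it.scores[dataset] ?: Double.NEGATIVE_INFINITY }
    }
```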

By default, the optimal quantization is automatically determined based on:

  • Accuracy benchmarks across multiple datasets (MMLU, TruthfulQA, CNN/DailyMail, GSM8K)
  • Performance metrics for each device (tokens per second)

You can override this automatic selection by explicitly specifying a mode and optionally a target dataset.

API Usage

Android (Kotlin)

// Default: Auto Mode (Balanced)
// Selects the fastest model maintaining <15% accuracy drop
private val model_auto = ZeticMLangeLLMModel(
    CONTEXT,
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_AUTO
)

// Speed First Mode
// Most aggressive quantization for minimum latency
private val model_fast = ZeticMLangeLLMModel(
    CONTEXT,
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_SPEED
)

// Accuracy First Mode (General)
// Prioritizes less quantized models
private val model_accurate = ZeticMLangeLLMModel(
    CONTEXT,
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_ACCURACY
)

// Accuracy First Mode (Dataset-Specific)
// Optimized for specific benchmark performance
private val model_accurate_mmlu = ZeticMLangeLLMModel(
    CONTEXT,
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_ACCURACY,
    LLMDataSetType.MMLU  // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)

iOS (Swift)

// Default: Auto Mode (Balanced)
// Selects the fastest model maintaining <15% accuracy drop
let model_auto = try ZeticMLangeLLMModel(
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_AUTO
)

// Speed First Mode
// Most aggressive quantization for minimum latency
let model_fast = try ZeticMLangeLLMModel(
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_SPEED
)

// Accuracy First Mode (General)
// Prioritizes less quantized models
let model_accurate = try ZeticMLangeLLMModel(
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_ACCURACY
)

// Accuracy First Mode (Dataset-Specific)
// Optimized for specific benchmark performance
let model_accurate_mmlu = try ZeticMLangeLLMModel(
    $PERSONAL_KEY,
    $MODEL_NAME,
    $VERSION,
    LLMModelMode.RUN_ACCURACY,
    LLMDataSetType.MMLU  // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)

Benchmark-Based Selection

MLange automatically determines the optimal quantization based on comprehensive benchmark results across multiple datasets. The platform evaluates each quantized version against the original model and your target device capabilities to ensure the best balance of speed and accuracy.

Each mode uses multi-dataset evaluation to select the most suitable quantization for your specific requirements and hardware constraints.


Need Help?

For collaboration opportunities or feature requests, contact us at contact@zetic.ai.