LLM Inference Modes
Configure LLM inference modes for speed or accuracy with ZETIC Melange.
Melange provides inference modes for LLM models so you can balance speed against accuracy based on your application's requirements.
Available Modes
Speed (Available)
Minimizes latency by selecting the most aggressive quantization. Recommended for real-time applications where response time is the top priority and a slight accuracy trade-off is acceptable.
Auto (Paused)
Note: The performance metric for LLM accuracy is being updated. New models are currently available in Speed mode only.
Intelligently balances speed and accuracy for optimal performance. This mode evaluates quantized models across multiple benchmark datasets (MMLU, TruthfulQA, CNN/DailyMail, GSM8K) and selects the fastest model that maintains accuracy within acceptable thresholds:
- Ensures the original model scores are meaningful (> 0.2) before comparison
- Allows maximum 15% accuracy drop compared to the original model
- Requires absolute score difference < 0.05 to be considered acceptable
- Prioritizes speed among models meeting accuracy requirements
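The rules above amount to a filter-then-rank step. The sketch below is a minimal illustration only: the `Candidate` type, its fields, and the exact way the two accuracy thresholds combine are assumptions, not the actual Melange internals.

```kotlin
// Hypothetical sketch of Auto mode's candidate selection.
// `Candidate` and its fields are illustrative, not the Melange API.
data class Candidate(val name: String, val score: Double, val tokensPerSec: Double)

fun selectAuto(originalScore: Double, candidates: List<Candidate>): Candidate? {
    // The original model's score must be meaningful (> 0.2) before comparison
    if (originalScore <= 0.2) return null
    return candidates
        .filter { c ->
            val drop = originalScore - c.score
            // One reading of the thresholds: at most a 15% relative drop,
            // and an absolute drop below 0.05
            drop / originalScore <= 0.15 && drop < 0.05
        }
        .maxByOrNull { it.tokensPerSec } // fastest among acceptable candidates
}
```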
Accurate (Paused)
Delivers the highest precision by intelligently selecting the optimal quantization.
Without Dataset Specification (LLMDataSetType = null):
- Prioritizes less quantized models (higher bit-width)
- Ensures maximum fidelity to the original model
With Dataset Specification (LLMDataSetType provided):
- Selects the quantization type with the highest score on the specified benchmark dataset
- Available datasets: MMLU, TRUTHFULQA, CNN_DAILYMAIL, GSM8K
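The two selection paths can be sketched as below. The `Quantization` type and its fields are hypothetical names for illustration, not the Melange internals.

```kotlin
// Hypothetical sketch of Accurate mode's two selection paths.
data class Quantization(
    val name: String,
    val bitWidth: Int,
    val benchmarkScores: Map<String, Double> // e.g. "MMLU" -> 0.62
)

fun selectAccurate(candidates: List<Quantization>, dataset: String? = null): Quantization? =
    if (dataset == null) {
        // No dataset: prefer the least-quantized model (highest bit-width)
        candidates.maxByOrNull { it.bitWidth }
    } else {
        // Dataset given: prefer the highest score on that benchmark
        candidates.maxByOrNull { it.benchmarkScores[dataset] ?: Double.NEGATIVE_INFINITY }
    }
```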
API Usage
Android (Kotlin):

```kotlin
// Speed First Mode
val modelFast = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_SPEED
)

// Auto Mode (Paused)
val modelAuto = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_AUTO
)

// Accuracy First Mode (Paused)
val modelAccurate = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_ACCURACY
)

// Accuracy First with Dataset (Paused)
val modelAccurateMmlu = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_ACCURACY,
    LLMDataSetType.MMLU // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)
```

iOS (Swift):

```swift
// Speed First Mode
let modelFast = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_SPEED
)

// Auto Mode (Paused)
let modelAuto = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_AUTO
)

// Accuracy First Mode (Paused)
let modelAccurate = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_ACCURACY
)

// Accuracy First with Dataset (Paused)
let modelAccurateMmlu = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_ACCURACY,
    dataSetType: MMLU // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)
```

Benchmark-Based Selection
Melange automatically determines the optimal quantization based on comprehensive benchmark results across multiple datasets. The platform evaluates each quantized version against the original model and your target device capabilities to ensure the best balance of speed and accuracy.
Each mode uses multi-dataset evaluation to select the most suitable quantization for your specific requirements and hardware constraints.
Next Steps
- LLM Inference Overview: Getting started with LLM inference
- Supported LLM Models: Available models
- Inference Mode Selection (General): Modes for non-LLM models