Enable Expert Parallel for vLLM Inference Services
Introduction
This document shows a single-node, YAML-first starting point for enabling vLLM Expert Parallel (EP) in an InferenceService.
Expert Parallel is an upstream vLLM capability for Mixture-of-Experts (MoE) models. It is still experimental in vLLM, and the related argument names or defaults may change in future releases.
This page focuses on a single-node configuration example for getting started with Expert Parallel. For performance tuning, capacity planning, and distributed deployment details, refer to the official vLLM documentation.
When to Use Expert Parallel
Expert Parallel is relevant when you are serving an MoE model and want vLLM to shard the expert layers across GPUs instead of relying on the default expert-layer grouping behavior.
For a single-node deployment, the upstream vLLM pattern is:
- Enable EP with `--enable-expert-parallel`.
- Keep the example on one node.
- Use `--data-parallel-size` to span the GPUs on that node.
- Use `--tensor-parallel-size 1` in this example so the attention layers stay replicated across data parallel ranks instead of being sharded with tensor parallelism.
If you are serving a dense model, or if your current runtime image does not include the EP-related dependencies required by upstream vLLM, this guide is not the right starting point.
Prerequisites and Limitations
- You have access to a Kubernetes cluster with KServe installed.
- You have a namespace where you can create `InferenceService` resources.
- You already have a vLLM serving runtime available on the platform.
- The runtime image you use already includes the upstream dependencies required for vLLM EP.
- Your model is an MoE model and is already accessible to the service through the configured `storageUri`.
- Your target node has multiple visible GPUs. This example uses the detected GPU count as the single-node data parallel size.
If your current vLLM image does not already include the required EP dependencies, extend or rebuild the runtime image first. For platform-specific runtime customization, see Extend Inference Runtimes. For the upstream dependency list and backend guidance, see the official vLLM EP deployment guide in References.
EP Configuration Overview
Enable EP by adding the `--enable-expert-parallel` flag. In upstream vLLM, the expert parallel size is computed automatically as `EP_SIZE = TP_SIZE * DP_SIZE`, where:

- `TP_SIZE`: tensor parallel size
- `DP_SIZE`: data parallel size
- `EP_SIZE`: expert parallel size, computed automatically by vLLM
This means you do not set a separate EP size argument. Instead, you choose the tensor parallel and data parallel sizes, and vLLM derives the effective expert parallel group size from those settings.
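This derivation can be sanity-checked with a few lines of Python (a sketch; the helper name `ep_size` is illustrative and not a vLLM API):

```python
def ep_size(tp_size: int, dp_size: int) -> int:
    """Effective expert parallel group size as derived by vLLM:
    one EP group spans every GPU in the deployment."""
    return tp_size * dp_size

# Single-node pattern from this page: TP=1, DP = number of GPUs on the node
print(ep_size(1, 8))  # one EP group of size 8

# Mixed pattern: TP=2 inside each of 4 DP groups, 8 GPUs total
print(ep_size(2, 4))  # still one EP group of size 8
```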
Layer Behavior When EP Is Enabled
When EP is enabled for an MoE model, different layer types use different parallelism strategies:
For attention layers:
- When `TP_SIZE = 1`, attention weights are replicated across all data parallel ranks.
- When `TP_SIZE > 1`, attention weights are sharded with tensor parallelism inside each data parallel group.
For example, if TP_SIZE = 2 and DP_SIZE = 4, the service uses 8 GPUs in total:
- The expert layers form one EP group of size 8, with experts distributed across all GPUs.
- The attention layers use tensor parallelism of size 2 inside each of the 4 data parallel groups.
Compared with a regular data parallel deployment, the main difference is how the MoE layers are distributed. Without --enable-expert-parallel, the MoE layers follow tensor parallel grouping behavior. With EP enabled, the expert layers switch to expert parallelism, which is designed specifically for MoE-style expert sharding.
Upstream Command and Platform Mapping
The upstream single-node example uses a command similar to the following:
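A sketch based on the flags named on this page; the model name and the data parallel size of 8 are placeholders for your environment:

```shell
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel
```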
On Alauda AI, these same flags are typically passed through the InferenceService container command. In other words:
- `vllm serve ...` becomes the command launched inside `spec.predictor.model.command`.
- `--tensor-parallel-size`, `--data-parallel-size`, and `--enable-expert-parallel` are appended to the vLLM startup command.
- Model location, runtime name, and Kubernetes resources are expressed through `storageUri`, `runtime`, and `resources`.
This is why the following example focuses on how to place the EP-related flags into the platform's InferenceService YAML.
Configure a Single-Node InferenceService
Create a YAML file such as deepseek-v3-ep.yaml with the following content:
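A minimal sketch of such a manifest, following the mapping described above. The service name, runtime name, `storageUri`, model path, GPU resource key, and GPU count are all placeholders to adapt to your cluster:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-v3-ep
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-serving-runtime        # placeholder: your vLLM runtime name
      storageUri: pvc://model-store/deepseek-v3   # placeholder: your model location
      command:
        - vllm
        - serve
        - /mnt/models                      # placeholder: model path inside the container
        - --tensor-parallel-size=1
        - --data-parallel-size=8           # placeholder: GPUs on the target node
        - --enable-expert-parallel
      resources:
        requests:
          nvidia.com/gpu: "8"              # placeholder: your cluster's GPU resource key
        limits:
          nvidia.com/gpu: "8"
```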
Apply the manifest:
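Assuming the manifest was saved as `deepseek-v3-ep.yaml`, apply it with `kubectl`, replacing `<namespace>` with your target namespace:

```shell
kubectl apply -f deepseek-v3-ep.yaml -n <namespace>
```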
Why These Flags Matter
Adjust the GPU resource fields to match the resource keys available in your cluster and the number of GPUs on the target node. The important part of this example is how the vLLM EP arguments are placed in the InferenceService command. If you need to set an all-to-all backend explicitly, follow the upstream backend selection guide before adding --all2all-backend.
Review the Configured Spec
After applying the manifest, review the resulting InferenceService spec and confirm the EP-related arguments are present:
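One way to inspect the applied spec is with `kubectl` (the service name here matches the example manifest; adjust the name and `<namespace>` for your deployment):

```shell
kubectl get inferenceservice deepseek-v3-ep -n <namespace> -o yaml
```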
Focus on the generated predictor command and confirm that it still includes:
- `--enable-expert-parallel`
- `--data-parallel-size`
- `--tensor-parallel-size 1`
This review confirms that the intended vLLM arguments were applied to the service configuration. It does not validate runtime performance, backend compatibility, or multi-node behavior.
For Multi-Node Deployments
Multi-node EP deployments require additional distributed runtime and networking configuration, including per-node launch settings, node roles, and data-parallel communication settings.
This page focuses on the single-node configuration pattern. If you need multi-node EP, refer to the official vLLM guide and adapt the deployment model to your cluster topology and runtime environment.