(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] 
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-0.5B-Instruct
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] 
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen2.5-0.5B-Instruct', 'host': '0.0.0.0', 'model': 'Qwen/Qwen2.5-0.5B-Instruct', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_GPU_MEMORY_UTILIZATION
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_MAX_MODEL_LEN
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_VENV_DIR
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_HOST
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_INSTALL_COMMAND
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_READINESS_TIMEOUT_SECONDS
(APIServer pid=390) INFO 03-19 00:27:18 [model.py:531] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=390) INFO 03-19 00:27:18 [model.py:1554] Using max model len 4096
(APIServer pid=390) INFO 03-19 00:27:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=390) INFO 03-19 00:27:18 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [network_utils.py:187] Port 8000 is already in use, trying port 8001
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [network_utils.py:187] Port 8001 is already in use, trying port 8002
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.186.2:8002 backend=nccl
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [gpu_model_runner.py:4281] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
(EngineCore_DP0 pid=728) INFO 03-19 00:27:28 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=728) INFO 03-19 00:27:28 [flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=728) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=728) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=728) INFO 03-19 00:27:32 [weight_utils.py:561] Time spent downloading weights for Qwen/Qwen2.5-0.5B-Instruct: 3.887840 seconds
(EngineCore_DP0 pid=728) INFO 03-19 00:27:33 [weight_utils.py:601] No model.safetensors.index.json found in remote.
(EngineCore_DP0 pid=728) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.28it/s]
(EngineCore_DP0 pid=728) INFO 03-19 00:27:33 [default_loader.py:293] Loading weights took 0.20 seconds
(EngineCore_DP0 pid=728) INFO 03-19 00:27:33 [gpu_model_runner.py:4364] Model loading took 0.93 GiB memory and 5.336132 seconds
(EngineCore_DP0 pid=728) INFO 03-19 00:27:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/beacd980ce/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=728) INFO 03-19 00:27:38 [backends.py:976] Dynamo bytecode transform time: 4.19 s
(EngineCore_DP0 pid=728) INFO 03-19 00:27:42 [backends.py:350] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=728) INFO 03-19 00:27:45 [backends.py:366] Compiling a graph for compile range (1, 2048) takes 6.31 s
(EngineCore_DP0 pid=728) INFO 03-19 00:27:45 [monitor.py:35] torch.compile takes 11.60 s in total
(EngineCore_DP0 pid=728) INFO 03-19 00:27:45 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/4726302a53ffde8501a8fd77988ca983466008c256abacb750503889963eb0f1/rank_0_0/model
(EngineCore_DP0 pid=728) INFO 03-19 00:27:46 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/4726302a53ffde8501a8fd77988ca983466008c256abacb750503889963eb0f1/rank_0_0/model
(EngineCore_DP0 pid=728) INFO 03-19 00:27:54 [gpu_worker.py:424] Available KV cache memory: 36.25 GiB
(EngineCore_DP0 pid=728) INFO 03-19 00:27:54 [kv_cache_utils.py:1314] GPU KV cache size: 3,167,648 tokens
(EngineCore_DP0 pid=728) INFO 03-19 00:27:54 [kv_cache_utils.py:1319] Maximum concurrency for 4,096 tokens per request: 773.35x
(EngineCore_DP0 pid=728) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:01<00:00, 32.29it/s]
(EngineCore_DP0 pid=728) Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 31.38it/s]
(EngineCore_DP0 pid=728) INFO 03-19 00:27:57 [gpu_model_runner.py:5386] Graph capturing finished in 3 secs, took 0.43 GiB
(EngineCore_DP0 pid=728) INFO 03-19 00:27:57 [core.py:282] init engine (profile, create kv cache, warmup model) took 24.01 seconds
(EngineCore_DP0 pid=728) INFO 03-19 00:27:59 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=390) INFO 03-19 00:27:59 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=390) WARNING 03-19 00:27:59 [model.py:1355] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=390) INFO 03-19 00:27:59 [serving.py:185] Warming up chat template processing...
(APIServer pid=390) INFO 03-19 00:28:01 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=390) INFO 03-19 00:28:01 [serving.py:210] Chat template warmup completed in 1651.7ms
(APIServer pid=390) INFO 03-19 00:28:01 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:38] Available routes are:
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=390) INFO 03-19 00:28:01 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=390) INFO:     Started server process [390]
(APIServer pid=390) INFO:     Waiting for application startup.
(APIServer pid=390) INFO:     Application startup complete.
(APIServer pid=390) INFO:     127.0.0.1:55678 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=390) INFO:     127.0.0.1:55688 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=390) INFO 03-19 00:28:11 [loggers.py:259] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=390) INFO 03-19 00:28:21 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%