(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] █ █ █▄ ▄█
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.1
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] █▄█▀ █ █ █ █ model Qwen/Qwen2.5-0.5B-Instruct
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:302]
(APIServer pid=390) INFO 03-19 00:27:09 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen2.5-0.5B-Instruct', 'host': '0.0.0.0', 'model': 'Qwen/Qwen2.5-0.5B-Instruct', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_GPU_MEMORY_UTILIZATION
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_MAX_MODEL_LEN
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_VENV_DIR
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_HOST
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_INSTALL_COMMAND
(APIServer pid=390) WARNING 03-19 00:27:09 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_READINESS_TIMEOUT_SECONDS
(APIServer pid=390) INFO 03-19 00:27:18 [model.py:531] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=390) INFO 03-19 00:27:18 [model.py:1554] Using max model len 4096
(APIServer pid=390) INFO 03-19 00:27:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=390) INFO 03-19 00:27:18 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 
'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': , 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [network_utils.py:187] Port 8000 is already in use, trying port 8001
(EngineCore_DP0 pid=728) INFO 03-19 00:27:26 [network_utils.py:187] Port 8001 is already in use, trying port 8002
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.186.2:8002 backend=nccl
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=728) INFO 03-19 00:27:27 [gpu_model_runner.py:4281] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
(EngineCore_DP0 pid=728) INFO 03-19 00:27:28 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=728) INFO 03-19 00:27:28 [flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=728) :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=728) :1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=728) INFO 03-19 00:27:32 [weight_utils.py:561] Time spent downloading weights for Qwen/Qwen2.5-0.5B-Instruct: 3.887840 seconds
(EngineCore_DP0 pid=728) INFO 03-19 00:27:33 [weight_utils.py:601] No model.safetensors.index.json found in remote.
(EngineCore_DP0 pid=728) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00