Reference

Supported Models

Large Language Models (LLMs)

Model Name | Model Source
Qwen2.5-0.5B | https://huggingface.co/Qwen/Qwen2.5-0.5B
Qwen2.5-3B | https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
Qwen2.5-7B | https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Qwen3-0.6B | https://huggingface.co/Qwen/Qwen3-0.6B
Qwen3-1.7B | https://huggingface.co/Qwen/Qwen3-1.7B
Qwen3-4B | https://huggingface.co/Qwen/Qwen3-4B
Qwen3-8B | https://huggingface.co/Qwen/Qwen3-8B
HY-MT1.5-1.8B | https://huggingface.co/tencent/HY-MT1.5-1.8B
Youtu-LLM-2B | https://huggingface.co/tencent/Youtu-LLM-2B
GLM-Edge-1.5B-Chat | https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat

Vision-Language Multimodal Models (VLMs)

Model Name | Model Source
Qwen2.5-VL-3B | https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Qwen2.5-VL-7B | https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-Omni-3B (Thinker) | https://huggingface.co/Qwen/Qwen2.5-Omni-3B
Qwen3-VL-2B | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
Qwen3-VL-4B | https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
FastVLM | https://github.com/apple/ml-fastvlm
InternVL3-2B | https://huggingface.co/OpenGVLab/InternVL3-2B
InternVL3_5-4B | https://huggingface.co/OpenGVLab/InternVL3_5-4B-Instruct
MiMo-VL-7B-RL | https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL
Gemma-4-E2B | https://huggingface.co/google/gemma-4-E2B-it
Gemma-4-E4B | https://huggingface.co/google/gemma-4-E4B-it
SmolVLM-500M-Instruct | https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
SmolVLM2-500M-Video-Instruct | https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
UI-TARS-2B-SFT | https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT
PaddleOCR VL | https://huggingface.co/PaddlePaddle/PaddleOCR-VL

Retrieval / Embedding Models

Model Name | Model Source
Qwen3-Reranker-0.6B | https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
Qwen3-Reranker-4B | https://huggingface.co/Qwen/Qwen3-Reranker-4B
Qwen3-Embedding-0.6B | https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Qwen3-Embedding-4B | https://huggingface.co/Qwen/Qwen3-Embedding-4B
gme-Qwen2-VL-2B-Instruct | https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct

Speech Models (ASR / TTS)

Model Name | Model Source
Qwen3-ASR-0.6B | https://huggingface.co/Qwen/Qwen3-ASR-0.6B
Qwen3-TTS-12Hz-1.7B | https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base
VITS | https://github.com/jaywalnut310/vits
Whisper | https://huggingface.co/openai/whisper-large-v3
SenseVoiceSmall | https://modelscope.cn/models/iic/SenseVoiceSmall
Zipformer | https://huggingface.co/pfluo/k2fsa-zipformer-chinese-english-mixed

Image Feature / Vision Encoder Models

Model Name | Model Source
SigLIP | https://huggingface.co/google/siglip-so400m-patch14-384
Siglip2-so400m | https://huggingface.co/google/siglip2-so400m-patch14-384
MetaCLIP2 | https://huggingface.co/facebook/metaclip-2-worldwide-m16-384
Dinov3 | https://huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m
Depth-Anything-V2-small | https://huggingface.co/depth-anything/Depth-Anything-V2-Small
GR00T-N1.6-3B | https://huggingface.co/nvidia/GR00T-N1.6-3B

Traditional CNN Vision Models (Classification / Detection)

Model Name | Model Source
MobilenetV1 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet_v1/mobilenet_v1_1.0_224.tflite
MobilenetV2 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet/mobilenetv2-12.onnx
Resnet50V2 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/resnet/resnet50-v2-7.onnx
YOLOv5s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov5/yolov5s_rknn3.onnx
YOLOv6s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov6/yolov6s_rknn3.onnx
YOLOv8s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov8/yolov8s_rknn3.onnx

Model Performance

LLM Performance

Model Name | Accelerator Chip | TTFT (ms) | TPOT (ms) | Decode TPS
Qwen2.5-0.5B | RK182X | 22.74 | 4.48 | 223.40
Qwen2.5-1.5B | RK182X | 49.14 | 6.69 | 149.39
Qwen2.5-3B | RK182X | 85.54 | 9.69 | 103.24
Qwen2.5-7B | RK1828 | 162.25 | 14.19 | 70.47
Qwen3-0.6B | RK182X | 28.61 | 5.49 | 182.26
Qwen3-1.7B | RK1828 | 54.34 | 7.17 | 139.39
Qwen3-4B | RK1828 | 109.78 | 11.30 | 88.47
Qwen3-8B | RK1828 | 182.20 | 16.30 | 61.34

Test Conditions: Input Tokens = 128, New Tokens = 128
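
The TPOT and Decode TPS columns describe the same measurement from two sides: Decode TPS is approximately 1000 / TPOT (ms). The sketch below checks this against two rows of the LLM table and also estimates the total time for a 128-token generation as TTFT + (N - 1) × TPOT. That end-to-end formula and the function names are assumptions made for illustration; the document reports no total generation time.

```python
def decode_tps(tpot_ms: float) -> float:
    """Convert time per output token (ms) into decode tokens per second."""
    return 1000.0 / tpot_ms


def estimated_generation_ms(ttft_ms: float, tpot_ms: float, new_tokens: int) -> float:
    """Rough total latency: the first token arrives after TTFT, and each
    remaining token takes one TPOT. Illustrative assumption, not a measurement."""
    return ttft_ms + (new_tokens - 1) * tpot_ms


# (TTFT ms, TPOT ms, reported Decode TPS) copied from the LLM Performance table.
rows = {
    "Qwen2.5-0.5B": (22.74, 4.48, 223.40),
    "Qwen3-8B": (182.20, 16.30, 61.34),
}

for name, (ttft, tpot, reported_tps) in rows.items():
    derived = decode_tps(tpot)
    total = estimated_generation_ms(ttft, tpot, new_tokens=128)
    print(f"{name}: derived TPS {derived:.2f} (reported {reported_tps:.2f}), "
          f"estimated 128-token generation {total:.0f} ms")
```

The derived and reported TPS values differ slightly because TPOT is rounded to two decimals in the table.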

VLM Performance

Model | Accelerator Chip | Vision Resolution | Vision Latency (ms) | LLM TTFT (ms) | LLM Decode TPS
FastVLM_1.5B_stage3 | RK182X | 512 × 512 | 168.85 | 49.83 | 151.01
InternVL3-2B | RK182X | 448 × 448 | 184.19 | 49.85 | 147.62
InternVL3_5-4B | RK1828 | 448 × 448 | 176.99 | 110.06 | 87.95
Qwen2.5-VL-3B | RK1820 | 392 × 392 | 231.4 | 97.85 | 51.48
Qwen2.5-VL-3B | RK1828 | 392 × 392 | 212.28 | 87.6 | 104.05
Qwen2.5-VL-7B | RK1828 | 392 × 392 | 215.63 | 163.5 | 69.95
Qwen3-VL-2B | RK182X | 384 × 384 | 114.38 | 56.55 | 142.00
Qwen3-VL-4B | RK1828 | 384 × 384 | 117.6 | 111.45 | 87.8
MiMo-VL-7B-RL | RK1828 | 392 × 392 | 216.56 | 173.59 | 64.97
MiniCPM_V_4 | RK1828 | 448 × 448 | 236.67 | 97.81 | 106.56

Omni-Modal Model Performance

Model | Accelerator Chip | Vision Resolution | Vision Latency (ms) | Audio Latency (ms) | LLM TTFT (ms) | LLM Decode TPS
Qwen2.5-Omni-3B | RK1828 | 392 × 392 | 220.01 | 93.60 | 169.93 | 104.01
Gemma-4-E2B | RK1828 | 384 × 384 | 62.20 | 103.98 | 99.41 | 70.19
Gemma-4-E4B | RK1828 | 384 × 384 | 77.72 | 119.82 | 169.93 | 51.02

CNN Model Performance

Model Name | Accelerator Chip | Resolution | Single-Core FPS | Multi-Batch Multi-Core FPS
MobilenetV1 | RK182X | 224 × 224 | 388.41 | 1501.34
MobilenetV2 | RK182X | 224 × 224 | 279.67 | 1290.93
Resnet50V2 | RK182X | 224 × 224 | 112.58 | 843.24
YOLOv5s | RK182X | 640 × 640 | 34.54 | 214.49
YOLOv6s | RK182X | 640 × 640 | 30.73 | 203.46
YOLOv8s | RK182X | 640 × 640 | 33.01 | 212.32
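
To compare the two FPS columns, it can help to compute the throughput ratio between the multi-batch multi-core and single-core results, along with the average time per frame (1000 / FPS). The sketch below does this for three rows of the CNN table. The function names are hypothetical, and the ratio combines the effects of batching and core count, since the document states neither the batch size nor the number of cores.

```python
def multi_core_speedup(single_core_fps: float, multi_core_fps: float) -> float:
    """Throughput ratio of the multi-batch multi-core run to the single-core run."""
    return multi_core_fps / single_core_fps


def frame_time_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


# (single-core FPS, multi-batch multi-core FPS) copied from the CNN table.
cnn_fps = {
    "MobilenetV1": (388.41, 1501.34),
    "Resnet50V2": (112.58, 843.24),
    "YOLOv5s": (34.54, 214.49),
}

for name, (single, multi) in cnn_fps.items():
    print(f"{name}: {multi_core_speedup(single, multi):.2f}x throughput, "
          f"{frame_time_ms(single):.2f} ms per frame on a single core")
```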

Performance Notes

  1. RK182X denotes both RK1820 and RK1828.
  2. Qwen2.5-VL-3B: on RK1820, inference runs in two stages, with the LM head executed on the RK3588; on RK1828, the full model runs on the co-processor.
  3. NPU clock frequency of RK1820/RK1828: 1 GHz.
  4. Test platform: RK3588 connected to RK1820/RK1828 over PCIe, with the RK3588 set to performance mode.
  5. TTFT: Time To First Token; TPOT: Time Per Output Token; TPS: Tokens Per Second.
  6. For VLMs, the vision encoder and the LLM are tested separately; LLM tests use 128 input tokens and 128 output tokens.

Model Accuracy

LLM Accuracy

Model Name | Accelerator Chip | Dataset | Original float32 Accuracy | RKNN3 Model (W4A16 G32)
Qwen2.5-0.5B | RK182X | gsm8k | 40.71 | 36.09
Qwen2.5-3B | RK182X | gsm8k | 79.91 | 80.67
Qwen3-4B | RK1828 | gsm8k | 90.6 | 89.84
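
One way to read the accuracy table is to compute the change introduced by quantization, both in absolute points and relative to the float32 baseline. The sketch below does this for the three gsm8k rows. The helper name is hypothetical and the calculation is plain arithmetic on the reported numbers, not part of any RKNN3 tooling.

```python
from typing import Tuple


def quantization_delta(float_acc: float, quant_acc: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)
    of the quantized model versus the float32 baseline."""
    delta = quant_acc - float_acc
    return delta, 100.0 * delta / float_acc


# (float32 accuracy, W4A16 G32 accuracy) on gsm8k, copied from the table.
gsm8k = {
    "Qwen2.5-0.5B": (40.71, 36.09),
    "Qwen2.5-3B": (79.91, 80.67),
    "Qwen3-4B": (90.6, 89.84),
}

for name, (fp32, quant) in gsm8k.items():
    points, percent = quantization_delta(fp32, quant)
    print(f"{name}: {points:+.2f} points ({percent:+.2f}% relative)")
```

A negative value means the quantized model scores below the float32 baseline, and a positive value means it scores above.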

VLM Accuracy

Model Name | Original float32 Model | RKNN3 Model (W4A16 G32)
FastVLM_1.5B | 58.42 | 60.48
Qwen2.5-VL-3B | 76.8 | 75.43
Qwen2.5-VL-7B | 79.98 | 81.19
InternVL3-2B | 77.23 | 72.51
InternVL3_5-4B | 78.69 | 77.75
MiMo-VL-7B-RL | 74.7 | 69.85

CNN Classification Accuracy

Model Name | Dataset | Original Model (TOP-1) | Original Model (TOP-5) | RKNN3 W8A8 (TOP-1) | RKNN3 W8A8 (TOP-5)
MobilenetV1 | ImageNet | 0.677 | 0.877 | 0.676 | 0.876
MobilenetV2 | ImageNet | 0.694 | 0.888 | 0.680 | 0.882
Resnet50V2 | ImageNet | 0.729 | 0.911 | 0.721 | 0.906

Object Detection Model Accuracy

Model Name | Dataset | Original Model AP@0.5:0.95 | Original Model AP@0.5 | RKNN3 W8A8 AP@0.5:0.95 | RKNN3 W8A8 AP@0.5
YOLOv5s | COCO2017 | 0.326 | 0.481 | 0.310 | 0.471
YOLOv6s | COCO2017 | 0.403 | 0.551 | 0.385 | 0.534
YOLOv8s | COCO2017 | 0.39 | 0.525 | 0.380 | 0.513

Glossary

  • TTFT = Time To First Token
  • TPOT = Time Per Output Token
  • TPS = Tokens Per Second
  • FPS = Frames Per Second
  • W4A16 / W8A8 = Quantization schemes: 4-bit weights with 16-bit activations / 8-bit weights with 8-bit activations
  • AP = Average Precision
  • TOP-1 / TOP-5 = Top-k classification accuracy (the correct label is among the k highest-scoring predictions)
  • ASR = Automatic Speech Recognition
  • TTS = Text-to-Speech
  • VLM = Vision-Language Model
  • Omni-Modal = Multi-modal model supporting vision + audio + text
  • Co-processor = NPU accelerator chip (RK1820/RK1828)
  • PCIe = Peripheral Component Interconnect Express
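
To make the W4A16 entry concrete, the sketch below shows one common form of 4-bit weight quantization: weights are split into groups, and each group gets its own scale. It assumes that the "G32" suffix in the accuracy tables means a group size of 32 and that scaling is symmetric. Neither is confirmed by this document, and the actual RKNN3 scheme may differ. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, per-group scale)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Symmetric quantization of one group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed range [-qmax-1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_groupwise(weights: List[float], group_size: int = 32,
                       bits: int = 4) -> List[Group]:
    """Quantize consecutive groups of `group_size` weights independently."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize_groupwise(groups: List[Group]) -> List[float]:
    """Reconstruct approximate float weights from codes and scales."""
    return [code * scale for codes, scale in groups for code in codes]


# Synthetic weights, for illustration only.
weights = [math.sin(i) * (1 + i / 64) for i in range(64)]
groups = quantize_groupwise(weights, group_size=32, bits=4)
restored = dequantize_groupwise(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

Smaller groups let each scale track its local weight range more closely, at the cost of storing more scales. Activations stay at 16 bits in a W4A16 scheme, so only the weights are quantized here.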
