Reference

Supported Models

Model Name	Model Source
Qwen2.5-0.5B	https://huggingface.co/Qwen/Qwen2.5-0.5B
Qwen2.5-3B	https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
Qwen2.5-7B	https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Qwen3-0.6B	https://huggingface.co/Qwen/Qwen3-0.6B
Qwen3-1.7B	https://huggingface.co/Qwen/Qwen3-1.7B
Qwen3-4B	https://huggingface.co/Qwen/Qwen3-4B
Qwen3-8B	https://huggingface.co/Qwen/Qwen3-8B
HY-MT1.5-1.8B	https://huggingface.co/tencent/HY-MT1.5-1.8B
Youtu-LLM-2B	https://huggingface.co/tencent/Youtu-LLM-2B
GLM-Edge-1.5B-Chat	https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat

Model Name	Model Source
Qwen2.5-VL-3B	https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Qwen2.5-VL-7B	https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-Omni-3B (Thinker)	https://huggingface.co/Qwen/Qwen2.5-Omni-3B
Qwen3-VL-2B	https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
Qwen3-VL-4B	https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
FastVLM	https://github.com/apple/ml-fastvlm
InternVL3-2B	https://huggingface.co/OpenGVLab/InternVL3-2B
InternVL3_5-4B	https://huggingface.co/OpenGVLab/InternVL3_5-4B-Instruct
MiMo-VL-7B-RL	https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL
Gemma-4-E2B	https://huggingface.co/google/gemma-4-E2B-it
Gemma-4-E4B	https://huggingface.co/google/gemma-4-E4B-it
SmolVLM-500M-Instruct	https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
SmolVLM2-500M-Video-Instruct	https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
UI-TARS-2B-SFT	https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT
PaddleOCR-VL	https://huggingface.co/PaddlePaddle/PaddleOCR-VL

Model Name	Model Source
Qwen3-Reranker-0.6B	https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
Qwen3-Reranker-4B	https://huggingface.co/Qwen/Qwen3-Reranker-4B
Qwen3-Embedding-0.6B	https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Qwen3-Embedding-4B	https://huggingface.co/Qwen/Qwen3-Embedding-4B
gme-Qwen2-VL-2B-Instruct	https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct

Model Name	Model Source
Qwen3-ASR-0.6B	https://huggingface.co/Qwen/Qwen3-ASR-0.6B
Qwen3-TTS-12Hz-1.7B	https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base
VITS	https://github.com/jaywalnut310/vits
Whisper	https://huggingface.co/openai/whisper-large-v3
SenseVoiceSmall	https://modelscope.cn/models/iic/SenseVoiceSmall
Zipformer	https://huggingface.co/pfluo/k2fsa-zipformer-chinese-english-mixed

Model Name	Model Source
SigLIP	https://huggingface.co/google/siglip-so400m-patch14-384
SigLIP2-so400m	https://huggingface.co/google/siglip2-so400m-patch14-384
MetaCLIP2	https://huggingface.co/facebook/metaclip-2-worldwide-m16-384
DINOv3	https://huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m
Depth-Anything-V2-small	https://huggingface.co/depth-anything/Depth-Anything-V2-Small
GR00T-N1.6-3B	https://huggingface.co/nvidia/GR00T-N1.6-3B

Model Name	Model Source
MobilenetV1	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet_v1/mobilenet_v1_1.0_224.tflite
MobilenetV2	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet/mobilenetv2-12.onnx
Resnet50V2	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/resnet/resnet50-v2-7.onnx
YOLOv5s	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov5/yolov5s_rknn3.onnx
YOLOv6s	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov6/yolov6s_rknn3.onnx
YOLOv8s	https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov8/yolov8s_rknn3.onnx

Model Name	Accelerator Chip	TTFT (ms)	TPOT (ms)	Decode TPS
Qwen2.5-0.5B	RK182X	22.74	4.48	223.40
Qwen2.5-1.5B	RK182X	49.14	6.69	149.39
Qwen2.5-3B	RK182X	85.54	9.69	103.24
Qwen2.5-7B	RK1828	162.25	14.19	70.47
Qwen3-0.6B	RK182X	28.61	5.49	182.26
Qwen3-1.7B	RK1828	54.34	7.17	139.39
Qwen3-4B	RK1828	109.78	11.30	88.47
Qwen3-8B	RK1828	182.20	16.30	61.34

Test Conditions: Input Tokens = 128, New Tokens = 128
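
The TPOT and Decode TPS columns above are closely related: the reported throughput is approximately 1000 / TPOT. The following is a minimal sketch, not part of any RKNN3 tooling, that checks this relation and estimates total generation time under the stated test conditions. The function names are hypothetical, and the estimate assumes TPOT stays constant over the whole generation and that TTFT already includes producing the first token.

```python
# Figures copied from the LLM table above (milliseconds / tokens per second).
LLM_BENCHMARKS = {
    "Qwen2.5-0.5B": {"ttft_ms": 22.74, "tpot_ms": 4.48, "decode_tps": 223.40},
    "Qwen2.5-7B": {"ttft_ms": 162.25, "tpot_ms": 14.19, "decode_tps": 70.47},
    "Qwen3-8B": {"ttft_ms": 182.20, "tpot_ms": 16.30, "decode_tps": 61.34},
}


def tps_from_tpot(tpot_ms: float) -> float:
    """Decode throughput (tokens/s) implied by a per-token latency in ms."""
    return 1000.0 / tpot_ms


def estimate_generation_ms(ttft_ms: float, tpot_ms: float, new_tokens: int) -> float:
    """Rough end-to-end latency: the first token, then (new_tokens - 1) decode steps.

    Assumption (not from the source): TPOT is constant for every decode step.
    """
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    return ttft_ms + (new_tokens - 1) * tpot_ms


if __name__ == "__main__":
    for name, row in LLM_BENCHMARKS.items():
        implied = tps_from_tpot(row["tpot_ms"])
        total = estimate_generation_ms(row["ttft_ms"], row["tpot_ms"], new_tokens=128)
        print(
            f"{name}: implied {implied:.2f} TPS (table: {row['decode_tps']:.2f}), "
            f"about {total:.0f} ms for 128 new tokens"
        )
```

Small differences between the implied and tabulated TPS are expected, since the table values are rounded.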

Model	Accelerator Chip	Vision Resolution	Vision Latency (ms)	LLM TTFT (ms)	LLM Decode TPS
FastVLM_1.5B_stage3	RK182X	512 × 512	168.85	49.83	151.01
InternVL3-2B	RK182X	448 × 448	184.19	49.85	147.62
InternVL3_5-4B	RK1828	448 × 448	176.99	110.06	87.95
Qwen2.5-VL-3B	RK182X	392 × 392	231.4	97.85	51.48
Qwen2.5-VL-3B	RK1828	392 × 392	212.28	87.6	104.05
Qwen2.5-VL-7B	RK1828	392 × 392	215.63	163.5	69.95
Qwen3-VL-2B	RK182X	384 × 384	114.38	56.55	142.00
Qwen3-VL-4B	RK1828	384 × 384	117.6	111.45	87.8
MiMo-VL-7B-RL	RK1828	392 × 392	216.56	173.59	64.97
MiniCPM_V_4	RK1828	448 × 448	236.67	97.81	106.56

Model	Accelerator Chip	Vision Resolution	Vision Latency (ms)	Audio Latency (ms)	LLM TTFT (ms)	LLM Decode TPS
Qwen2.5-Omni-3B	RK1828	392 × 392	220.01	93.60	169.93	104.01
Gemma-4-E2B	RK1828	384 × 384	62.20	103.98	99.41	70.19
Gemma-4-E4B	RK1828	384 × 384	77.72	119.82	169.93	51.02

Model Name	Accelerator Chip	Resolution	Single-Core FPS	Multi-Batch Multi-Core FPS
MobilenetV1	RK182X	224 × 224	388.41	1501.34
MobilenetV2	RK182X	224 × 224	279.67	1290.93
Resnet50V2	RK182X	224 × 224	112.58	843.24
YOLOv5s	RK182X	640 × 640	34.54	214.49
YOLOv6s	RK182X	640 × 640	30.73	203.46
YOLOv8s	RK182X	640 × 640	33.01	212.32
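
To read the two FPS columns above side by side, it can help to convert them into a mean time per frame and a scaling factor. This is a small illustrative sketch with hypothetical helper names. Note that 1000 / FPS for the multi-batch, multi-core column is an average interval between completed frames, not the latency of a single frame.

```python
# (single_core_fps, multi_batch_multi_core_fps) copied from the table above.
CV_BENCHMARKS = {
    "MobilenetV1": (388.41, 1501.34),
    "Resnet50V2": (112.58, 843.24),
    "YOLOv5s": (34.54, 214.49),
}


def frame_time_ms(fps: float) -> float:
    """Mean time per processed frame in milliseconds."""
    return 1000.0 / fps


def scaling_factor(single_core_fps: float, multi_core_fps: float) -> float:
    """Throughput gain of the multi-batch, multi-core run over a single core."""
    return multi_core_fps / single_core_fps


if __name__ == "__main__":
    for name, (single, multi) in CV_BENCHMARKS.items():
        print(
            f"{name}: {frame_time_ms(single):.2f} ms/frame on one core, "
            f"{scaling_factor(single, multi):.2f}x throughput with multi-batch multi-core"
        )
```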

RK182X refers to both RK1820 and RK1828.
Qwen2.5-VL-3B: on RK1820, inference runs in two stages, with the LMHead executed on the RK3588; on RK1828, the full model runs on the co-processor.
NPU clock frequency of RK1820/RK1828: 1 GHz.
Test platform: RK3588 connected to RK1820/RK1828 via PCIe, with the RK3588 set to performance mode.
TTFT: Time To First Token; TPOT: Time Per Output Token; TPS: Tokens Per Second.
The VLM vision encoder and the LLM are benchmarked separately; the LLM is tested with 128 input tokens and 128 output tokens.

Model Name	Accelerator Chip	Dataset	Original float32 Accuracy (%)	RKNN3 W4A16 G32 Accuracy (%)
Qwen2.5-0.5B	RK182X	gsm8k	40.71	36.09
Qwen2.5-3B	RK182X	gsm8k	79.91	80.67
Qwen3-4B	RK1828	gsm8k	90.6	89.84
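
The effect of quantization in the table above is easier to compare across models as an absolute and a relative change. A minimal sketch of that arithmetic follows; the helper name is hypothetical and the values are copied from the table.

```python
# (float32 accuracy, RKNN3 W4A16 G32 accuracy) on gsm8k, copied from the table above.
LLM_ACCURACY = {
    "Qwen2.5-0.5B": (40.71, 36.09),
    "Qwen2.5-3B": (79.91, 80.67),
    "Qwen3-4B": (90.6, 89.84),
}


def accuracy_change(fp32_acc: float, quant_acc: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = quant_acc - fp32_acc
    relative = 100.0 * absolute / fp32_acc
    return absolute, relative


if __name__ == "__main__":
    for name, (fp32_acc, quant_acc) in LLM_ACCURACY.items():
        absolute, relative = accuracy_change(fp32_acc, quant_acc)
        print(f"{name}: {absolute:+.2f} points ({relative:+.2f}% relative)")
```

A positive change, as for Qwen2.5-3B, means the quantized model scored slightly higher than the float32 original on this benchmark.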

Model Name	Dataset	Original Model (TOP-1)	Original Model (TOP-5)	RKNN3 W8A8 (TOP-1)	RKNN3 W8A8 (TOP-5)
MobilenetV1	ImageNet	0.677	0.877	0.676	0.876
MobilenetV2	ImageNet	0.694	0.888	0.680	0.882
Resnet50V2	ImageNet	0.729	0.911	0.721	0.906
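
For reference, TOP-1 and TOP-5 in the table above are top-k classification accuracies: a sample counts as correct if its true label is among the k highest-scoring classes. The sketch below shows the metric on toy data. It is a generic illustration with a hypothetical function name, not the evaluation script used to produce these numbers.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy example: 3 samples, 4 classes.
    toy_scores = [
        [0.1, 0.6, 0.2, 0.1],
        [0.5, 0.1, 0.3, 0.1],
        [0.2, 0.3, 0.4, 0.1],
    ]
    toy_labels = [1, 2, 0]
    for k in (1, 2, 3):
        print(f"top-{k} accuracy: {top_k_accuracy(toy_scores, toy_labels, k):.3f}")
```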

Model Name	Dataset	Original Model AP@0.5:0.95	Original Model AP@0.5	RKNN3 W8A8 AP@0.5:0.95	RKNN3 W8A8 AP@0.5
YOLOv5s	COCO2017	0.326	0.481	0.310	0.471
YOLOv6s	COCO2017	0.403	0.551	0.385	0.534
YOLOv8s	COCO2017	0.39	0.525	0.380	0.513

TTFT = Time To First Token
TPOT = Time Per Output Token
TPS = Tokens Per Second
FPS = Frames Per Second
W4A16 / W8A8: quantization with 4-bit weights and 16-bit activations / 8-bit weights and 8-bit activations
AP = Average Precision
TOP-1 / TOP-5: Top-k classification accuracy
ASR = Automatic Speech Recognition
TTS = Text-to-Speech
VLM = Vision-Language Model
Omni-Modal = Multi-modal model supporting vision + audio + text
Co-processor: NPU accelerator chip (RK1820/RK1828)
PCIe = Peripheral Component Interconnect Express
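
As an illustration of the W4A16 entry above, the sketch below quantizes weights to 4-bit integer codes with one scale per group of 32 values. It assumes that the "G32" suffix in the LLM accuracy table denotes a group size of 32, which the source does not state, and it uses a simple symmetric scheme chosen for clarity. It is not RKNN3's actual quantization algorithm, and all names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 32  # assumed meaning of "G32"; not confirmed by the source
QMAX = 7         # symmetric 4-bit codes restricted to [-7, 7]


def quantize_groupwise(
    weights: List[float], group_size: int = GROUP_SIZE
) -> Tuple[List[int], List[float]]:
    """Quantize weights to 4-bit integer codes with one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group gets a dummy scale so division stays defined.
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-QMAX, min(QMAX, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(
    codes: List[int], scales: List[float], group_size: int = GROUP_SIZE
) -> List[float]:
    """Reconstruct approximate weights from codes and per-group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


if __name__ == "__main__":
    # Two groups with very different magnitudes, to show why per-group scales help.
    weights = [(i - 16) / 16 for i in range(32)] + [(i - 16) / 160 for i in range(32)]
    codes, scales = quantize_groupwise(weights)
    restored = dequantize_groupwise(codes, scales)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"scales: {[round(s, 4) for s in scales]}, worst error: {worst:.4f}")
```

With per-group scales, the second group (values ten times smaller) keeps the same relative precision as the first, which a single scale for all 64 weights would not provide.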