Supported Models
Large Language Models (LLMs)
| Model Name | Model Source |
|---|---|
| Qwen2.5-0.5B | https://huggingface.co/Qwen/Qwen2.5-0.5B |
| Qwen2.5-3B | https://huggingface.co/Qwen/Qwen2.5-3B-Instruct |
| Qwen2.5-7B | https://huggingface.co/Qwen/Qwen2.5-7B-Instruct |
| Qwen3-0.6B | https://huggingface.co/Qwen/Qwen3-0.6B |
| Qwen3-1.7B | https://huggingface.co/Qwen/Qwen3-1.7B |
| Qwen3-4B | https://huggingface.co/Qwen/Qwen3-4B |
| Qwen3-8B | https://huggingface.co/Qwen/Qwen3-8B |
| HY-MT1.5-1.8B | https://huggingface.co/tencent/HY-MT1.5-1.8B |
| Youtu-LLM-2B | https://huggingface.co/tencent/Youtu-LLM-2B |
| GLM-Edge-1.5B-Chat | https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat |
Vision-Language Multimodal Models (VLMs)
| Model Name | Model Source |
|---|---|
| Qwen2.5-VL-3B | https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct |
| Qwen2.5-VL-7B | https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct |
| Qwen2.5-Omni-3B (Thinker) | https://huggingface.co/Qwen/Qwen2.5-Omni-3B |
| Qwen3-VL-2B | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Qwen3-VL-4B | https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct |
| FastVLM | https://github.com/apple/ml-fastvlm |
| InternVL3-2B | https://huggingface.co/OpenGVLab/InternVL3-2B |
| InternVL3_5-4B | https://huggingface.co/OpenGVLab/InternVL3_5-4B-Instruct |
| MiMo-VL-7B-RL | https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL |
| Gemma-4-E2B | https://huggingface.co/google/gemma-4-E2B-it |
| Gemma-4-E4B | https://huggingface.co/google/gemma-4-E4B-it |
| SmolVLM-500M-Instruct | https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct |
| SmolVLM2-500M-Video-Instruct | https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct |
| UI-TARS-2B-SFT | https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT |
| PaddleOCR VL | https://huggingface.co/PaddlePaddle/PaddleOCR-VL |
Retrieval / Embedding Models
| Model Name | Model Source |
|---|---|
| Qwen3-Reranker-0.6B | https://huggingface.co/Qwen/Qwen3-Reranker-0.6B |
| Qwen3-Reranker-4B | https://huggingface.co/Qwen/Qwen3-Reranker-4B |
| Qwen3-Embedding-0.6B | https://huggingface.co/Qwen/Qwen3-Embedding-0.6B |
| Qwen3-Embedding-4B | https://huggingface.co/Qwen/Qwen3-Embedding-4B |
| gme-Qwen2-VL-2B-Instruct | https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct |
Speech Models (ASR / TTS)
| Model Name | Model Source |
|---|---|
| Qwen3-ASR-0.6B | https://huggingface.co/Qwen/Qwen3-ASR-0.6B |
| Qwen3-TTS-12Hz-1.7B | https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| VITS | https://github.com/jaywalnut310/vits |
| Whisper | https://huggingface.co/openai/whisper-large-v3 |
| SenseVoiceSmall | https://modelscope.cn/models/iic/SenseVoiceSmall |
| Zipformer | https://huggingface.co/pfluo/k2fsa-zipformer-chinese-english-mixed |
Image Feature / Vision Encoder Models
| Model Name | Model Source |
|---|---|
| SigLIP | https://huggingface.co/google/siglip-so400m-patch14-384 |
| Siglip2-so400m | https://huggingface.co/google/siglip2-so400m-patch14-384 |
| MetaCLIP2 | https://huggingface.co/facebook/metaclip-2-worldwide-m16-384 |
| Dinov3 | https://huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m |
| Depth-Anything-V2-small | https://huggingface.co/depth-anything/Depth-Anything-V2-Small |
| GR00T-N1.6-3B | https://huggingface.co/nvidia/GR00T-N1.6-3B |
Traditional CNN Vision Models (Classification / Detection)
| Model Name | Model Source |
|---|---|
| MobilenetV1 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet_v1/mobilenet_v1_1.0_224.tflite |
| MobilenetV2 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/mobilenet/mobilenetv2-12.onnx |
| Resnet50V2 | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/resnet/resnet50-v2-7.onnx |
| YOLOv5s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov5/yolov5s_rknn3.onnx |
| YOLOv6s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov6/yolov6s_rknn3.onnx |
| YOLOv8s | https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/yolov8/yolov8s_rknn3.onnx |
Model Performance
LLM Performance
| Model Name | Accelerator Chip | TTFT (ms) | TPOT (ms) | Decode TPS |
|---|---|---|---|---|
| Qwen2.5-0.5B | RK182X | 22.74 | 4.48 | 223.40 |
| Qwen2.5-1.5B | RK182X | 49.14 | 6.69 | 149.39 |
| Qwen2.5-3B | RK182X | 85.54 | 9.69 | 103.24 |
| Qwen2.5-7B | RK1828 | 162.25 | 14.19 | 70.47 |
| Qwen3-0.6B | RK182X | 28.61 | 5.49 | 182.26 |
| Qwen3-1.7B | RK1828 | 54.34 | 7.17 | 139.39 |
| Qwen3-4B | RK1828 | 109.78 | 11.30 | 88.47 |
| Qwen3-8B | RK1828 | 182.20 | 16.30 | 61.34 |
Test Conditions: Input Tokens = 128, New Tokens = 128
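The TPOT and Decode TPS columns describe the same measurement from two sides: throughput is approximately the reciprocal of per-token latency. The sketch below shows that arithmetic and a rough end-to-end estimate for a single request. The function names are illustrative, and the model "total time = TTFT + one TPOT per remaining token" is a simplifying assumption, not a description of how the benchmark was timed.

```python
def tps_from_tpot(tpot_ms: float) -> float:
    """Convert time per output token (ms) into tokens per second."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


def estimate_total_ms(ttft_ms: float, tpot_ms: float, new_tokens: int) -> float:
    """Rough generation time: the first token costs TTFT, each later token costs TPOT."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    return ttft_ms + (new_tokens - 1) * tpot_ms


# Qwen2.5-0.5B row from the table: TTFT 22.74 ms, TPOT 4.48 ms.
tps = tps_from_tpot(4.48)
total = estimate_total_ms(22.74, 4.48, new_tokens=128)
print(f"{tps:.2f} tokens/s, about {total:.1f} ms for 128 new tokens")
```

For this row the reciprocal gives roughly 223.2 tokens/s, close to the reported 223.40. The small gap is presumably due to rounding of the published TPOT value.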
VLM Performance
| Model | Accelerator Chip | Vision Resolution | Vision Latency (ms) | LLM TTFT (ms) | LLM Decode TPS |
|---|---|---|---|---|---|
| FastVLM_1.5B_stage3 | RK182X | 512 × 512 | 168.85 | 49.83 | 151.01 |
| InternVL3-2B | RK182X | 448 × 448 | 184.19 | 49.85 | 147.62 |
| InternVL3_5-4B | RK1828 | 448 × 448 | 176.99 | 110.06 | 87.95 |
| Qwen2.5-VL-3B | RK182X | 392 × 392 | 231.4 | 97.85 | 51.48 |
| Qwen2.5-VL-3B | RK1828 | 392 × 392 | 212.28 | 87.6 | 104.05 |
| Qwen2.5-VL-7B | RK1828 | 392 × 392 | 215.63 | 163.5 | 69.95 |
| Qwen3-VL-2B | RK182X | 384 × 384 | 114.38 | 56.55 | 142.00 |
| Qwen3-VL-4B | RK1828 | 384 × 384 | 117.6 | 111.45 | 87.8 |
| MiMo-VL-7B-RL | RK1828 | 392 × 392 | 216.56 | 173.59 | 64.97 |
| MiniCPM_V_4 | RK1828 | 448 × 448 | 236.67 | 97.81 | 106.56 |
Omni-Modal Model Performance
| Model | Accelerator Chip | Vision Resolution | Vision Latency (ms) | Audio Latency (ms) | LLM TTFT (ms) | LLM Decode TPS |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | RK1828 | 392 × 392 | 220.01 | 93.60 | 169.93 | 104.01 |
| Gemma-4-E2B | RK1828 | 384 × 384 | 62.20 | 103.98 | 99.41 | 70.19 |
| Gemma-4-E4B | RK1828 | 384 × 384 | 77.72 | 119.82 | 169.93 | 51.02 |
CNN Model Performance
| Model Name | Accelerator Chip | Resolution | Single-Core FPS | Multi-Batch Multi-Core FPS |
|---|---|---|---|---|
| MobilenetV1 | RK182X | 224 × 224 | 388.41 | 1501.34 |
| MobilenetV2 | RK182X | 224 × 224 | 279.67 | 1290.93 |
| Resnet50V2 | RK182X | 224 × 224 | 112.58 | 843.24 |
| YOLOv5s | RK182X | 640 × 640 | 34.54 | 214.49 |
| YOLOv6s | RK182X | 640 × 640 | 30.73 | 203.46 |
| YOLOv8s | RK182X | 640 × 640 | 33.01 | 212.32 |
Performance Notes
- RK182X covers RK1820 / RK1828.
- Qwen2.5-VL-3B: on RK1820, inference is split into two stages (the LM head runs on the RK3588); on RK1828, the full model runs on the co-processor.
- NPU clock frequency of RK1820/RK1828: 1 GHz.
- Test Platform: RK3588 + RK1820/RK1828 via PCIe; RK3588 set to performance mode.
- TTFT: Time To First Token; TPOT: Time Per Output Token; TPS: Tokens Per Second.
- The VLM vision encoder and the LLM are tested separately; the LLM is run with 128 input tokens and 128 output tokens.
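Because the encoder stages and the LLM are measured separately, the tables give no single "time to first response" figure for a multimodal request. A rough estimate is the sum of the stage latencies, as sketched below. This assumes the stages run strictly one after another with no overlap and that the real prompt length matches the 128-token test setup. Neither assumption is confirmed here, and the helper name is hypothetical.

```python
def estimate_first_token_ms(vision_ms: float, llm_ttft_ms: float,
                            audio_ms: float = 0.0) -> float:
    """Sum per-stage latencies, assuming the stages execute sequentially."""
    for value in (vision_ms, llm_ttft_ms, audio_ms):
        if value < 0:
            raise ValueError("latencies must be non-negative")
    return vision_ms + audio_ms + llm_ttft_ms


# Qwen3-VL-2B row: vision latency 114.38 ms, LLM TTFT 56.55 ms.
vlm_ms = estimate_first_token_ms(114.38, 56.55)

# Gemma-4-E2B row: vision 62.20 ms, audio 103.98 ms, LLM TTFT 99.41 ms.
omni_ms = estimate_first_token_ms(62.20, 99.41, audio_ms=103.98)

print(f"VLM: about {vlm_ms:.2f} ms, omni-modal: about {omni_ms:.2f} ms")
```

Treat the result as an order-of-magnitude guide only: host-side preprocessing and PCIe transfers are not included in this sum.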
Model Accuracy
LLM Accuracy
| Model Name | Accelerator Chip | Dataset | Original float32 Accuracy | RKNN3 Model (W4A16 G32) |
|---|---|---|---|---|
| Qwen2.5-0.5B | RK182X | gsm8k | 40.71 | 36.09 |
| Qwen2.5-3B | RK182X | gsm8k | 79.91 | 80.67 |
| Qwen3-4B | RK1828 | gsm8k | 90.6 | 89.84 |
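To read the table as a quantization cost, it helps to compute the change from the float32 baseline to the quantized model. The helper below is an illustrative sketch (the function name is not from the source). It reports both the absolute change in accuracy points and the change relative to the baseline.

```python
from typing import Tuple


def accuracy_delta(float_acc: float, quant_acc: float) -> Tuple[float, float]:
    """Return (change in accuracy points, change as a percentage of the baseline)."""
    if float_acc <= 0:
        raise ValueError("float_acc must be positive")
    abs_delta = quant_acc - float_acc
    rel_delta = 100.0 * abs_delta / float_acc
    return abs_delta, rel_delta


# (float32 accuracy, W4A16 G32 accuracy) on gsm8k, copied from the table.
rows = {
    "Qwen2.5-0.5B": (40.71, 36.09),
    "Qwen2.5-3B": (79.91, 80.67),
    "Qwen3-4B": (90.6, 89.84),
}

for name, (fp32, quant) in rows.items():
    points, percent = accuracy_delta(fp32, quant)
    print(f"{name}: {points:+.2f} points ({percent:+.2f}% relative)")
```

On these three rows, the smallest model shows the largest drop, while the two larger models stay within one point of their float32 baselines.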
VLM Accuracy
| Model Name | Original float32 Model | RKNN3 Model (W4A16 G32) |
|---|---|---|
| FastVLM_1.6B | 58.42 | 60.48 |
| Qwen2.5-VL-3B | 76.8 | 75.43 |
| Qwen2.5-VL-7B | 79.98 | 81.19 |
| InternVL3_2B | 77.23 | 72.51 |
| InternVL3_5-4B | 78.69 | 77.75 |
| mimo_vl_7b | 74.7 | 69.85 |
CNN Classification Accuracy
| Model Name | Dataset | Original Model (TOP-1) | Original Model (TOP-5) | RKNN3 W8A8 (TOP-1) | RKNN3 W8A8 (TOP-5) |
|---|---|---|---|---|---|
| MobilenetV1 | ImageNet | 0.677 | 0.877 | 0.676 | 0.876 |
| MobilenetV2 | ImageNet | 0.694 | 0.888 | 0.680 | 0.882 |
| Resnet50V2 | ImageNet | 0.729 | 0.911 | 0.721 | 0.906 |
Object Detection Model Accuracy
Glossary
- TTFT = Time To First Token
- TPOT = Time Per Output Token
- TPS = Tokens Per Second
- FPS = Frames Per Second
- W4A16 / W8A8 = quantization schemes: 4-bit weights with 16-bit activations / 8-bit weights with 8-bit activations
- AP = Average Precision
- TOP-1 / TOP-5 = classification accuracy where the correct label is the top prediction / among the top five predictions
- ASR = Automatic Speech Recognition
- TTS = Text-to-Speech
- VLM = Vision-Language Model
- Omni-Modal = Multi-modal model supporting vision + audio + text
- Co-processor: NPU accelerator chip (RK1820/RK1828)
- PCIe = Peripheral Component Interconnect Express
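To make the W4A16 entry concrete, here is a minimal sketch of symmetric group-wise 4-bit weight quantization. It assumes that the "G32" suffix in the accuracy tables means 32 weights share one scale, which is a common convention but is not defined in this document. It is not RKNN3's actual algorithm: it covers only the weight side, and in a W4A16 scheme the activations would stay in 16-bit form.

```python
from typing import List, Tuple

GROUP_SIZE = 32  # assumption: "G32" means 32 weights share one scale


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one group: integers in [-8, 7] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]


def quantize_weights(weights: List[float],
                     group_size: int = GROUP_SIZE) -> List[Tuple[List[int], float]]:
    """Split a flat weight list into groups and quantize each group independently."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


# Synthetic weights in [-0.5, 0.5], purely for illustration.
weights = [((i * 37) % 101 - 50) / 100.0 for i in range(64)]
groups = quantize_weights(weights)
restored = [w for q, s in groups for w in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max abs reconstruction error {max_err:.4f}")
```

Giving each group its own scale keeps the rounding error proportional to that group's largest weight instead of the largest weight in the whole tensor. That is the usual motivation for small group sizes.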