RK3576 AI Example

This page introduces the RK3576 AI examples: precompiled RKNN demos for Linux that use the platform's dual NPU cores for real-time object detection, instance segmentation, pose estimation, speech recognition, text-to-speech, OCR, and multimodal inference. It explains how to run each demo with its model paths, camera devices, and audio devices.

RKNN Demo User Guide

This document describes how to use the precompiled RKNN demo programs on the RK3576 Linux platform.

Directory Structure

dfrobot-ai-demo/
├── rknn_yolov5_demo/                  # YOLOv5 Object Detection
├── rknn_yolov5_seg_demo/              # YOLOv5 Object Detection + Instance Segmentation
├── rknn_yolov8_demo/                  # YOLOv8 Object Detection
├── rknn_yolov8_seg_demo/              # YOLOv8 Object Detection + Instance Segmentation
├── rknn_yolov8_pose_demo/             # YOLOv8 Human Pose Estimation
├── rknn_yolo11_demo/                  # YOLO11 Object Detection
├── rknn_clip_demo/                    # CLIP Image-Text Retrieval
├── rknn_whisper_demo/                 # Whisper Speech Recognition
├── rknn_zipformer_demo/               # Zipformer Speech Recognition
├── rknn_mms_tts_demo/                 # MMS-TTS Text-to-Speech
├── rknn_RetinaFace_demo/              # RetinaFace Face Detection
├── rknn_PPOCR-System_demo/            # PPOCR Optical Character Recognition System
└── rknn_llm_multimodal_model_demo/    # Qwen2-VL-2B Multimodal Model

General Usage Instructions

Environment Preparation

  • Hardware platform: RK3576 Linux aarch64 system
  • Demo path: /dfrobot-ai-demo
  • Dependencies: All required dynamic libraries for the demos are included in their respective lib/ directories
  • Model files: Corresponding RKNN model files must be placed in the model/ directory
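Before running a demo, it can help to sanity-check that its directory has the expected layout. A minimal sketch (the `check_demo_dir` helper and its messages are illustrative, not part of the demos):

```shell
# check_demo_dir: verify a demo directory contains lib/, model/, and at
# least one .rknn model file. Illustrative helper only.
check_demo_dir() {
    dir="$1"
    [ -d "$dir" ]       || { echo "missing: $dir"; return 1; }
    [ -d "$dir/lib" ]   || { echo "missing: $dir/lib"; return 1; }
    [ -d "$dir/model" ] || { echo "missing: $dir/model"; return 1; }
    ls "$dir/model"/*.rknn >/dev/null 2>&1 \
        || { echo "no .rknn model in $dir/model"; return 1; }
    echo "ok: $dir"
}

check_demo_dir /dfrobot-ai-demo/rknn_yolov5_demo || true
```

On the board, a correctly prepared demo directory prints `ok: …`; anything else points to the missing piece.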

How to Run

Enter the corresponding demo directory and run the executable directly:

cd rknn_xxx_demo
./rknn_xxx_demo [parameters]

View Help

Run an executable without arguments to print its usage information:

./rknn_xxx_demo

Detailed Introduction to Demos

YOLOv5 Object Detection (rknn_yolov5_demo)

Function: Real-time object detection, supporting detection of 80 categories of objects from the COCO dataset.

Key Features:

  • Supports real-time detection via camera (uses the RK3576 ISP, which exposes the same camera as two video devices with different pixel formats, NV12 and RGB)
  • Multi-threaded inference accelerated by dual NPU cores
  • DRM display output

Usage:

cd rknn_yolov5_demo
# Use the default camera device
./rknn_yolov5_demo model/yolov5.rknn

# Specify camera devices and display device
./rknn_yolov5_demo model/yolov5.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).
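Because the trailing parameters fall back to the documented defaults, a launcher script only needs to pass what differs. A sketch using shell default expansion (the `build_cmd` wrapper is illustrative; it echoes the command line instead of executing it, so it can be inspected first):

```shell
# build_cmd: assemble the demo command line; unset arguments fall back to
# the documented defaults via ${N:-default} expansion. Illustrative only.
build_cmd() {
    model="${1:-model/yolov5.rknn}"
    nv12="${2:-/dev/video11}"
    rgb="${3:-/dev/video12}"
    connector="${4:-0}"
    echo "./rknn_yolov5_demo $model $nv12 $rgb $connector"
}

build_cmd                                      # all defaults
build_cmd model/yolov5.rknn /dev/video21       # override only nv12_device
```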

YOLOv5 Instance Segmentation (rknn_yolov5_seg_demo)

Function: Object detection + instance segmentation, capable of detecting objects and generating segmentation masks simultaneously.

Key Features:

  • Real-time object detection and segmentation
  • Color mask overlay display
  • Multi-threaded inference optimization
  • Fixed display resolution of 720P

Usage:

cd rknn_yolov5_seg_demo
# Use the default camera device
./rknn_yolov5_seg_demo model/yolov5_seg.rknn

# Specify camera devices and display device
./rknn_yolov5_seg_demo model/yolov5_seg.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

YOLOv8 Object Detection (rknn_yolov8_demo)

Function: YOLOv8 real-time object detection, with accuracy and speed improvements over YOLOv5.

Key Features:

  • Higher detection accuracy and speed
  • Dual NPU core parallel inference
  • Real-time display of detection results

Usage:

cd rknn_yolov8_demo
# Use the default camera device
./rknn_yolov8_demo model/yolov8n.rknn

# Specify camera devices and display device
./rknn_yolov8_demo model/yolov8n.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

YOLOv8 Instance Segmentation (rknn_yolov8_seg_demo)

Function: YOLOv8 object detection + instance segmentation.

Key Features:

  • High-precision object detection and segmentation
  • Optimized segmentation mask rendering
  • Multi-threaded post-processing acceleration
  • Fixed display resolution of 720P

Usage:

cd rknn_yolov8_seg_demo
# Use the default camera device
./rknn_yolov8_seg_demo model/yolov8_seg.rknn

# Specify camera devices and display device
./rknn_yolov8_seg_demo model/yolov8_seg.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

YOLOv8 Human Pose Estimation (rknn_yolov8_pose_demo)

Function: Real-time human pose estimation, detecting human keypoints and skeletons.

Key Features:

  • Detects the 17 COCO keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles
  • Automatically draws skeleton connection lines
  • Real-time pose visualization

Usage:

cd rknn_yolov8_pose_demo
# Use the default camera device
./rknn_yolov8_pose_demo model/yolov8_pose.rknn

# Specify camera devices and display device
./rknn_yolov8_pose_demo model/yolov8_pose.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

YOLO11 Object Detection (rknn_yolo11_demo)

Function: YOLO11 object detection.

Key Features:

  • Higher detection accuracy
  • Optimized inference performance

Usage:

cd rknn_yolo11_demo
# Use the default camera device
./rknn_yolo11_demo model/yolo11.rknn

# Specify camera devices and display device
./rknn_yolo11_demo model/yolo11.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

CLIP Image-Text Matching (rknn_clip_demo)

Function: Uses the CLIP model to perform similarity matching between images and text.

Key Features:

  • Interactive image search
  • Text-based image library query

Parameter Description

$ ./rknn_clip_demo_search
Usage: ./rknn_clip_demo_search <image_model_path> <text_model_path> <image_dir>
Example: ./rknn_clip_demo_search ./model/clip_images.rknn ./model/clip_text.rknn ./images

Interactive mode: Enter text query to search for matching images, or 'quit' to exit.
  • image_model_path: Path to the image encoding model
  • text_model_path: Path to the text encoding model
  • image_dir: Path to the image directory
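The score printed by the demo is a similarity between the text embedding and each image embedding, which is how the best match is ranked. As a rough illustration of the idea (not the demo's actual code — real CLIP embeddings have hundreds of dimensions), cosine similarity between two small vectors can be computed with awk:

```shell
# cosine: cosine similarity of two whitespace-separated vectors.
# Illustrative only; shows the ranking metric, not the demo's implementation.
cosine() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x); split(b, y)
        for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
        printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))
    }'
}

cosine "1 0 1" "1 0 1"   # identical direction -> 1.000
cosine "1 0" "0 1"       # orthogonal -> 0.000
```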

Usage Example

$ cd rknn_clip_demo
# Interactive usage
# Enter text query, e.g.: "a cat"
# Enter "quit" or "exit" to exit
$ ./rknn_clip_demo_search model/clip_images.rknn model/clip_text.rknn images/
=== Ready for Interactive Search ===
Enter text query to search for matching images (or 'quit' to exit):
> a cat

Searching for: "a cat"
Processing 3 images...
rga_api version 1.10.1_[0]

=== Search Results ===
Processed 3 images
Best match:
  Path:  images//cat.jpg
  Score: 0.262

> quit
Exiting...

Whisper Speech Recognition (rknn_whisper_demo)

Function: Uses the Whisper model for speech recognition, supporting both Chinese and English.

Key Features:

  • Supports real-time audio recording via microphone
  • Chinese and English speech recognition
  • High-precision transcription

Parameter Description

$ cd rknn_whisper_demo
$ ./rknn_whisper_demo_mic
Usage: ./rknn_whisper_demo_mic <encoder_path> <decoder_path> <task> [duration_sec] [alsa_device]
  encoder_path:   Path to encoder RKNN model
  decoder_path:   Path to decoder RKNN model
  task:           Recognition task: "en" (English) or "zh" (Chinese)
  duration_sec:   Audio capture duration in seconds (default: 20, max: 20)
  alsa_device:    ALSA device name (default: "plughw:0,0")
                  Recommended: "plughw:0,0" (auto format conversion)
                  Alternatives: "default", "hw:0,0" (direct hardware)

Note: If only 4 arguments are provided and the 4th argument looks like an ALSA device
      (starts with 'plughw:', 'hw:', or 'default'), it will be treated as alsa_device.
  • encoder_path: Path to the encoder model
  • decoder_path: Path to the decoder model
  • task: en for English recognition, zh for Chinese recognition
  • duration_sec: Audio recording duration (seconds), default: 20 seconds
  • alsa_device: ALSA device, default: "plughw:0,0"
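The note above about the ambiguous 4th argument amounts to a small classification rule. A sketch of the same heuristic (our re-implementation for illustration, not the demo's source):

```shell
# classify_arg4: decide whether a 4th argument is an ALSA device or a
# duration, mirroring the heuristic described in the usage text.
classify_arg4() {
    case "$1" in
        plughw:*|hw:*|default*) echo "alsa_device" ;;
        *)                      echo "duration_sec" ;;
    esac
}

classify_arg4 "plughw:0,0"   # -> alsa_device
classify_arg4 "5"            # -> duration_sec
```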

Usage Example

# English recognition with 5-second recording
./rknn_whisper_demo_mic model/whisper_encoder_base_20s.rknn model/whisper_decoder_base_20s.rknn en 5

# Chinese recognition with specified microphone device
./rknn_whisper_demo_mic model/whisper_encoder_base_20s.rknn model/whisper_decoder_base_20s.rknn zh 10 plughw:0,0

Output: The recognition results will be displayed in the terminal.

Zipformer Speech Recognition (rknn_zipformer_demo)

Function: Uses the Zipformer model for speech recognition.

Key Features:

  • High-efficiency speech recognition model
  • Supports microphone input
  • Real-time recognition

Parameter Description

$ ./rknn_zipformer_demo_mic
Usage: ./rknn_zipformer_demo_mic <encoder_path> <decoder_path> <joiner_path> [duration_sec] [alsa_device]
  encoder_path:   Path to encoder RKNN model
  decoder_path:   Path to decoder RKNN model
  joiner_path:    Path to joiner RKNN model
  duration_sec:   Audio capture duration in seconds (default: 5.0)
  alsa_device:    ALSA device name (default: "plughw:0,0")
                  Recommended: "plughw:0,0" (auto format conversion)
                  Alternatives: "default", "hw:0,0" (direct hardware)
  • encoder_path: Path to the encoder model
  • decoder_path: Path to the decoder model
  • joiner_path: Path to the joiner model
  • duration_sec: Audio recording duration (seconds)
  • alsa_device: ALSA device, default: "plughw:0,0"

Usage Example

# Default 5-second recording
./rknn_zipformer_demo_mic model/encoder-epoch-99-avg-1.rknn model/decoder-epoch-99-avg-1.rknn model/joiner-epoch-99-avg-1.rknn

# Specify recording duration and device
./rknn_zipformer_demo_mic model/encoder-epoch-99-avg-1.rknn model/decoder-epoch-99-avg-1.rknn model/joiner-epoch-99-avg-1.rknn 10 plughw:0,0

MMS-TTS Text-to-Speech (rknn_mms_tts_demo)

Function: Converts text to speech and plays it back.

Key Features:

  • Interactive text input
  • Playback via onboard speaker

Parameter Description

$ ./rknn_mms_tts_demo_interactive
Usage: ./rknn_mms_tts_demo_interactive <encoder_path> <decoder_path> [alsa_device]
Example: ./rknn_mms_tts_demo_interactive model/mms_tts_eng_encoder_200.rknn model/mms_tts_eng_decoder_200.rknn plughw:0,0

Interactive mode: Enter text to convert to speech, or 'quit' to exit.
  • encoder_path: Path to the encoder model
  • decoder_path: Path to the decoder model
  • alsa_device: ALSA device (default: "plughw:0,0")

Usage Example

$ ./rknn_mms_tts_demo_interactive model/mms_tts_eng_encoder_200.rknn model/mms_tts_eng_decoder_200.rknn plughw:0,0
> The weather is really nice today!

Inference time: 0.502 sec, Audio length: 1.072 sec, RTF: 0.469
Playing audio...
Target sample rate: 44100 Hz, actual: 44100 Hz (requested: 16000 Hz)
Using 2 channels (stereo)
Resampling from 16000 Hz to 44100 Hz
Converting from 1 channels to 2 channels
Audio device opened successfully: plughw:0,0
Sample rate: 44100 Hz, Channels: 2, Format: S16_LE
Period size: 512 frames, Buffer size: 2048 frames
Audio data: 47275 frames, 1.07 seconds
Audio data range: [-0.0020, 0.0015] (first 1000 samples)
Warning: Audio data appears to be very quiet or silent!
Playback already running after writing 512 frames
Total frames written: 47275 / 47275
Draining PCM buffer...
PCM state before drain: 3 (0=SND_PCM_STATE_OPEN, 1=RUNNING, 2=XRUN, 3=DRAINING, 4=PAUSED, 5=SUSPENDED, 6=DISCONNECTED, 7=SETUP, 8=PREPARED)
PCM is running, waiting for playback to complete...
Final PCM state: 1
Audio playback complete.

> quit
Exiting...
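The RTF (real-time factor) in the output above is inference time divided by generated audio length; values below 1.0 mean synthesis runs faster than real time. The figure can be reproduced with awk:

```shell
# rtf: real-time factor = inference_time / audio_length (illustrative).
rtf() {
    awk -v inf="$1" -v aud="$2" 'BEGIN { printf "%.3f\n", inf / aud }'
}

rtf 0.502 1.072   # -> 0.468 (the demo prints 0.469 from its unrounded timings)
```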

Notes

  • Text length is limited to 200 characters.
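When feeding text from a script rather than interactively, the 200-character limit can be checked up front. A trivial pre-check sketch (the limit itself is enforced by the demo; this helper is illustrative):

```shell
# check_len: flag input longer than the 200-character limit before
# sending it to the demo. Illustrative pre-check only.
check_len() {
    if [ "${#1}" -le 200 ]; then
        echo "ok (${#1} chars)"
    else
        echo "too long (${#1} chars)"
    fi
}

check_len "The weather is really nice today!"   # -> ok (33 chars)
```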

RetinaFace Face Detection (rknn_RetinaFace_demo)

Function: High-precision face detection that can detect multiple faces and label keypoints.

Key Features:

  • Real-time face detection
  • Facial keypoint detection (5 points)
  • Multi-threaded optimization

Usage:

cd rknn_RetinaFace_demo
# Use the default camera device
./rknn_retinaface_demo model/RetinaFace.rknn

# Specify camera devices and display device
./rknn_retinaface_demo model/RetinaFace.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description:

  • model_path: Path to the RKNN model file
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

PPOCR Optical Character Recognition System (rknn_PPOCR-System_demo)

Function: OCR (Optical Character Recognition) that can detect and recognize text in images.

Key Features:

  • Text detection + text recognition
  • Supports Chinese and English recognition
  • Real-time OCR processing

Usage:

cd rknn_PPOCR-System_demo
# Use the default camera device
./rknn_ppocr_system_demo model/ppocrv4_det.rknn model/ppocrv4_rec.rknn
# Specify camera devices and display device
./rknn_ppocr_system_demo model/ppocrv4_det.rknn model/ppocrv4_rec.rknn [nv12_device] [rgb_device] [connector_id]

Parameter Description

  • det_model_path: Path to the text detection model
  • rec_model_path: Path to the text recognition model
  • nv12_device: NV12 format camera device (optional, default: /dev/video11)
  • rgb_device: RGB format camera device (optional, default: /dev/video12)
  • connector_id: DRM connector ID (optional, default: 0, auto-selected)

Exit: Press Ctrl+C to view statistical information (number of inference frames, running time, average FPS).

Multimodal Large Language Model (rknn_llm_multimodal_model_demo)

Function: This example uses rknn-llm/examples/multimodal_model_demo to demonstrate the deployment of a multimodal model on the RK3576 platform.

Key Features:

  • Built-in open-source model: Qwen2-VL-2B

Usage Example

cd rknn_llm_multimodal_model_demo
export LD_LIBRARY_PATH=./lib

# Run imgenc
# ./imgenc model/rknn/qwen2-vl-vision_rk3576.rknn demo.jpg 3

# Run demo (Multimodal Example)
./demo demo.jpg model/rknn/qwen2-vl-vision_rk3576.rknn model/rkllm/qwen2-vl-2B-llm_rk3576.rkllm 2048 4096 3 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"

# Multimodal Example
user: <image>What is in the image?
robot: The image depicts an astronaut on the moon, enjoying a beer. The background shows the Earth and stars, creating a surreal and futuristic scene.

# Plain Text Example
user: Who are you?

Common Issues

Camera Device Not Found

Issue: Error message: camera_source_init fail

Solutions:

# List available camera devices
v4l2-ctl --list-devices

# Check device formats
v4l2-ctl --list-formats-ext --device=/dev/video11

Display Device Issues

Issue: Error message: drm_display_init fail

Solutions:

  • Available displays will be listed automatically when running the demo
  • Using connector_id=0 will automatically select the first available display
  • Or specify a specific connector_id

Microphone Device Issues

Issue: Speech recognition demos fail to record audio

Solutions:

# List available audio devices
aplay -l
arecord -l

# Try different ALSA devices:
# plughw:0,0 (recommended, automatic format conversion)
# hw:0,0 (direct hardware access)
# default (system default)

Missing Model Files

Issue: Error message indicating model file does not exist

Solutions:

  • Ensure model files are placed in the model/ directory
  • Verify the model file path is correct
  • Confirm the model file format is .rknn

Dynamic Library Loading Failure

Issue: Error message: error while loading shared libraries

Solutions:

# Ensure the lib directory exists and contains required .so files
# Set LD_LIBRARY_PATH (usually set automatically)
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
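To see which shared libraries a demo binary still fails to resolve after setting LD_LIBRARY_PATH, `ldd` can be used. A sketch of a helper (the `missing_libs` name is ours, for illustration):

```shell
# missing_libs: print the shared libraries a binary cannot resolve;
# prints a confirmation line when everything is found. Illustrative helper.
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found" || echo "all libraries resolved"
}

missing_libs ./rknn_yolov5_demo
```

Run it from the demo directory; any line it prints names a .so that must be added to lib/ or to LD_LIBRARY_PATH.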

Performance Optimization Guide

Multi-Threaded Architecture

Most vision detection demos adopt a multi-threaded architecture:

  • Camera thread: Captures NV12 and RGB images
  • Inference thread: Parallel inference using dual NPU cores (Core 0 + Core 1)
  • Post-processing thread: Dual-thread parallel post-processing and rendering
  • Display thread: DRM display output
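As a loose analogy (processes and pipes instead of pthreads and buffer queues), the stages above form a pipeline in which each stage runs concurrently and consumes the previous stage's output:

```shell
# Toy pipeline analogy: each stage hands "frames" to the next, much as the
# demo threads pass buffers through queues.
# (Illustration only; the demos use pthreads, not shell processes.)
seq 1 3 \
  | while read -r f; do echo "frame $f: inferred"; done \
  | while read -r r; do echo "$r -> displayed"; done
```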

Performance Statistics

Press Ctrl+C while running the demo to view performance statistics:

  • Total inference frames
  • Running time
  • Average inference FPS
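The average FPS reported at exit is simply total frames divided by elapsed seconds, which can be checked by hand:

```shell
# avg_fps: average FPS = total inference frames / running time in seconds
# (illustrative reproduction of the exit statistic).
avg_fps() {
    awk -v frames="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", frames / secs }'
}

avg_fps 900 30   # 900 frames in 30 s -> 30.0 FPS
```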
