This guide is based on a real-world troubleshooting process to get ComfyUI-Nunchaku working smoothly with a ComfyUI portable installation, with the end goal of running the single-file model svdq-int4_r32-flux.1-dev.safetensors. Many users hit dependency issues along the way; this is for anyone affected by that process.
Disclaimer: This guide isn't official. It's a community-driven effort based on extensive troubleshooting. Always back up your files before making changes.
The key innovation here is SVDQuant (https://github.com/mit-han-lab/nunchaku), which optimizes inference for this model far beyond simple quantization and preserves the original flux1-dev quality at 4-bit precision. Nunchaku isn't just another Flux model accelerator; it's a dedicated inference engine built around that technique.
Why this guide?
The official Nunchaku PyPI release can lag behind, and installing it directly can cause dependency conflicts, especially around filterpy and PyTorch versions. This guide instead uses a specific development release that resolves these issues.
Target Environment:
ComfyUI Portable (with embedded Python)
Python 3.12
PyTorch 2.7.1+cu128 (or similar +cu12x version)
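You can confirm your portable install matches this environment before touching anything (a quick sanity check; run it from inside python_embeded and adjust the path to your setup):
Bash
python.exe -c "import sys, torch; print(sys.version); print(torch.__version__, torch.version.cuda)"
(It should report Python 3.12.x and a 2.7.x PyTorch build with a cu12x CUDA version.)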
NVIDIA GPU Compatibility Notes:
NVIDIA categorizes GPU compatibility by architecture, not strictly by series numbers.
Blackwell Architecture (expected in RTX 50 series and beyond): This is the architecture that introduces dedicated hardware acceleration for FP4 (via 5th-gen Tensor Cores). Models heavily relying on FP4 for speed will see their full benefits here.
Ada Lovelace (RTX 40 series) & Ampere (RTX 30 series): These architectures are highly capable, featuring Tensor Cores with dedicated support for FP8, BF16, and FP16. However, they do not have specific hardware for native FP4 acceleration. While they can process data quantized in INT4 or FP4, they do so through emulation or by converting the data to a precision they do natively support (like FP16/BF16) before calculation.
Older series (e.g., RTX 20 series or GTX 16 series): Compatibility for advanced features like INT4/FP4 might be limited or nonexistent, often requiring FP32 or FP16.
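If you're not sure which architecture your card belongs to, PyTorch can report its CUDA compute capability (a quick check from python_embeded; as a rough map, Turing is 7.5, Ampere 8.x, Ada Lovelace 8.9, and Blackwell parts report 10.x or 12.x):
Bash
python.exe -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"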
Step-by-Step Installation Guide:
Close ComfyUI: Ensure your ComfyUI application is completely shut down before starting.
Open a terminal for the embedded Python: in Command Prompt or PowerShell, navigate to your ComfyUI_windows_portable\python_embeded directory.
Example:
Bash
cd E:\ComfyUI_windows_portable\python_embeded
Uninstall problematic previous dependencies: This cleans up any prior failed attempts or conflicting versions.
Bash
python.exe -m pip uninstall nunchaku insightface facexlib filterpy diffusers accelerate onnxruntime -y
(Ignore "Skipping" messages for packages not installed.)
Install the specific Nunchaku development wheel: This is crucial as it's a pre-built package that bypasses common compilation issues and is compatible with PyTorch 2.7 and Python 3.12.
Bash
python.exe -m pip install https://github.com/mit-han-lab/nunchaku/releases/download/v0.3.1dev20250609/nunchaku-0.3.1.dev20250609+torch2.7-cp312-cp312-win_amd64.whl
(Note: win_amd64 refers to 64-bit Windows, not AMD CPUs. It's correct for Intel CPUs on 64-bit Windows systems).
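Before moving on, it's worth checking that the wheel actually imports (a one-line test; not every build defines __version__, so the getattr fallback simply confirms the import succeeded):
Bash
python.exe -c "import nunchaku; print(getattr(nunchaku, '__version__', 'import OK'))"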
Install facexlib: After installing the Nunchaku wheel, the facexlib dependency for some optional nodes (like PuLID) might still be missing. Install it directly.
Bash
python.exe -m pip install facexlib
Install insightface: insightface is another crucial dependency for Nunchaku's face-related nodes (such as PuLID). It might not be fully pulled in by the previous steps.
Bash
python.exe -m pip install insightface
Install onnxruntime: insightface relies on onnxruntime to run ONNX models. Ensure it's installed.
Bash
python.exe -m pip install onnxruntime
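With all three packages in place, a single import test will tell you whether anything is still missing (any ImportError here names the package that needs attention):
Bash
python.exe -c "import facexlib, insightface, onnxruntime; print('facexlib / insightface / onnxruntime OK')"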
Verify your installation:
Close the terminal.
Start ComfyUI via run_nvidia_gpu.bat or run_nvidia_gpu_fast_fp16_accumulation.bat (or your usual start script) from E:\ComfyUI_windows_portable\.
Check the console output: There should be no ModuleNotFoundError or ImportError messages related to Nunchaku or its dependencies at startup.
Check the ComfyUI GUI: in the interface, open the node menu (right-click the canvas and choose Add Node, or double-click to search) and verify that all Nunchaku nodes, including NunchakuPulidApply and NunchakuPulidLoader, are visible and can be added to your workflow. You should see 9 Nunchaku nodes.
Important Notes:
ComfyUI-Nunchaku now includes a wheel-installer node that can update Nunchaku for you going forward, simplifying maintenance.
You can find example workflows in the workflows_examples folder located at E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-nunchaku\. These JSON files can be loaded directly into ComfyUI to demonstrate how to use Nunchaku's nodes.
While performance optimizations like xformers exist, they can sometimes complicate installations due to strict version dependencies and potential need for "rollback" procedures. For most users, the steps above are sufficient to get Nunchaku fully functional.
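If you suspect a leftover xformers install is part of your problem, you can check whether (and in which version) it is present before deciding on a rollback:
Bash
python.exe -m pip show xformers
(If pip reports the package is not found, there is no xformers install to worry about.)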
Understanding INT4/FP4 Performance on GPUs:
It's vital to understand that while the svdq-int4_r32-flux.1-dev model uses INT4 quantization (which reduces size and memory), the maximum benefit in raw calculation speed and efficiency for ultra-low precisions like FP4 is only achieved on NVIDIA GPUs with the Blackwell architecture (expected in the RTX 50 series and beyond), which features dedicated FP4 hardware in its 5th-gen Tensor Cores.
For RTX 30 Series (Ampere) & RTX 40 Series (Ada Lovelace) GPUs:
As covered in the compatibility notes above, these GPUs offer excellent Tensor Core support for FP8, BF16, and FP16 but have no dedicated FP4 hardware: INT4/FP4 data is processed through emulation or by upcasting to a natively supported precision (FP16/BF16) before calculation. You will therefore not see significant speed improvements from the "FP4" aspect on these GPUs.
The primary benefit on these GPUs will be lower VRAM consumption. This allows you to load the model more easily or potentially work with slightly higher resolutions if VRAM was previously a bottleneck.
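To gauge how much headroom the INT4 savings give you on your particular card, you can query its total VRAM (nvidia-smi ships with the NVIDIA driver):
Bash
nvidia-smi --query-gpu=name,memory.total --format=csv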
Rather than chasing FP4 speed-ups on hardware that lacks them, your time and energy are better spent following this guide to get the functionality and general optimizations Nunchaku does offer through its broader features.
Why svdq-int4_r32-flux.1-dev is the Recommended Choice for RTX 30/40 Series (Non-Blackwell) GPUs:
The svdq-int4_r32-flux.1-dev.safetensors model (the r32 suffix refers to the rank of SVDQuant's low-rank correction branch) uses a mixed-precision strategy specifically optimized for efficiency and quality on existing GPUs like the RTX 30 and RTX 40 series. While it leverages INT4 for significant model size reduction and VRAM savings, it keeps BF16 (bfloat16) layers for the critical parts of the model.
This combination is ideal for your GPU because:
Excellent BF16/FP16 Support: RTX 30 (Ampere) and RTX 40 (Ada Lovelace) series GPUs have excellent native and efficient hardware support for BF16 and FP16 operations within their Tensor Cores. This ensures that the parts of the model requiring higher numerical precision (e.g., fine details in character generation) are processed efficiently, maintaining high visual quality without relying on FP4 hardware.
VRAM Efficiency from INT4: The INT4 layers still provide substantial VRAM savings, enabling you to load the model on GPUs with less memory or push to higher resolutions or batch sizes where VRAM was previously a limitation.
In short, this model provides an optimal balance: lower VRAM consumption (thanks to INT4 for storage/transfer) and high image quality (thanks to efficient BF16 processing where needed), leveraging your current GPU's capabilities effectively. This is why we recommend using this specific model for most home setups with RTX 30 or RTX 40 series GPUs.
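As a final check that the BF16 path described above is actually available on your hardware, PyTorch exposes a direct query (run from python_embeded; RTX 30 and RTX 40 series cards should report True):
Bash
python.exe -c "import torch; print(torch.cuda.is_bf16_supported())"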