Optimized for NVIDIA GB10 Grace Blackwell

Adaptive Zero-Copy
CPU↔GPU Pipeline

High-performance CUDA library leveraging coherent shared memory. Automatic job routing to the optimal GPU execution mode. Up to 7.6x faster than cudaMemcpy for buffers < 256 KB.

  • 7.6x peak speedup (16 KB)
  • 3 adaptive GPU modes
  • 15.6 GB/s max throughput
  • 0 memory copies (zero-copy)

Adaptive GPU runtime

UnifiedFlow v3.0 transforms an experimental pipeline into a production-ready library that automatically adapts to every workload type.

Zero-Copy NVLink

Direct CPU↔GPU access without cudaMemcpy via coherent unified memory. For small buffers, the explicit transfer step disappears entirely.

🧠

Automatic classification

The PolicyEngine analyzes each job (size, compute intensity) and routes it to the optimal GPU mode. Configurable thresholds, manual override available.

📊

3 specialized GPU modes

Small (batch-pop), Large (1 CTA/job), Tiled (cooperative tile stealing). Each mode has its own persistent kernel and dedicated queue.

🔀

Multi-Queue, no blocking

3 independent queues eliminate head-of-line blocking. Small jobs are never blocked by large ones. Watermarks and backpressure built-in.

🚀

Persistent kernels

All 3 GPU kernels stay active and consume jobs without relaunch. Zero launch overhead, minimal latency.

📑

Simple JobFuture API

Asynchronous submission: submit() returns a JobFuture. Blocking wait, non-blocking polling, or fire-and-forget.

Adaptive data flow

A lightweight runtime that classifies, routes and executes each job with the optimal GPU strategy.

💻

CPU Producer (submit)

The application submits a job via pipeline.submit(data, size, op_kind). Zero configuration required.

🧠

PolicyEngine (classification)

Analyzes size (bytes_in) and compute intensity (FLOP/byte), then automatically selects MODE_SMALL, MODE_LARGE, or MODE_TILED.
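A rough sketch of this routing rule, built from the thresholds shown in the advanced configuration example further down; the helper itself is illustrative, not the library's internal code, and the check ordering is an assumption:

// Illustrative routing rule from the documented thresholds.
// Ordering is an assumption: intensity is checked first so that
// compute-heavy jobs stay on MODE_TILED rather than hybrid DMA.
ExecutionMode classify(size_t bytes_in, double flop_per_byte) {
    if (flop_per_byte >= 8.0)        // intensity_threshold
        return ExecutionMode::MODE_TILED;
    if (bytes_in >= 1024 * 1024)     // memcpy_threshold (hybrid DMA)
        return ExecutionMode::MODE_MEMCPY;
    if (bytes_in >= 512 * 1024)      // big_threshold
        return ExecutionMode::MODE_LARGE;
    return ExecutionMode::MODE_SMALL;
}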

🔀

Multi-Queue (3 lock-free rings)

Q_small, Q_large, Q_tiled — each with independent watermarks, stop/pause signals, and statistics.
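To illustrate the ring concept, here is a minimal single-producer/single-consumer lock-free ring in plain C++; the library's rings are GPU-visible and layer watermarks, signals, and statistics on top of this idea:

#include <atomic>
#include <cstddef>

// Minimal SPSC lock-free ring (sketch). N must be a power of two.
template <typename T, size_t N>
struct Ring {
    T buf[N];
    std::atomic<size_t> head{0}, tail{0};

    bool push(const T& v) {  // producer side (CPU)
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {         // consumer side (GPU kernel in the library)
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;     // empty
        v = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};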

BufferPool (unified memory)

64 buffers x 256 KB in coherent shared memory. Zero-copy access via NVLink-C2C (900 GB/s).
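A minimal sketch of how such a pool can be allocated; the counts match the defaults above, but this is illustrative, not the library's BufferPool code:

#include <cuda_runtime.h>
#include <cstddef>

constexpr int    kNumBuffers     = 64;
constexpr size_t kBytesPerBuffer = 256 * 1024;

// On GB10, managed allocations are cache-coherent over NVLink-C2C,
// so CPU and GPU touch the same bytes without any cudaMemcpy.
void alloc_pool(void* buffers[kNumBuffers]) {
    for (int i = 0; i < kNumBuffers; ++i)
        cudaMallocManaged(&buffers[i], kBytesPerBuffer);
}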

🎬

GPU Persistent Kernels (3 streams)

kernel_small (256 threads, batch-pop), kernel_large (512 threads, streaming), kernel_tiled (256 threads, tile stealing). Parallel execution.
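In outline, the three kernels are launched once, each on its own stream, with the grid shapes listed in the next section; q_small/q_large/q_tiled and stop_flag are placeholder names:

// One launch per mode; the kernels then persist for the pipeline's
// lifetime. Queue and flag arguments are placeholder names.
cudaStream_t s_small, s_large, s_tiled;
cudaStreamCreate(&s_small);
cudaStreamCreate(&s_large);
cudaStreamCreate(&s_tiled);
kernel_small<<<2, 256, 0, s_small>>>(q_small, stop_flag);  // batch-pop
kernel_large<<<4, 512, 0, s_large>>>(q_large, stop_flag);  // streaming
kernel_tiled<<<8, 256, 0, s_tiled>>>(q_tiled, stop_flag);  // tile stealing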

Three specialized strategies

Each mode is optimized for a different workload profile. The PolicyEngine automatically selects the best one.

MODE_SMALL (C)

Batch-Pop

1 CTA = N jobs

Ideal for small buffers (< 64 KB). A single CTA processes multiple jobs sequentially via batch-pop, reducing atomic contention.

  • Buffers < 64 KB
  • 256 threads x 2 CTAs
  • Batch-pop up to 8 jobs
  • Speedup: 4.4x to 7.6x
  • Latency-optimized
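The batch-pop loop in outline; pop_batch and process are hypothetical names, and the real kernel also handles pause signals and statistics:

__global__ void kernel_small_sketch(Ring* q, volatile int* stop) {
    __shared__ Job batch[8];             // up to 8 jobs per pop
    __shared__ int n;
    while (!*stop) {
        if (threadIdx.x == 0)
            n = pop_batch(q, batch, 8);  // one atomic claims the whole batch
        __syncthreads();
        for (int i = 0; i < n; ++i)
            process(&batch[i]);          // all 256 threads on one job
        __syncthreads();                 // before the next batch-pop
    }
}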
MODE_LARGE (A)

Streaming CTA

1 job = 1 CTA

For large buffers (≥ 512 KB). Each CTA takes a complete job and processes it via grid-stride loop streaming.

  • Buffers ≥ 512 KB
  • 512 threads x 4 CTAs
  • Internal grid-stride loop
  • Optimal memory streaming
  • Hybrid cudaMemcpy mode ≥ 1 MB
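The streaming loop in outline; since one CTA owns the whole job, the stride is the CTA width. Element type and transform() are placeholders:

// One CTA strides through its job by blockDim.x (512 threads here).
__device__ void stream_job(const float* in, float* out, size_t n) {
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = transform(in[i]);   // transform() is a placeholder
}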
MODE_TILED (B)

Tile Stealing

1 job = N CTAs

For compute-heavy workloads (matmul, convolution, attention). Multiple CTAs cooperate via atomic tile stealing.

  • Intensity ≥ 8 FLOP/byte
  • 256 threads x 8 CTAs
  • Dynamic tile partitioning
  • Scales with SM count
  • MatMul, Conv, Attention
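The stealing loop in outline; next_tile is a per-job counter and process_tile is a placeholder, so this is a sketch rather than the library's kernel:

__global__ void kernel_tiled_sketch(Job* job, unsigned* next_tile,
                                    unsigned num_tiles) {
    __shared__ unsigned t;
    for (;;) {
        if (threadIdx.x == 0)
            t = atomicAdd(next_tile, 1u);  // one claim per CTA
        __syncthreads();
        if (t >= num_tiles) break;         // every tile has been claimed
        process_tile(job, t);              // the whole CTA works this tile
        __syncthreads();                   // before thread 0 reclaims t
    }
}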

Measured results on GB10

Real benchmarks executed on NVIDIA GB10 Grace Blackwell: CUDA 13.0.88, 20 jobs per size.

📈 Throughput comparison (MB/s)

Size   UnifiedFlow  cudaMemcpy
4 KB   9.84         2.26
16 KB  69.6         9.15
64 KB  233.3        36.7

⚡ Speedup vs optimized cudaMemcpy

Size Speedup Winner
4 KB 4.4x UnifiedFlow
16 KB 7.6x UnifiedFlow
64 KB 6.4x UnifiedFlow
256 KB 1.2x UnifiedFlow
1 MB 0.48x cudaMemcpy
4 MB 0.20x cudaMemcpy
Dominance zone by buffer size
  • UnifiedFlow: 4 KB – 256 KB (4.4x to 7.6x)
  • Transition zone: 256 KB – 1 MB
  • cudaMemcpy (DMA, hybrid mode): ≥ 1 MB

Workload recommendations

The hybrid mode automatically selects the best strategy based on your application profile.

🧠

Real-time inference

LLM tokens, embeddings, small tensors. UnifiedFlow eliminates transfer latency for buffers < 16 KB.

4-16 KB · MODE_SMALL · 4-8x
🎧

Audio processing

Audio segments, spectrograms, Whisper features. Performance sweet spot at 16-64 KB.

16-64 KB · MODE_SMALL · 6-8x
🎨

Video frames

Camera frames, image preprocessing, computer vision. Significant advantage up to 256 KB.

64-256 KB · MODE_SMALL/LARGE · 1.2-6x
🔢

Matrix multiplication

MatMul, convolution, attention. Tiled mode leverages tile stealing to saturate GPU SMs.

Compute-heavy · MODE_TILED · Scalable
📦

Bulk transfers

Large files, datasets, batch ML training. Hybrid mode automatically switches to cudaMemcpy (DMA).

≥ 1 MB · MODE_MEMCPY · Automatic
🚀

High-frequency streaming

Thousands of small jobs/second, IoT, sensors, telemetry. Persistent kernel avoids all launch overhead.

< 4 KB · MODE_SMALL · Zero-launch

Simple yet powerful interface

Submit jobs in a single line. The library handles the rest: classification, routing, execution and result retrieval.

💻 Simple submission

// Initialize the pipeline
PipelineV3 pipeline(cfg);
pipeline.init();

// Submit a job (auto classification)
JobFuture fut = pipeline.submit(data, size,
    OpKind::TRANSFORM);

// Wait for the result
if (fut.wait(1000)) {
    process(fut.data(), fut.bytes());
}
pipeline.release(fut);
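
For the non-blocking path, one plausible pattern is a zero-timeout poll; this assumes wait(0) returns immediately with the completion status, which is not shown on this page:

// Hypothetical non-blocking poll (assumes wait(0) is a pure poll).
while (!fut.wait(0)) {
    do_cpu_work();   // placeholder: overlap CPU work with the GPU job
}
process(fut.data(), fut.bytes());
pipeline.release(fut);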

⚙ Advanced configuration

PipelineV3Config cfg;
cfg.num_buffers = 64;
cfg.bytes_per_buffer = 256 * 1024;

// Classification thresholds
cfg.policy.mode = PolicyMode::HYBRID;
cfg.policy.big_threshold = 512 * 1024;
cfg.policy.intensity_threshold = 8.0;
cfg.policy.memcpy_threshold = 1024 * 1024;

// Streams for hybrid mode
cfg.memcpy_streams = 4;

PipelineV3 pipeline(cfg);

📊 Metrics and statistics

auto st = pipeline.get_stats();

printf("Total: %lu jobs\n",
    st.jobs_processed);
printf("  Small: %lu\n", st.jobs_small);
printf("  Large: %lu\n", st.jobs_large);
printf("  Tiled: %lu\n", st.jobs_tiled);
printf("Tiles: %lu\n",
    st.tiles_processed);
printf("Bytes: %lu\n",
    st.total_bytes);

🚀 Forced submission / batch

// Force a specific mode
auto fut = pipeline.submit_forced(
    data, size,
    ExecutionMode::MODE_TILED);

// Pause / Resume
pipeline.pause();
pipeline.resume();

// Drain all jobs
pipeline.drain(5000);

// Graceful shutdown
pipeline.stop();

Technical specifications

⚡ Target platform

  • NVIDIA GB10 Grace Blackwell (ARM64)
  • NVLink-C2C 900 GB/s coherent
  • CUDA 12.0+ (tested on 13.0.88)
  • Compatible with Hopper and Ampere

🔧 Build

  • C++17, CMake 3.24+
  • Static (.a) and shared (.so) library
  • GCC 9+ / NVCC 12+
  • CUDA Separable Compilation

📊 Runtime

  • 3 persistent kernels (3 streams)
  • 64 buffers x 256 KB (unified memory)
  • 3 lock-free ring buffer queues
  • Hybrid cudaMemcpy mode (≥ 1 MB)

🔒 Robustness

  • Strict buffer ownership (FSM)
  • Backpressure with watermarks
  • Graceful shutdown via poison pills
  • Built-in per-mode metrics

Download UnifiedFlow

Two formats available to integrate UnifiedFlow v3.0.1 PRO into your projects. Compiled for ARM64 (NVIDIA GB10 Grace Blackwell).

📦
.so

Shared Library

Dynamic loading at runtime. Ideal for systems sharing the library across multiple applications.

  • libcpugpu_pipeline_v2.so
  • Dynamic linking
  • Reduced executable size
  • Update without recompilation
⬇ Download .so
📦
.a

Static Library

Embedded directly into the executable. Ideal for standalone deployment with no external dependencies.

  • libcpugpu_pipeline_v2.a
  • Static linking
  • Self-contained executable
  • Zero runtime dependency
⬇ Download .a

ⓘ Platform: ARM64 (aarch64-linux-gnu) — Requires CUDA 12.0+ — C++17