Optimized for NVIDIA GB10 Grace Blackwell

Adaptive Zero-Copy
CPU↔GPU Pipeline

High-performance CUDA library leveraging coherent shared memory. Automatic job routing to the optimal GPU execution mode. Up to 7.6x faster than cudaMemcpy for buffers < 256 KB.

  • 7.6x peak speedup (16 KB)
  • 3 adaptive GPU modes
  • 15.6 GB/s max throughput
  • 0 memory copies (zero-copy)

Adaptive GPU runtime

UnifiedFlow v3.0 transforms an experimental pipeline into a production-ready library that automatically adapts to every workload type.

Zero-Copy NVLink

Direct CPU↔GPU access without cudaMemcpy via coherent unified memory. For small buffers, the explicit transfer step disappears entirely.

🧠

Automatic classification

The PolicyEngine analyzes each job (size, compute intensity) and routes it to the optimal GPU mode. Configurable thresholds, manual override available.

📊

3 specialized GPU modes

Small (batch-pop), Large (1 CTA/job), Tiled (cooperative tile stealing). Each mode has its own persistent kernel and dedicated queue.

🔀

Multi-Queue, no blocking

3 independent queues eliminate head-of-line blocking. Small jobs are never blocked by large ones. Watermarks and backpressure built-in.

🚀

Persistent kernels

All 3 GPU kernels stay active and consume jobs without relaunch. Zero launch overhead, minimal latency.

📑

Simple JobFuture API

Asynchronous submission: submit() returns a JobFuture. Blocking wait, non-blocking polling, or fire-and-forget.

Adaptive data flow

A lightweight runtime that classifies, routes and executes each job with the optimal GPU strategy.

💻

CPU Producer (submit)

The application submits a job via pipeline.submit(data, size, op_kind). Zero configuration required.

🧠

PolicyEngine (classification)

Analyzes size (bytes_in) and compute intensity (FLOP/byte), then automatically selects MODE_SMALL, MODE_LARGE, or MODE_TILED.
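A rough sketch of this routing rule, built from the thresholds shown in the advanced configuration example further down; the helper itself is illustrative, not the library's internal code, and the check ordering is an assumption:

// Illustrative routing rule from the documented thresholds.
// Ordering is an assumption: intensity is checked first so that
// compute-heavy jobs stay on MODE_TILED rather than hybrid DMA.
ExecutionMode classify(size_t bytes_in, double flop_per_byte) {
    if (flop_per_byte >= 8.0)        // intensity_threshold
        return ExecutionMode::MODE_TILED;
    if (bytes_in >= 1024 * 1024)     // memcpy_threshold (hybrid DMA)
        return ExecutionMode::MODE_MEMCPY;
    if (bytes_in >= 512 * 1024)      // big_threshold
        return ExecutionMode::MODE_LARGE;
    return ExecutionMode::MODE_SMALL;
}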

🔀

Multi-Queue (3 lock-free rings)

Q_small, Q_large, Q_tiled — each with independent watermarks, stop/pause signals, and statistics.
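To illustrate the ring concept, here is a minimal single-producer/single-consumer lock-free ring in plain C++; the library's rings are GPU-visible and layer watermarks, signals, and statistics on top of this idea:

#include <atomic>
#include <cstddef>

// Minimal SPSC lock-free ring (sketch). N must be a power of two.
template <typename T, size_t N>
struct Ring {
    T buf[N];
    std::atomic<size_t> head{0}, tail{0};

    bool push(const T& v) {  // producer side (CPU)
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {         // consumer side (GPU kernel in the library)
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;     // empty
        v = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};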

BufferPool (unified memory)

64 buffers x 256 KB in coherent shared memory. Zero-copy access via NVLink-C2C (900 GB/s).
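A minimal sketch of how such a pool can be allocated; the counts match the defaults above, but this is illustrative, not the library's BufferPool code:

#include <cuda_runtime.h>
#include <cstddef>

constexpr int    kNumBuffers     = 64;
constexpr size_t kBytesPerBuffer = 256 * 1024;

// On GB10, managed allocations are cache-coherent over NVLink-C2C,
// so CPU and GPU touch the same bytes without any cudaMemcpy.
void alloc_pool(void* buffers[kNumBuffers]) {
    for (int i = 0; i < kNumBuffers; ++i)
        cudaMallocManaged(&buffers[i], kBytesPerBuffer);
}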

🎬

GPU Persistent Kernels (3 streams)

kernel_small (256 threads, batch-pop), kernel_large (512 threads, streaming), kernel_tiled (256 threads, tile stealing). Parallel execution.
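In outline, the three kernels are launched once, each on its own stream, with the grid shapes listed in the next section; q_small/q_large/q_tiled and stop_flag are placeholder names:

// One launch per mode; the kernels then persist for the pipeline's
// lifetime. Queue and flag arguments are placeholder names.
cudaStream_t s_small, s_large, s_tiled;
cudaStreamCreate(&s_small);
cudaStreamCreate(&s_large);
cudaStreamCreate(&s_tiled);
kernel_small<<<2, 256, 0, s_small>>>(q_small, stop_flag);  // batch-pop
kernel_large<<<4, 512, 0, s_large>>>(q_large, stop_flag);  // streaming
kernel_tiled<<<8, 256, 0, s_tiled>>>(q_tiled, stop_flag);  // tile stealing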

Three specialized strategies

Each mode is optimized for a different workload profile. The PolicyEngine automatically selects the best one.

MODE_SMALL (C)

Batch-Pop

1 CTA = N jobs

Ideal for small buffers (< 64 KB). A single CTA processes multiple jobs sequentially via batch-pop, reducing atomic contention.

  • Buffers < 64 KB
  • 256 threads x 2 CTAs
  • Batch-pop up to 8 jobs
  • Speedup: 4.4x to 7.6x
  • Latency-optimized
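The batch-pop loop in outline; pop_batch and process are hypothetical names, and the real kernel also handles pause signals and statistics:

__global__ void kernel_small_sketch(Ring* q, volatile int* stop) {
    __shared__ Job batch[8];             // up to 8 jobs per pop
    __shared__ int n;
    while (!*stop) {
        if (threadIdx.x == 0)
            n = pop_batch(q, batch, 8);  // one atomic claims the whole batch
        __syncthreads();
        for (int i = 0; i < n; ++i)
            process(&batch[i]);          // all 256 threads on one job
        __syncthreads();                 // before the next batch-pop
    }
}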
MODE_LARGE (A)

Streaming CTA

1 job = 1 CTA

For large buffers (≥ 512 KB). Each CTA takes a complete job and processes it via grid-stride loop streaming.

  • Buffers ≥ 512 KB
  • 512 threads x 4 CTAs
  • Internal grid-stride loop
  • Optimal memory streaming
  • Hybrid cudaMemcpy mode ≥ 1 MB
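The streaming loop in outline; since one CTA owns the whole job, the stride is the CTA width. Element type and transform() are placeholders:

// One CTA strides through its job by blockDim.x (512 threads here).
__device__ void stream_job(const float* in, float* out, size_t n) {
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = transform(in[i]);   // transform() is a placeholder
}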
MODE_TILED (B)

Tile Stealing

1 job = N CTAs

For compute-heavy workloads (matmul, convolution, attention). Multiple CTAs cooperate via atomic tile stealing.

  • Intensity ≥ 8 FLOP/byte
  • 256 threads x 8 CTAs
  • Dynamic tile partitioning
  • Scales with SM count
  • MatMul, Conv, Attention
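The stealing loop in outline; next_tile is a per-job counter and process_tile is a placeholder, so this is a sketch rather than the library's kernel:

__global__ void kernel_tiled_sketch(Job* job, unsigned* next_tile,
                                    unsigned num_tiles) {
    __shared__ unsigned t;
    for (;;) {
        if (threadIdx.x == 0)
            t = atomicAdd(next_tile, 1u);  // one claim per CTA
        __syncthreads();
        if (t >= num_tiles) break;         // every tile has been claimed
        process_tile(job, t);              // the whole CTA works this tile
        __syncthreads();                   // before thread 0 reclaims t
    }
}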

Measured results on GB10

Real benchmarks executed on NVIDIA GB10 Grace Blackwell: CUDA 13.0.88, 20 jobs per size.

📈 Throughput comparison (MB/s)

Size   UnifiedFlow  cudaMemcpy
4 KB   9.84         2.26
16 KB  69.6         9.15
64 KB  233.3        36.7

⚡ Speedup vs optimized cudaMemcpy

Size Speedup Winner
4 KB 4.4x UnifiedFlow
16 KB 7.6x UnifiedFlow
64 KB 6.4x UnifiedFlow
256 KB 1.2x UnifiedFlow
1 MB 0.48x cudaMemcpy
4 MB 0.20x cudaMemcpy
Dominance zone by buffer size
  • UnifiedFlow: 4 KB – 256 KB (4.4x to 7.6x)
  • Transition zone: 256 KB – 1 MB
  • cudaMemcpy (DMA, hybrid mode): ≥ 1 MB

Workload recommendations

The hybrid mode automatically selects the best strategy based on your application profile.

🧠

Real-time inference

LLM tokens, embeddings, small tensors. UnifiedFlow eliminates transfer latency for buffers < 16 KB.

4-16 KB · MODE_SMALL · 4-8x
🎧

Audio processing

Audio segments, spectrograms, Whisper features. Performance sweet spot at 16-64 KB.

16-64 KB · MODE_SMALL · 6-8x
🎨

Video frames

Camera frames, image preprocessing, computer vision. Significant advantage up to 256 KB.

64-256 KB · MODE_SMALL/LARGE · 1.2-6x
🔢

Matrix multiplication

MatMul, convolution, attention. Tiled mode leverages tile stealing to saturate GPU SMs.

Compute-heavy · MODE_TILED · Scalable
📦

Bulk transfers

Large files, datasets, batch ML training. Hybrid mode automatically switches to cudaMemcpy (DMA).

≥ 1 MB · MODE_MEMCPY · Automatic
🚀

High-frequency streaming

Thousands of small jobs/second, IoT, sensors, telemetry. Persistent kernel avoids all launch overhead.

< 4 KB · MODE_SMALL · Zero-launch

Simple yet powerful interface

Submit jobs in a single line. The library handles the rest: classification, routing, execution and result retrieval.

💻 Simple submission

// Initialize the pipeline
PipelineV3 pipeline(cfg);
pipeline.init();

// Submit a job (auto classification)
JobFuture fut = pipeline.submit(data, size,
    OpKind::TRANSFORM);

// Wait for the result
if (fut.wait(1000)) {
    process(fut.data(), fut.bytes());
}
pipeline.release(fut);
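
For the non-blocking path, one plausible pattern is a zero-timeout poll; this assumes wait(0) returns immediately with the completion status, which is not shown on this page:

// Hypothetical non-blocking poll (assumes wait(0) is a pure poll).
while (!fut.wait(0)) {
    do_cpu_work();   // placeholder: overlap CPU work with the GPU job
}
process(fut.data(), fut.bytes());
pipeline.release(fut);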

⚙ Advanced configuration

PipelineV3Config cfg;
cfg.num_buffers = 64;
cfg.bytes_per_buffer = 256 * 1024;

// Classification thresholds
cfg.policy.mode = PolicyMode::HYBRID;
cfg.policy.big_threshold = 512 * 1024;
cfg.policy.intensity_threshold = 8.0;
cfg.policy.memcpy_threshold = 1024 * 1024;

// Streams for hybrid mode
cfg.memcpy_streams = 4;

PipelineV3 pipeline(cfg);

📊 Metrics and statistics

auto st = pipeline.get_stats();

printf("Total: %lu jobs\n",
    st.jobs_processed);
printf("  Small: %lu\n", st.jobs_small);
printf("  Large: %lu\n", st.jobs_large);
printf("  Tiled: %lu\n", st.jobs_tiled);
printf("Tiles: %lu\n",
    st.tiles_processed);
printf("Bytes: %lu\n",
    st.total_bytes);

🚀 Forced submission / batch

// Force a specific mode
auto fut = pipeline.submit_forced(
    data, size,
    ExecutionMode::MODE_TILED);

// Pause / Resume
pipeline.pause();
pipeline.resume();

// Drain all jobs
pipeline.drain(5000);

// Graceful shutdown
pipeline.stop();

Technical specifications

⚡ Target platform

  • NVIDIA GB10 Grace Blackwell (ARM64)
  • NVLink-C2C 900 GB/s coherent
  • CUDA 12.0+ (tested on 13.0.88)
  • Compatible with Hopper and Ampere

🔧 Build

  • C++17, CMake 3.24+
  • Static (.a) and shared (.so) library
  • GCC 9+ / NVCC 12+
  • CUDA Separable Compilation

📊 Runtime

  • 3 persistent kernels (3 streams)
  • 64 buffers x 256 KB (unified memory)
  • 3 lock-free ring buffer queues
  • Hybrid cudaMemcpy mode (≥ 1 MB)

🔒 Robustness

  • Strict buffer ownership (FSM)
  • Backpressure with watermarks
  • Graceful shutdown via poison pills
  • Built-in per-mode metrics

Download UnifiedFlow

Two formats available to integrate UnifiedFlow v3.0.1 PRO into your projects. Compiled for ARM64 (NVIDIA GB10 Grace Blackwell).

📦
.so

Shared Library

Dynamic loading at runtime. Ideal for systems sharing the library across multiple applications.

  • libcpugpu_pipeline_v2.so
  • Dynamic linking
  • Reduced executable size
  • Update without recompilation
⬇ Download .so
📦
.a

Static Library

Embedded directly into the executable. Ideal for standalone deployment with no external dependencies.

  • libcpugpu_pipeline_v2.a
  • Static linking
  • Self-contained executable
  • Zero runtime dependency
⬇ Download .a

ⓘ Platform: ARM64 (aarch64-linux-gnu) — Requires CUDA 12.0+ — C++17