High-performance CUDA library leveraging coherent shared memory. Automatic job routing to the optimal GPU execution mode. Up to 7.6x faster than cudaMemcpy for buffers < 256 KB.
UnifiedFlow v3.0 transforms an experimental pipeline into a production-ready library that automatically adapts to every workload type.
Direct CPU↔GPU access without cudaMemcpy via coherent unified memory, eliminating explicit transfer latency for small buffers.
The PolicyEngine analyzes each job (size, compute intensity) and routes it to the optimal GPU mode. Configurable thresholds, manual override available.
Small (batch-pop), Large (1 CTA/job), Tiled (cooperative tile stealing). Each mode has its own persistent kernel and dedicated queue.
3 independent queues eliminate head-of-line blocking. Small jobs are never blocked by large ones. Watermarks and backpressure built-in.
All 3 GPU kernels stay active and consume jobs without relaunch. Zero launch overhead, minimal latency.
Asynchronous submission with submit() returning a JobFuture.
Wait, non-blocking polling, or fire-and-forget.
A lightweight runtime that classifies, routes and executes each job with the optimal GPU strategy.
The application submits a job via pipeline.submit(data, size, op_kind).
Zero configuration required.
Analyzes size (bytes_in) and compute intensity (FLOP/byte), then automatically selects MODE_SMALL, MODE_LARGE, or MODE_TILED.
Q_small, Q_large, Q_tiled — each with independent watermarks, stop/pause signals, and statistics.
64 buffers x 256 KB in coherent shared memory. Zero-copy access via NVLink-C2C (900 GB/s).
kernel_small (256 threads, batch-pop), kernel_large (512 threads, streaming), kernel_tiled (256 threads, tile stealing). Parallel execution.
Each mode is optimized for a different workload profile. The PolicyEngine automatically selects the best one.
Ideal for small buffers (< 64 KB). A single CTA processes multiple jobs sequentially via batch-pop, reducing atomic contention.
For large buffers (≥ 512 KB). Each CTA takes a complete job and processes it via grid-stride loop streaming.
For compute-heavy workloads (matmul, convolution, attention). Multiple CTAs cooperate via atomic tile stealing.
Real benchmarks executed on NVIDIA GB10 Grace Blackwell. CUDA 13.0.88, 20 jobs per size.
| Size | Speedup vs cudaMemcpy | Winner |
|---|---|---|
| 4 KB | 4.4x | UnifiedFlow |
| 16 KB | 7.6x | UnifiedFlow |
| 64 KB | 6.4x | UnifiedFlow |
| 256 KB | 1.2x | UnifiedFlow |
| 1 MB | 0.48x | cudaMemcpy |
| 4 MB | 0.20x | cudaMemcpy |
The hybrid mode automatically selects the best strategy based on your application profile.
LLM tokens, embeddings, small tensors. UnifiedFlow eliminates transfer latency for buffers < 16 KB.
Audio segments, spectrograms, Whisper features. Performance sweet spot at 16-64 KB.
Camera frames, image preprocessing, computer vision. Significant advantage up to 256 KB.
MatMul, convolution, attention. Tiled mode leverages tile stealing to saturate GPU SMs.
Large files, datasets, batch ML training. Hybrid mode automatically switches to cudaMemcpy (DMA).
Thousands of small jobs/second, IoT, sensors, telemetry. Persistent kernel avoids all launch overhead.
Submit jobs in a single line. The library handles the rest: classification, routing, execution and result retrieval.
```cpp
// Initialize the pipeline
PipelineV3 pipeline(cfg);
pipeline.init();

// Submit a job (auto classification)
JobFuture fut = pipeline.submit(data, size, OpKind::TRANSFORM);

// Wait for the result
if (fut.wait(1000)) {
    process(fut.data(), fut.bytes());
}
pipeline.release(fut);
```
```cpp
PipelineV3Config cfg;
cfg.num_buffers = 64;
cfg.bytes_per_buffer = 256 * 1024;

// Classification thresholds
cfg.policy.mode = PolicyMode::HYBRID;
cfg.policy.big_threshold = 512 * 1024;
cfg.policy.intensity_threshold = 8.0;
cfg.policy.memcpy_threshold = 1024 * 1024;

// Streams for hybrid mode
cfg.memcpy_streams = 4;

PipelineV3 pipeline(cfg);
```
```cpp
auto st = pipeline.get_stats();
printf("Total: %lu jobs\n", st.jobs_processed);
printf("  Small: %lu\n", st.jobs_small);
printf("  Large: %lu\n", st.jobs_large);
printf("  Tiled: %lu\n", st.jobs_tiled);
printf("Tiles: %lu\n", st.tiles_processed);
printf("Bytes: %lu\n", st.total_bytes);
```
```cpp
// Force a specific mode
auto fut = pipeline.submit_forced(data, size, ExecutionMode::MODE_TILED);

// Pause / Resume
pipeline.pause();
pipeline.resume();

// Drain all jobs
pipeline.drain(5000);

// Graceful shutdown
pipeline.stop();
```
Two formats available to integrate UnifiedFlow v3.0.1 PRO into your projects. Compiled for ARM64 (NVIDIA GB10 Grace Blackwell).
Dynamic loading at runtime. Ideal for systems sharing the library across multiple applications.
Embedded directly into the executable. Ideal for standalone deployment with no external dependencies.
ⓘ Platform: ARM64 (aarch64-linux-gnu) — Requires CUDA 12.0+ — C++17