Overview
Today, the DeepCausality project announces UltraGraph v0.8, a ground-up rewrite of our hypergraph library delivering up to 1,300x speedups, enabling sub-second analytics on 100-million-node graphs, and making billion-node analytics economically feasible on a single machine.
This release introduces a new dual-state architecture designed for the complete lifecycle of graph data. Graphs begin in a flexible DynamicGraph state, optimized for fast, O(1) mutations as the structure evolves. When you’re ready for analysis, a single .freeze() call transforms the graph into a hyper-optimized, immutable CsmGraph based on a cache-friendly Struct of Arrays (SoA) memory layout. This “compilation” step is the key to our performance, virtually eliminating cache misses and unlocking near-linear scaling. If the graph needs to evolve further, simply .unfreeze().
The new implementation was born out of necessity. The ongoing work on emergent causality within the DeepCausality project demanded a system that could be both highly dynamic during evolution and blazing fast during analysis. Our previous petgraph-based foundation could not keep up any longer, so we built the next generation of UltraGraph, inspired by the state-of-the-art NWHypergraph (NWHy) architecture and heavily optimized for Rust’s memory model. The key elements of the new UltraGraph implementation are:
- The introduction of a SoA CsrAdjacency type instead of a simple CSR row.
- The separation of forward and backward CsrAdjacency.
- The addition of a new dual-state graph evolution lifecycle.
SoA CsrAdjacency
In conventional CSR-based graph representations (like NWHypergraph), adjacency information is typically packed together row-wise in a single structure per edge or neighbor, which is technically an Array of Structs (AoS) layout, or in simple terms, "rows of neighbors." UltraGraph takes a different approach and introduces a CsrAdjacency type that implements a Struct of Arrays (SoA) pattern:
#[derive(Default)]
pub(crate) struct CsrAdjacency<W> {
pub(crate) offsets: Vec<usize>,
pub(crate) targets: Vec<usize>,
pub(crate) weights: Vec<W>,
}
This layout splits each component of the adjacency data — offsets, targets, and weights — into separate, contiguous memory regions:
- offsets: Starting positions of each node’s adjacency list.
- targets: The target node indices for each edge.
- weights: Edge weights (optional, type-parametric).
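To make the access pattern concrete, the short, self-contained sketch below shows how a node's neighbors are read from these three arrays. It illustrates the general CSR pattern; the helper function and the example data are illustrative and not taken from the crate.

// Self-contained illustration of the CSR access pattern; not UltraGraph's internal code.
struct CsrAdjacency<W> {
    offsets: Vec<usize>, // offsets[n]..offsets[n + 1] delimits node n's adjacency list
    targets: Vec<usize>, // target node indices for all edges, stored contiguously
    weights: Vec<W>,     // one weight per edge, parallel to `targets`
}

// Hypothetical helper: the neighbors of `node` are one contiguous slice of `targets`.
fn neighbors(adj: &CsrAdjacency<u32>, node: usize) -> &[usize] {
    let start = adj.offsets[node];   // a single O(1) offset lookup ...
    let end = adj.offsets[node + 1];
    &adj.targets[start..end]         // ... followed by a linear scan over contiguous memory
}

fn main() {
    // Tiny example graph: 0 -> 1 (weight 5), 0 -> 2 (weight 7), 1 -> 2 (weight 3).
    let adj = CsrAdjacency {
        offsets: vec![0, 2, 3, 3],
        targets: vec![1, 2, 2],
        weights: vec![5u32, 7, 3],
    };
    assert_eq!(neighbors(&adj, 0), &[1, 2]);
    assert_eq!(neighbors(&adj, 1), &[2]);
    assert!(neighbors(&adj, 2).is_empty());
    println!("node 0 neighbors: {:?}, first edge weight: {}", neighbors(&adj, 0), adj.weights[0]);
}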
This layout has two critical advantages:
- Better cache utilization: When performing traversal or shortest-path algorithms, the CPU can stream just the fields it needs (often offsets and targets) without pulling in unnecessary weight data, avoiding cache pollution and improving locality.
- SIMD-friendliness: SoA layouts enable vectorized processing (e.g., with AVX) far more easily than AoS layouts.
Benchmarks show up to 1,300x speedups compared to the previous implementation, and memory usage appears to approach the theoretical minimum, which makes billion-node graph analytics feasible on a single commodity workstation instead of a cluster.
Forward and Backward CsrAdjacency
Moreover, UltraGraph uses two separate CsrAdjacency instances: one for outbound (successor) edges and one for inbound (predecessor) edges. This dual-CSR setup is more explicit and efficient than mixing both directions within a single row layout: it reduces CPU cache pollution and directly supports fast, efficient algorithm implementations. It is worth noting that some CSR systems store only forward edges and reconstruct backward edges on the fly, which conserves memory but is computationally inefficient. UltraGraph deliberately trades a bit more memory for drastically better algorithm performance, as shown in the benchmarks.
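To illustrate the dual-CSR idea, the sketch below shows a frozen graph holding one CsrAdjacency per direction, reusing the CsrAdjacency type from the previous section. The struct and field names are assumptions for illustration, not the actual definition of CsmGraph.

// Illustrative sketch only; names are assumed and do not mirror CsmGraph's real layout.
pub(crate) struct FrozenGraphSketch<W> {
    forward: CsrAdjacency<W>,  // outbound edges: successors, used for forward traversal
    backward: CsrAdjacency<W>, // inbound edges: predecessors, used for backtracking
}

Because predecessors are stored explicitly, a backtracking step is the same cheap offset lookup as a forward step, rather than a scan over the entire forward CSR.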
The backward adjacency is particularly useful in causality-based inference algorithms, where backtracking is often required, and is thus a natural fit for DeepCausality. Memory usage remains low due to the combined effects of the Struct of Arrays layout and the clean separation between forward and backward adjacency. This design leads to a predictable, flat memory layout with minimal overhead:
- No per-node allocation overhead
- No padding, no vtables, no boxed pointers
- No HashMaps or linked lists
- No index structures; a node's adjacency list is located by simple offset arithmetic
- A flat, predictable Struct of Arrays (SoA) layout that the CPU prefetcher can exploit
- Zero allocation for nodes with no outbound edges or weights
- Near-zero memory fragmentation
Memory fragmentation is largely prevented by the freeze/unfreeze operation in the graph evolution lifecycle. Calling .freeze() compacts and linearizes the structure, which removes any allocation gaps left over from the dynamic phase and results in a clean, contiguous memory layout.
The New Graph Evolution Lifecycle
The single biggest change in UltraGraph 0.8 is its new dual-state architecture. We recognize that graph-based systems have two distinct phases: a dynamic “Evolve” phase, where the structure is built and modified, and a stable “Analyze” phase, where high-speed queries are essential.
- The DynamicGraph State: This is now the default state for every new graph. It's an adjacency-list-based structure optimized for flexibility. Adding nodes and edges is a cheap O(1) operation, perfect for systems where the graph structure emerges dynamically over time.
- The Frozen State: When you're ready for analysis, you call .freeze(). This is a one-time "compilation" step that transforms your graph into a hyper-optimized, immutable Compressed Sparse Row (CSR) format. This state is designed for one thing: raw speed.
This new lifecycle is our answer to the challenges of emergent causality. It provides a controlled, predictable way to transition between a state of evolution and a state of high-performance analysis. The best part? When your frozen graph needs to be modified, you call .unfreeze(), and your graph structure can evolve further.
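To make the lifecycle concrete, here is a hedged sketch of how the evolve/freeze/analyze/unfreeze flow might look in application code. Only freeze(), unfreeze(), add_node(), and is_reachable() appear elsewhere in this post; the import path, constructor, add_edge signature, and return types are assumptions for illustration and may differ from the crate's actual API.

// Hedged sketch of the dual-state lifecycle; the constructor, import path, and exact
// signatures are assumptions, not the crate's verified API.
use ultragraph::*; // assumed import path

fn lifecycle_sketch() {
    // Evolve phase: the graph starts in the DynamicGraph state,
    // where adding nodes and edges is a cheap O(1) mutation.
    let mut g = UltraGraph::new();    // assumed constructor
    let a = g.add_node("root cause"); // add_node appears in the benchmarks below
    let b = g.add_node("effect");
    g.add_edge(a, b);                 // assumed signature

    // Analyze phase: a one-time "compilation" into the immutable, CSR-backed state.
    g.freeze();
    let reachable = g.is_reachable(a, b); // one of the algorithms listed later in this post
    println!("a -> b reachable: {reachable:?}");

    // Back to the evolve phase whenever the structure needs to change again.
    g.unfreeze();
    let c = g.add_node("new effect");
    g.add_edge(b, c);
}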
Performance That Speaks for Itself
All benchmarks were completed on a 2023 MacBook Pro with an M3 Max CPU.
Dynamic Graph
The dynamic graph structure, when the graph is in an unfrozen state, is optimized for efficient mutation. The table below summarizes the performance of the key operations.
Benchmark Name | Graph Size | Operation | Estimated Time (Median) | Outliers Detected |
---|---|---|---|---|
small_add_node | 10 | add_node | 29.099 ns | 14% (14 / 100) |
medium_add_node | 100 | add_node | 45.864 ns | 12% (12 / 100) |
large_add_node | 1,000 | add_node | 39.293 ns | 11% (11 / 100) |
small_get_node | 10 | get_node | 3.9417 ns | 8% (8 / 100) |
medium_get_node | 100 | get_node | 3.9849 ns | 2% (2 / 100) |
large_get_node | 1,000 | get_node | 3.9916 ns | 7% (7 / 100) |
- SMALL = 10;
- MEDIUM = 100;
- LARGE = 1,000;
Benchmark source code in ultragraph/benches
Static CSM Graph
The impact of the new dual-state architecture is most visible when comparing the new CsmGraph implementation against our previous MatrixGraph-based engine. By freezing a graph into a stable CsmGraph, we eliminate the CPU cache misses inherent to traditional, flexible graph structures, thereby allowing the CPU to operate with maximum efficiency. The following table compares the performance of graph-based reasoning algorithms in DeepCausality before and after the UltraGraph rewrite. The "After" benchmarks were run on a frozen CsmGraph, which leverages a highly efficient Compressed Sparse Row (CSR) memory layout.
Benchmark | Time Before (Old) | Time After (New) | Improvement Factor |
---|---|---|---|
small_linear_graph_reason_all_causes | 2.760 µs | 78.79 ns | ~35x |
small_linear_graph_reason_subgraph_from_cause | 1.507 µs | 52.41 ns | ~28x |
small_linear_graph_reason_shortest_path | 1.690 µs | 120.19 ns | ~14x |
medium_linear_graph_reason_all_causes | 509.940 µs | 5.23 µs | ~97x |
medium_linear_graph_reason_subgraph_from_cause | 245.250 µs | 2.63 µs | ~93x |
medium_linear_graph_reason_shortest_path | 286.080 µs | 4.35 µs | ~65x |
large_linear_graph_reason_all_causes | 70.221 ms | 51.70 µs | ~1,358x |
large_linear_graph_reason_subgraph_from_cause | 34.933 ms | 25.79 µs | ~1,354x |
large_linear_graph_reason_shortest_path | 35.424 ms | 43.80 µs | ~808x |
small_multi_layer_graph_reason_all_causes | 1.248 µs | 43.59 ns | ~28x |
small_multi_layer_graph_reason_subgraph_from_cause | 489.420 ns | 32.02 ns | ~15x |
small_multi_layer_graph_reason_shortest_path | 427.450 ns | 62.99 ns | ~7x |
Average Speedup across all use cases: ~300x
- SMALL = 10;
- MEDIUM = 1,000;
- LARGE = 10,000;
Benchmark source code in deep_causality/benches
The new architecture delivers its largest and most significant performance gains for algorithms running over large graphs (10k or more nodes) because of its close alignment with contemporary hardware. By combining an O(1) offset lookup with a perfectly linear scan over a node's neighbors, UltraGraph creates the ideal scenario for the CPU's prefetcher: a straight-line sprint through memory. The effect grows with the amount of data the prefetcher can load ahead of time, hence the disproportionate gains on larger graphs.
Memory Usage and Scaling
Number of Nodes | Memory Usage | evaluate_subgraph_from_root | evaluate_shortest_path | evaluate_single_cause |
---|---|---|---|---|
100,000 | 55 MB | 0.68 ms | 0.57 ms | 5.4 ns |
1,000,000 | 350 MB | 11.12 ms | 6.95 ms | 5.5 ns |
10,000,000 | 3 GB | 114 ms | 85.80 ms | 5.6 ns |
100,000,000 | 32 GB | 1.23 s | 0.98 s | 5.5 ns |
Key Observations:
- Constant time to get a single node: The evaluate_single_cause benchmark always takes about 5.5 ns, regardless of whether the node lookup happens before or during the benchmark loop and regardless of whether black_box is used. The time does not change with the size of the graph because the underlying get_node performs just two O(1) array lookups to find the index, followed by a direct read from a virtual memory address; in this case, the result is close to the physical limit of Apple's UMA architecture. On other architectures the exact value may differ, but it should generally remain constant, bound only by the bandwidth and latency of physical memory. A hedged sketch of this two-step lookup follows after this list.
- Sub-second traversal up to a hundred million nodes: Note that these benchmarks deliberately use a linear graph as a baseline to estimate the best-case scenario; on complex or imbalanced graphs you will see different results. The key point is that a linear graph plays exactly to the strength of the contiguous memory layout created by the underlying CSR structure.
- Near-linear scalability: Both the memory usage and the execution time for the subgraph and shortest_path tasks appear to scale roughly linearly with the number of nodes. A 10x increase in nodes results in a roughly 10x-15x increase in time and memory, which indicates a highly efficient implementation with a complexity close to O(N).
- All benchmarks are single-threaded: The performance you see is from a single core. Initial experiments showed that for graphs up to 1 million nodes, the overhead of even highly optimized parallel libraries like rayon resulted in a net performance loss of 30% or more compared to the single-threaded version. This is a testament to the extreme efficiency of the CSR layout when paired with modern CPU caches and prefetchers. The results suggest that meaningful gains from concurrency will only appear on massive graphs (likely 10M-50M nodes and above). However, this requires a concurrency model carefully designed to avoid the cache-invalidation issues common in work-stealing schedulers (used by rayon and Tokio). Developing such a cache-friendly concurrency model was out of scope for this release, but it remains an open challenge for future work. Contributions are welcome.
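As promised above, here is a self-contained sketch of the two-step lookup pattern described in the first observation: one array read to map a stable node id to its position, and one indexed read from a contiguous node array. The type and field names are hypothetical and do not reflect UltraGraph's actual get_node implementation.

// Illustrative only: two O(1) array lookups, not UltraGraph's real internals.
struct NodeStore<T> {
    index_map: Vec<usize>, // stable node id -> position in `nodes`
    nodes: Vec<T>,         // node payloads in one contiguous allocation
}

impl<T> NodeStore<T> {
    fn get_node(&self, id: usize) -> Option<&T> {
        let pos = *self.index_map.get(id)?; // first O(1) lookup: resolve the position
        self.nodes.get(pos)                 // second O(1) lookup: read the payload
    }
}

fn main() {
    let store = NodeStore { index_map: vec![0, 1], nodes: vec!["a", "b"] };
    assert_eq!(store.get_node(1), Some(&"b"));
    assert_eq!(store.get_node(9), None); // out-of-range ids resolve to None, no panic
    println!("{:?}", store.get_node(0));
}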
Based on UltraGraph’s performance, a simple interpolation shows:
- 1 Billion Nodes: A shortest-path query on a graph of this size would complete in about 12 seconds and require approximately 236 GB of RAM.
- 10 Billion Nodes: Would require just under 2 TB of RAM and could run a shortest-path query in about 2.5 minutes.
Economic Impact
This level of performance on a single machine has profound economic implications, best demonstrated by an estimate of the Total Cost of Ownership (TCO) of running a 10-billion-node graph on major cloud providers and on bare metal.
Assumptions:
- Instance Type: AWS u-3tb1.56xlarge (2.9 TiB RAM) vs. GCP m1-megamem-96 (2.9 TiB RAM).
- Region: us-east-1 (N. Virginia) for AWS and us-central1 (Iowa) for GCP, which are typically cost-effective.
- Commitment: 1-Year Committed Plan (AWS Savings Plan / GCP CUD), as this is the standard for any persistent workload and offers significant savings over on-demand pricing.
- Storage: 500 GB of high-performance SSD storage (gp3 on AWS, pd-ssd on GCP) for the OS.
Scenario 1: Persistent Workload TCO (1-Year Commitment)
Cost Component | AWS (u-3tb1.56xlarge) | GCP (m1-megamem-96) | Bare Metal (Dedicated Server) |
---|---|---|---|
Commitment Plan | 1-Year Savings Plan | 1-Year Committed Use Discount | 1-Year Contract |
Effective Hourly Rate | ~$10.99 | ~$10.57 | ~$7.53 (Calculated) |
Monthly Billed Cost | ~$8,071 | ~$7,813 | ~$5,542 |
Total Annual Billed Cost | ~$96,852 | ~$93,756 | ~$66,504 |
Scenario 2: Temporary Workload TCO (On-Demand / Monthly Rate)
Cost Component | AWS (u-3tb1.56xlarge) | GCP (m1-megamem-96) | Bare Metal (Dedicated Server) |
---|---|---|---|
Commitment Plan | On-Demand | On-Demand | Monthly (No Contract) |
Effective Hourly Rate | ~$28.77 | ~$25.84 | ~$9.32 (Calculated) |
Cost for 1 Month (730 hrs) | ~$21,002 | ~$18,863 | ~$6,800 |
Cost for 3 Months | ~$63,006 | ~$56,589 | ~$20,400 |
The most important takeaway is that, for an annual cost between roughly $67,000 and $97,000, you can analyze a 10-billion-node graph. For a one-month project, say a proof of concept, a bare-metal server costs roughly one-third of what an on-demand cloud instance would cost. Any Series A-funded startup can now at least explore large-scale graph analytics without overspending on engineering, hardware, or expensive license fees for commercial solutions.
Graph Algorithms
The UltraGraph crate provides a selection of high-performance algorithms for graph analysis. These algorithms are implemented on the static, optimized graph structure for fast and efficient computation.
🔄 find_cycle()
Use Case:
- Dependency resolution: Detect circular dependencies in software build systems (e.g., Cargo, Bazel).
- Process control: Ensure process chains are acyclic.
❓ has_cycle()
Use Case:
- Workflow engines: Verify Directed Acyclic Graph (DAG) constraints in ETL jobs.
- Graph validation: Ensure schema or data pipelines conform to top-down execution semantics.
🔃 topological_sort()
Use Case:
- Build systems / Task scheduling: Order steps where some tasks depend on others (compilers, project planners).
- Data science pipelines: Stage DAG-based machine learning workflows (e.g., feature extraction → training → evaluation).
📡 is_reachable(start_index, stop_index)
Use Case:
- Access control: Can User A reach Resource B? (e.g., network ACLs or graph-based permission models).
- Social graph analysis: Check if two users are connected via follower/following chains.
📏 shortest_path_len(start_index, stop_index)
Use Case:
- Performance modeling: Compute minimal latency paths in distributed systems or manufacturing processes.
- Simplified heuristics: Estimate costs in search/AI agents without computing full paths.
🛣️ shortest_path(start_index, stop_index)
Use Case:
- Navigation/routing: Real-time route finding (e.g., maps, logistics).
- Interactive systems: Trace dependency or influence paths between data nodes (e.g., in causal graphs).
📉 shortest_weighted_path(start_index, stop_index)
Use Case:
- Cost-aware routing: Find paths in weighted graphs like transport networks, financial transaction flows, or workflow runtimes.
- Risk modeling: Choose minimum-exposure routes in threat graphs or attack trees.
🧩 strongly_connected_components()
Use Case:
- Community detection: Find user clusters in social graphs or discussion networks.
- Deadlock/root cause detection: In concurrent systems or process management (e.g., OS schedulers, container orchestration).
🧠 betweenness_centrality()
Use Case:
- Influencer detection: Identify key nodes in social, transportation, or communication networks.
- Bottleneck analysis: Discover chokepoints in network infrastructure or data processing graphs.
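As a rough sketch of how a few of these algorithms compose on a frozen graph, the snippet below validates a DAG and then routes between two nodes. The method names come from the list above; the graph type parameter, return types, and overall signatures are assumptions for illustration and may differ from the crate's actual API.

// Hedged sketch: method names are from the list above; the type parameter and the
// assumption that these calls return plain values (rather than Results) may not match the crate.
fn validate_and_route(g: &CsmGraph<u64>, start: usize, stop: usize) {
    // Reject cyclic structures before scheduling any work on them.
    if g.has_cycle() {
        println!("cycle detected: {:?}", g.find_cycle());
        return;
    }
    // Order tasks so that every dependency runs before its dependents.
    println!("execution order: {:?}", g.topological_sort());

    // Route between two nodes only if a path exists at all.
    if g.is_reachable(start, stop) {
        println!("path: {:?}", g.shortest_path(start, stop));
        println!("hops: {:?}", g.shortest_path_len(start, stop));
        println!("weighted: {:?}", g.shortest_weighted_path(start, stop));
    }
}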
What This Means for DeepCausality
This new version of UltraGraph is the engine powering the next version of DeepCausality. It provides the foundation to:
- Scale to Massive Graphs: Analyze systems at a scale that was previously impossible due to memory constraints.
- Model Emergent Systems: Directly support the “grow as you go” nature of dynamic causality.
- Build with Confidence: Leverage a stable, performant, and dependency-free core for mission-critical applications.
Conclusion
UltraGraph 0.8 offers unprecedented speed for hypergraph analytics on larger graphs. Try it. The Future is now.
About
DeepCausality is a hyper-geometric computational causality library that enables fast and deterministic context-aware causal reasoning in Rust. The DeepCausality project is hosted at the Linux Foundation for Artificial Intelligence and Data (LF AI & Data). Learn more about DeepCausality on GitHub and join the DeepCausality-Announce Mailing List.
The LF AI & Data Foundation supports an open artificial intelligence (AI) and data community and drives open-source innovation in the AI and data domains by enabling collaboration and the creation of new opportunities for all members of the community. For more information, please visit lfaidata.foundation.
The author and maintainer of the DeepCausality project, Marvin Hansen, is the director of Emet-Labs, a FinTech research company specializing in applying advanced computational causality to financial markets.