UltraGraph 0.8: 1,300× Faster Graph Analytics — No Cluster Needed

Overview

Today, the DeepCausality project announces UltraGraph v0.8, a ground-up rewrite of our hypergraph library delivering up to 1,300x speedups, enabling sub-second analytics on 100-million-node graphs, and making billion-node analytics economically feasible on a single machine.

This release introduces a new dual-state architecture designed for the complete lifecycle of graph data. Graphs begin in a flexible DynamicGraph state, optimized for fast, O(1) mutations as the structure evolves. When you’re ready for analysis, a single .freeze() call transforms the graph into a hyper-optimized, immutable CsmGraph based on a cache-friendly Struct of Arrays (SoA) memory layout. This “compilation” step is the key to our performance, virtually eliminating cache misses and unlocking near-linear scaling. If the graph needs to evolve further, simply .unfreeze().

The new implementation was born out of necessity. The ongoing work on emergent causality within the DeepCausality project demanded a system that could be both highly dynamic during evolution and blazing fast during analysis. Our previous petgraph-based foundation could not keep up any longer, so we built the next generation of UltraGraph, inspired by the state-of-the-art NWHypergraph (NWHy) architecture and heavily optimized for Rust’s memory model. The key elements of the new UltraGraph implementation are:

SoA CsrAdjacency

In conventional CSR-based graph representations (like NWHypergraph), adjacency information is typically packed together row-wise in a “single structure” per edge or neighbor, which is technically an Array of Structs (AoS) layout, or in simple terms, “rows of neighbors.” UltraGraph, however, takes a different approach. UltraGraph introduces a CsrAdjacency type that implements a Struct of Arrays (SoA) pattern:

#[derive(Default)]
pub(crate) struct CsrAdjacency<W> {
    pub(crate) offsets: Vec<usize>,
    pub(crate) targets: Vec<usize>,
    pub(crate) weights: Vec<W>,
}

This layout splits each component of the adjacency data — offsets, targets, and weights — into separate, contiguous memory regions:

Benchmarks show up to 1,300× speedups compared to the previous implementation, and memory usage seems to approach the theoretical limit, with the implication that billion-node graph analytics is now possible on a single commodity workstation instead of a cluster.

This has two critical advantages:

Forward and Backward CsrAdjacency

Moreover, UltraGraph uses two separate CsrAdjacency instances: one for successor or outbound edges and another one for backward or inbound edges. This dual-CSR setup is more explicit and efficient than mixing directions within a single row layout because it reduces CPU cache pollution and thereby directly supports fast and efficient algorithm implementations. It is worth noting that some CSR systems only store forward edges and reconstruct backward edges on the fly, which conserves memory but is computationally inefficient. UltraGraph deliberately traded a bit more memory for drastically better algorithm performance, as shown in the benchmarks.

The backward node list is particularly useful in causality-based inference algorithms, where backtracking is often required, and is thus particularly well suited for DeepCausality. Memory usage remains low due to the combined effects of the Struct of Arrays layout and the clean separation between forward and backward adjacency. This design leads to a predictable, flat memory layout with minimal overhead:

Memory fragmentation is largely prevented because of the freeze/unfreeze operation in the graph evolution lifecycle. Calling .freeze() compacts and linearizes the structure, which removes any prior allocation gaps from the dynamic phase and thus results in a clean, continuous memory structure.

The New Graph Evolution Lifecycle

The single biggest change in UltraGraph 0.8 is its new dual-state architecture. We recognize that graph-based systems have two distinct phases: a dynamic “Evolve” phase, where the structure is built and modified, and a stable “Analyze” phase, where high-speed queries are essential.

  1. The DynamicGraph State: This is now the default state for every new graph. It’s an adjacency-list-based structure optimized for flexibility. Adding nodes and edges is a cheap O(1) operation, perfect for systems where the graph structure emerges dynamically over time.

  2. The Frozen State: When you’re ready for analysis, you call .freeze(). This is a one-time “compilation” step that transforms your graph into a hyper-optimized, immutable Compressed Sparse Row (CSR) format. This state is designed for one thing: raw speed.

This new lifecycle is our answer to the challenges of emergent causality. It provides a controlled, predictable way to transition between a state of evolution and a state of high-performance analysis. The best part? When your frozen graph needs to be modified, you call .unfreeze(), and your graph structure can evolve further.

Performance That Speaks for Itself

All benchmarks were completed on a 2023 Macbook Pro with a M3 Max CPU.

Dynamic Graph

The dynamic graph structure, when the graph is in an unfrozen state, is optimized for efficient mutation. The table below summarizes the performance of the key operations.

Benchmark Name Graph Size Operation Estimated Time (Median) Outliers Detected
small_add_node 10 add_node 29.099 ns 14% (14 / 100)
medium_add_node 100 add_node 45.864 ns 12% (12 / 100)
large_add_node 1,000 add_node 39.293 ns 11% (11 / 100)
small_get_node 10 get_node 3.9417 ns 8% (8 / 100)
medium_get_node 100 get_node 3.9849 ns 2% (2 / 100)
large_get_node 1,000 get_node 3.9916 ns 7% (7 / 100)

Benchmark source code in ultragraph/benches

Static CSM Graph

The impact of the new dual-state architecture is most visible when comparing the new CsmGraph implementation against our previous MatrixGraph-based engine. By freezing a graph into a stable CsmGraph, we eliminate CPU cache misses inherent to traditional, flexible graph structures, thereby allowing the CPU to operate with maximum efficiency. The following table compares the performance of graph-based reasoning algorithms in DeepCausality before and after the UltraGraph rewrite. The “After” benchmarks were run on a frozen CsmGraph, which leverages a highly efficient Compressed Sparse Row (CSR) memory layout.

Benchmark Time Before (Old) Time After (New) Improvement Factor
small_linear_graph_reason_all_causes 2.760 µs 78.79 ns ~35x
small_linear_graph_reason_subgraph_from_cause 1.507 µs 52.41 ns ~28x
small_linear_graph_reason_shortest_path 1.690 µs 120.19 ns ~14x
medium_linear_graph_reason_all_causes 509.940 µs 5.23 µs ~97x
medium_linear_graph_reason_subgraph_from_cause 245.250 µs 2.63 µs ~93x
medium_linear_graph_reason_shortest_path 286.080 µs 4.35 µs ~65x
large_linear_graph_reason_all_causes 70.221 ms 51.70 µs ~1,358x
large_linear_graph_reason_subgraph_from_cause 34.933 ms 25.79 µs ~1,354x
large_linear_graph_reason_shortest_path 35.424 ms 43.80 µs ~808x
small_multi_layer_graph_reason_all_causes 1.248 µs 43.59 ns ~28x
small_multi_layer_graph_reason_subgraph_from_cause 489.420 ns 32.02 ns ~15x
small_multi_layer_graph_reason_shortest_path 427.450 ns 62.99 ns ~7x

Average Speedup across all use cases: ~300x

Benchmark source code in deep_causality/benches

The new architecture causes the largest and most significant performance gains for algorithms running over large graphs (10k or more nodes) because of its close alignment with contemporary hardware. By combining an instantaneous O(1) lookup with a perfectly linear scan over a node’s neighbors, Ultragraph creates the ideal scenario for the CPU’s prefetcher to easily anticipates a straight-line sprint through memory. The result becomes more notable the more data the prefetcher can load ahead of time, thus the disproportional performance gains on larger graphs.

Memory Usage and Scaling

Number of Nodes Memory Usage evaluate_subgraph_from_root evaluate_shortest_path evaluate_single_cause
100,000 55 MB 0.68 ms 0.57 ms 5.4 ns
1,000,000 350 MB 11.12 ms 6.95 ms 5.5 ns
10,000,000 3 GB 114 ms 85.80 ms 5.6 ns
100,000,000 32 GB 1.23 s 0.98 s 5.5 ns
l
Key Observations:

Based on Ultragraph’s performance, a simple interpolation shows:

Economic Impact

This level of performance on a single machine has profound economic implications, which is best demonstrated by an estimate of the Total Cost of Ownership (TCO) for running a 10-billion node graph on major cloud providers and bare metal.

Assumptions:

Scenario 1: Persistent Workload TCO (1-Year Commitment)

Cost Component AWS (u-3tb1.56xlarge) GCP (m1-megamem-96) Bare Metal (Dedicated Server)
Commitment Plan 1-Year Savings Plan 1-Year Committed Use Discount 1-Year Contract
Effective Hourly Rate ~$10.99 ~$10.57 ~$7.53 (Calculated)
Monthly Billed Cost ~$8,071 ~$7,813 ~$5,542
Total Annual Billed Cost ~$96,852 ~$93,756 ~$66,504

Scenario 2: Temporary Workload TCO (On-Demand / Monthly Rate)

Cost Component AWS (u-3tb1.56xlarge) GCP (m1-megamem-96) Bare Metal (Dedicated Server)
Commitment Plan On-Demand On-Demand Monthly (No Contract)
Effective Hourly Rate ~$28.77 ~$25.84 ~$9.32 (Calculated)
Cost for 1 Month (730 hrs) ~$21,002 ~$18,863 ~$6,800
Cost for 3 Months ~$63,006 ~$56,589 ~$20,400

The most important takeaway is that for an annual cost between $67,000 and $95,000 you can analyze a 10-billion node graph. For a one-month long project, say a proof of concept, a bare-metal server costs roughly one-third of what an on-demand cloud instance would cost. Any series A funded startup can now at least explore large scale graph analytics without over spending on engineering, hardware or expensive license fees for commercial solutions.

Graph Algorithms

The UltraGraph crate provides a selection of high-performance algorithms for graph analysis. These algorithms are implemented on the static, optimized graph structure for fast and efficient computation.

🔄 find_cycle()

Use Case:

has_cycle()

Use Case:

🔃 topological_sort()

Use Case:

📡 is_reachable(start_index, stop_index)

Use Case:

📏 shortest_path_len(start_index, stop_index)

Use Case:

🛣️ shortest_path(start_index, stop_index)

Use Case:

📉 shortest_weighted_path(start_index, stop_index)

Use Case:

🧩 strongly_connected_components()

Use Case:

🧠 betweenness_centrality()

Use Case:

What This Means for DeepCausality

This new version of UltraGraph is the engine powering the next version of DeepCausality. It provides the foundation to:

Conclusion

UltraGraph 0.8 offers unprecedented speed for hypergraph analytics on larger graphs. Try it. The Future is now.

About

DeepCausality is a hyper-geometric computational causality library that enables fast and deterministic context-aware causal reasoning in Rust. The DeepCausality project is hosted at the Linux Foundation for Artificial Intelligence and Data (LF AI & Data). Learn more about DeepCausality on GitHub and join the DeepCausality-Announce Mailing List.

The LF AI & Data Foundation supports an open artificial intelligence (AI) and data community and drives open-source innovation in the AI and data domains by enabling collaboration and the creation of new opportunities for all members of the community. For more information, please visit lfaidata.foundation.

The author and maintainer of the DeepCausality project, Marvin Hansen, is the director of Emet-Labs, a FinTech research company specializing in applying advanced computational causality to financial markets.