# Puppet - Graph Algorithms [Ruby]

Status
PUBLISHED
Project
Puppet
https://github.com/puppetlabs/puppet
Language
Ruby
Tags
#graph #algorithm

Help Code Catalog grow: suggest your favorite code or weight in on open article proposals.

## Context

Puppet, an automated administrative engine for your Linux, Unix, and Windows systems, performs administrative tasks (such as adding users, installing packages, and updating server configurations) based on a centralized specification.

Puppet deals with resources: files, users, services, packages and so on. Resources are defined in manifest files.

By default, Puppet applies unrelated resources in the order in which they’re written in the manifest. If a resource must be applied before or after some other resource, declare a relationship between them to show that their order isn’t coincidental. You can also make changes in one resource cause a refresh of some other resource. See the Relationships and ordering page for more information.

## Problem

In what order should Puppet apply resources to respect the relationships? How can it detect and report circular dependencies between resources? How does it know everything a resource depends on, directly or indirectly, or everything that depends on it?

## Overview

Puppet represents relationships between resources as a graph with vertices being resources and edges being relationships. Then the problems above can be solved with standard graph algorithms.

In this article we review Puppet’s implementation of generic graph algorithms and don’t touch the application logic on top of them.

## Implementation details

### Graph definition

Generic graph algorithms are implemented in `SimpleGraph`, which is extended by more app-specific `RelationshipGraph` and `Catalog`.

The graph stores:

• `@in_to` - a map from vertices to incoming edges.
• `@out_from` - a map from vertices to outgoing edges.
• `@upstream_from` - a map from vertices to vertices it’s reachable from.
• `@downstream_from` - a map from vertices to reachable vertices.
``````  #
# All public methods of this class must maintain (assume ^ ensure) the following invariants, where "=~=" means
# equiv. up to order:
#
#   @in_to.keys =~= @out_to.keys =~= all vertices
#   @in_to.values.collect { |x| x.values }.flatten =~= @out_from.values.collect { |x| x.values }.flatten =~= all edges
#   @in_to[v1][v2] =~= @out_from[v2][v1] =~= all edges from v1 to v2
#   @in_to   [v].keys =~= vertices with edges leading to   v
#   @out_from[v].keys =~= vertices with edges leading from v
#   no operation may shed reference loops (for gc)
#   recursive operation must scale with the depth of the spanning trees, or better (e.g. no recursion over the set
#       of all vertices, etc.)
#
# This class is intended to be used with DAGs.  However, if the
# graph has a cycle, it will not cause non-termination of any of the
# algorithms.
#
def initialize
@in_to = {}
@out_from = {}
@upstream_from = {}
@downstream_from = {}
end
``````

### Adding and removing vertices and edges

``````  # Add a new vertex to the graph.
@in_to[vertex]    ||= {}
@out_from[vertex] ||= {}
end
``````

Removing a vertex. Besides removing the vertex from the maps, it removes all adjacent edges and also clears reachable vertices - more about this below.

``````  # Remove a vertex from the graph.
def remove_vertex!(v)
return unless vertex?(v)
@upstream_from.clear
@downstream_from.clear
(@in_to[v].values+@out_from[v].values).flatten.each { |e| remove_edge!(e) }
@in_to.delete(v)
@out_from.delete(v)
end
``````

Adding an edge. It adds both adjacent vertices, resets reachable vertices and updates the edge maps. The last 4 lines may look like dead code, but they are not - note the assignments to `@in_to` and `@out_to` in the right-hand sides.

``````  # Add a new edge.  The graph user has to create the edge instance,
# since they have to specify what kind of edge it is.
e = Puppet::Relationship.from_data_hash(e) if e.is_a?(Hash)
@upstream_from.clear
@downstream_from.clear
# Avoid multiple lookups here. This code is performance critical
arr = (@in_to[e.target][e.source] ||= [])
arr << e unless arr.include?(e)
arr = (@out_from[e.source][e.target] ||= [])
arr << e unless arr.include?(e)
end
``````
``````  # Remove an edge from our graph.
def remove_edge!(e)
if edge?(e.source,e.target)
@upstream_from.clear
@downstream_from.clear
@in_to   [e.target].delete e.source if (@in_to   [e.target][e.source] -= [e]).empty?
@out_from[e.source].delete e.target if (@out_from[e.source][e.target] -= [e]).empty?
end
end
``````

### Dependency closure

Finding all dependencies of a resource requires finding all vertices the resource is reachable from.

The comment appears to be wrong - it’s the definition of a dependent, not a dependency - but the code is correct.

``````  # Which resources depend upon the given resource.
def dependencies(resource)
vertex?(resource) ? upstream_from_vertex(resource).keys : []
end
``````

Simple recursive algorithm to do it. The results are caches in `@upstream_from`, which is reset whenever an edge is added or removed anywhere in the graph.

``````  def upstream_from_vertex(v)
return @upstream_from[v] if @upstream_from[v]
result = @upstream_from[v] = {}
@in_to[v].keys.each do |node|
result[node] = 1
result.update(upstream_from_vertex(node))
end
result
end
``````

Dependents are found the same way.

### Tarjan’s algorithm and finding cycles

Implementation of Tarjan’s strongly connected components algorithm. The implementation is very close to the description on Wikipedia. The pseudo-code presented on Wikipedia stores values, like `index` or `lowlink`, right on the node. Here it is not possible, so, as noted in the comments, the state is stored in a separate map.

``````  # This is a simple implementation of Tarjan's algorithm to find strongly
# connected components in the graph; this is a fairly ugly implementation,
# because I can't just decorate the vertices themselves.
#
# This method has an unhealthy relationship with the find_cycles_in_graph
# method below, which contains the knowledge of how the state object is
# maintained.
def tarjan(root, s)
# initialize the recursion stack we use to work around the nasty lack of a
# decent Ruby stack.
recur = [{ :node => root }]

while not recur.empty? do
frame = recur.last
vertex = frame[:node]

case frame[:step]
when nil then
s[:index][vertex]   = s[:number]
s[:number]          = s[:number] + 1

s[:stack].push(vertex)
s[:seen][vertex] = true

frame[:step]     = :children

when :children then
if frame[:children].length > 0 then
child = frame[:children].shift
if ! s[:index][child] then
# Never seen, need to recurse.
frame[:step] = :after_recursion
frame[:child] = child
recur.push({ :node => child })
elsif s[:seen][child] then
end
else
this_scc = []
loop do
top = s[:stack].pop
s[:seen][top] = false
this_scc << top
break if top == vertex
end
s[:scc] << this_scc
end
recur.pop               # done with this node, finally.
end

when :after_recursion then
frame[:step] = :children

else
fail "#{frame[:step]} is an unknown step"
end
end
end
``````

The strongly connected components are used to find cycles in the graph - any strongly connected component with more than one vertex contains a cycle. Cycles are also possible if a vertex declares a dependency on itself. If there are no circular dependencies, then the graph is a DAG (Directed Acyclic Graph).

``````  # Find all cycles in the graph by detecting all the strongly connected
# components, then eliminating everything with a size of one as
# uninteresting - which it is, because it can't be a cycle. :)
#
# This has an unhealthy relationship with the 'tarjan' method above, which
# it uses to implement the detection of strongly connected components.
def find_cycles_in_graph
state = {
:number => 0, :index => {}, :lowlink => {}, :scc => [],
:stack => [], :seen => {}
}

# we usually have a disconnected graph, must walk all possible roots
vertices.each do |vertex|
if ! state[:index][vertex] then
tarjan vertex, state
end
end

# To provide consistent results to the user, given that a hash is never
# assured to return the same order, and given our graph processing is
# based on hash tables, we need to sort the cycles internally, as well as
# the set of cycles.
#
# Given we are in a failure state here, any extra cost is more or less
# irrelevant compared to the cost of a fix - which is on a human
# time-scale.
state[:scc].select do |component|
multi_vertex_component?(component) || single_vertex_referring_to_self?(component)
end.map do |component|
component.sort
end.sort
end
``````

### Traversing the graph

Traversing a graph from `source`. Passing `direction` allows walking the graph both “up” and “down”. The comment appears to be wrong - it’s a depth-first search (DFS), not breadth-first (BFS). There’s an easy way to distinguish non-recursive DFS from BFS: DFS uses stacks (last-in-first-out) and BFS uses queues (first-in-first-out).

``````  # Just walk the tree and pass each edge.
def walk(source, direction)
# Use an iterative, breadth-first traversal of the graph. One could do
# this recursively, but Ruby's slow function calls and even slower
# recursion make the shorter, recursive algorithm cost-prohibitive.
stack = [source]
seen = Set.new
until stack.empty?
node = stack.shift
next if seen.member? node
connected = adjacent(node, :direction => direction)
connected.each do |target|
yield node, target
end
stack.concat(connected)
seen << node
end
end
``````

## Testing

The test suite is made up of a large number of very short and very clear tests. E.g. one of the tests for finding cycles:

``````    it "should report multi-vertex loops" do
add_edges :a => :b, :b => :c, :c => :a
expect(Puppet).to receive(:err).with(/Found 1 dependency cycle:\n\(Notify\[a\] => Notify\[b\] => Notify\[c\] => Notify\[a\]\)/)
cycle = @graph.report_cycles_in_graph.first
expect_cycle_to_include(cycle, :a, :b, :c)
end
``````

See our article about graph algorithms in Terraform. Terraform is another open-source infrastructure management tool, albeit its use cases are quite different than Puppet’s. It implements many of the same ideas for similar purposes, but in Go.