Puppet - Graph Algorithms [Ruby]

Status: PUBLISHED
Project: Puppet
Project home page: https://github.com/puppetlabs/puppet
Language: Ruby
Tags: #graph #algorithm

Help Code Catalog grow: suggest your favorite code or weight in on open article proposals.

Table of contents

Context
Problem
Overview
Implementation details
Testing
Related
References
Copyright notice

Context

Puppet, an automated administrative engine for your Linux, Unix, and Windows systems, performs administrative tasks (such as adding users, installing packages, and updating server configurations) based on a centralized specification.

Puppet deals with resources: files, users, services, packages and so on. Resources are defined in manifest files.

By default, Puppet applies unrelated resources in the order in which they’re written in the manifest. If a resource must be applied before or after some other resource, declare a relationship between them to show that their order isn’t coincidental. You can also make changes in one resource cause a refresh of some other resource. See the Relationships and ordering page for more information.

Problem

In what order should Puppet apply resources to respect the relationships? How can it detect and report circular dependencies between resources? How does it know everything a resource depends on, directly or indirectly, or everything that depends on it?

Overview

Puppet represents relationships between resources as a graph with vertices being resources and edges being relationships. Then the problems above can be solved with standard graph algorithms.

In this article we review Puppet’s implementation of generic graph algorithms and don’t touch the application logic on top of them.

Implementation details

Graph definition

Generic graph algorithms are implemented in SimpleGraph, which is extended by more app-specific RelationshipGraph and Catalog.

The graph stores:

@in_to - a map from vertices to incoming edges.
@out_from - a map from vertices to outgoing edges.
@upstream_from - a map from vertices to vertices it’s reachable from.
@downstream_from - a map from vertices to reachable vertices.

  #
  # All public methods of this class must maintain (assume ^ ensure) the following invariants, where "=~=" means
  # equiv. up to order:
  #
  #   @in_to.keys =~= @out_to.keys =~= all vertices
  #   @in_to.values.collect { |x| x.values }.flatten =~= @out_from.values.collect { |x| x.values }.flatten =~= all edges
  #   @in_to[v1][v2] =~= @out_from[v2][v1] =~= all edges from v1 to v2
  #   @in_to   [v].keys =~= vertices with edges leading to   v
  #   @out_from[v].keys =~= vertices with edges leading from v
  #   no operation may shed reference loops (for gc)
  #   recursive operation must scale with the depth of the spanning trees, or better (e.g. no recursion over the set
  #       of all vertices, etc.)
  #
  # This class is intended to be used with DAGs.  However, if the
  # graph has a cycle, it will not cause non-termination of any of the
  # algorithms.
  #
  def initialize
    @in_to = {}
    @out_from = {}
    @upstream_from = {}
    @downstream_from = {}
  end

Adding and removing vertices and edges

Adding a vertex:

  # Add a new vertex to the graph.
  def add_vertex(vertex)
    @in_to[vertex]    ||= {}
    @out_from[vertex] ||= {}
  end

Removing a vertex. Besides removing the vertex from the maps, it removes all adjacent edges and also clears reachable vertices - more about this below.

  # Remove a vertex from the graph.
  def remove_vertex!(v)
    return unless vertex?(v)
    @upstream_from.clear
    @downstream_from.clear
    (@in_to[v].values+@out_from[v].values).flatten.each { |e| remove_edge!(e) }
    @in_to.delete(v)
    @out_from.delete(v)
  end

Adding an edge. It adds both adjacent vertices, resets reachable vertices and updates the edge maps. The last 4 lines may look like dead code, but they are not - note the assignments to @in_to and @out_to in the right-hand sides.

  # Add a new edge.  The graph user has to create the edge instance,
  # since they have to specify what kind of edge it is.
  def add_edge(e,*a)
    return add_relationship(e,*a) unless a.empty?
    e = Puppet::Relationship.from_data_hash(e) if e.is_a?(Hash)
    @upstream_from.clear
    @downstream_from.clear
    add_vertex(e.source)
    add_vertex(e.target)
    # Avoid multiple lookups here. This code is performance critical
    arr = (@in_to[e.target][e.source] ||= [])
    arr << e unless arr.include?(e)
    arr = (@out_from[e.source][e.target] ||= [])
    arr << e unless arr.include?(e)
  end

Removing an edge:

  # Remove an edge from our graph.
  def remove_edge!(e)
    if edge?(e.source,e.target)
      @upstream_from.clear
      @downstream_from.clear
      @in_to   [e.target].delete e.source if (@in_to   [e.target][e.source] -= [e]).empty?
      @out_from[e.source].delete e.target if (@out_from[e.source][e.target] -= [e]).empty?
    end
  end

Dependency closure

Finding all dependencies of a resource requires finding all vertices the resource is reachable from.

The comment appears to be wrong - it’s the definition of a dependent, not a dependency - but the code is correct.

  # Which resources depend upon the given resource.
  def dependencies(resource)
    vertex?(resource) ? upstream_from_vertex(resource).keys : []
  end

Simple recursive algorithm to do it. The results are caches in @upstream_from, which is reset whenever an edge is added or removed anywhere in the graph.

  def upstream_from_vertex(v)
    return @upstream_from[v] if @upstream_from[v]
    result = @upstream_from[v] = {}
    @in_to[v].keys.each do |node|
      result[node] = 1
      result.update(upstream_from_vertex(node))
    end
    result
  end

Dependents are found the same way.

Tarjan’s algorithm and finding cycles

Implementation of Tarjan’s strongly connected components algorithm. The implementation is very close to the description on Wikipedia. The pseudo-code presented on Wikipedia stores values, like index or lowlink, right on the node. Here it is not possible, so, as noted in the comments, the state is stored in a separate map.

  # This is a simple implementation of Tarjan's algorithm to find strongly
  # connected components in the graph; this is a fairly ugly implementation,
  # because I can't just decorate the vertices themselves.
  #
  # This method has an unhealthy relationship with the find_cycles_in_graph
  # method below, which contains the knowledge of how the state object is
  # maintained.
  def tarjan(root, s)
    # initialize the recursion stack we use to work around the nasty lack of a
    # decent Ruby stack.
    recur = [{ :node => root }]

    while not recur.empty? do
      frame = recur.last
      vertex = frame[:node]

      case frame[:step]
      when nil then
        s[:index][vertex]   = s[:number]
        s[:lowlink][vertex] = s[:number]
        s[:number]          = s[:number] + 1

        s[:stack].push(vertex)
        s[:seen][vertex] = true

        frame[:children] = adjacent(vertex)
        frame[:step]     = :children

      when :children then
        if frame[:children].length > 0 then
          child = frame[:children].shift
          if ! s[:index][child] then
            # Never seen, need to recurse.
            frame[:step] = :after_recursion
            frame[:child] = child
            recur.push({ :node => child })
          elsif s[:seen][child] then
            s[:lowlink][vertex] = [s[:lowlink][vertex], s[:index][child]].min
          end
        else
          if s[:lowlink][vertex] == s[:index][vertex] then
            this_scc = []
            loop do
              top = s[:stack].pop
              s[:seen][top] = false
              this_scc << top
              break if top == vertex
            end
            s[:scc] << this_scc
          end
          recur.pop               # done with this node, finally.
        end

      when :after_recursion then
        s[:lowlink][vertex] = [s[:lowlink][vertex], s[:lowlink][frame[:child]]].min
        frame[:step] = :children

      else
        fail "#{frame[:step]} is an unknown step"
      end
    end
  end

The strongly connected components are used to find cycles in the graph - any strongly connected component with more than one vertex contains a cycle. Cycles are also possible if a vertex declares a dependency on itself. If there are no circular dependencies, then the graph is a DAG (Directed Acyclic Graph).

  # Find all cycles in the graph by detecting all the strongly connected
  # components, then eliminating everything with a size of one as
  # uninteresting - which it is, because it can't be a cycle. :)
  #
  # This has an unhealthy relationship with the 'tarjan' method above, which
  # it uses to implement the detection of strongly connected components.
  def find_cycles_in_graph
    state = {
      :number => 0, :index => {}, :lowlink => {}, :scc => [],
      :stack => [], :seen => {}
    }

    # we usually have a disconnected graph, must walk all possible roots
    vertices.each do |vertex|
      if ! state[:index][vertex] then
        tarjan vertex, state
      end
    end

    # To provide consistent results to the user, given that a hash is never
    # assured to return the same order, and given our graph processing is
    # based on hash tables, we need to sort the cycles internally, as well as
    # the set of cycles.
    #
    # Given we are in a failure state here, any extra cost is more or less
    # irrelevant compared to the cost of a fix - which is on a human
    # time-scale.
    state[:scc].select do |component|
      multi_vertex_component?(component) || single_vertex_referring_to_self?(component)
    end.map do |component|
      component.sort
    end.sort
  end

Traversing the graph

Traversing a graph from source. Passing direction allows walking the graph both “up” and “down”. The comment appears to be wrong - it’s a depth-first search (DFS), not breadth-first (BFS). There’s an easy way to distinguish non-recursive DFS from BFS: DFS uses stacks (last-in-first-out) and BFS uses queues (first-in-first-out).

  # Just walk the tree and pass each edge.
  def walk(source, direction)
    # Use an iterative, breadth-first traversal of the graph. One could do
    # this recursively, but Ruby's slow function calls and even slower
    # recursion make the shorter, recursive algorithm cost-prohibitive.
    stack = [source]
    seen = Set.new
    until stack.empty?
      node = stack.shift
      next if seen.member? node
      connected = adjacent(node, :direction => direction)
      connected.each do |target|
        yield node, target
      end
      stack.concat(connected)
      seen << node
    end
  end

Testing

The test suite is made up of a large number of very short and very clear tests. E.g. one of the tests for finding cycles:

    it "should report multi-vertex loops" do
      add_edges :a => :b, :b => :c, :c => :a
      expect(Puppet).to receive(:err).with(/Found 1 dependency cycle:\n\(Notify\[a\] => Notify\[b\] => Notify\[c\] => Notify\[a\]\)/)
      cycle = @graph.report_cycles_in_graph.first
      expect_cycle_to_include(cycle, :a, :b, :c)
    end

See our article about graph algorithms in Terraform. Terraform is another open-source infrastructure management tool, albeit its use cases are quite different than Puppet’s. It implements many of the same ideas for similar purposes, but in Go.

References

Copyright notice

Puppet is licensed under the Apache-2.0 License.