Homebrew - Downloading Software From a Variety of Sources [Ruby]

Status: PUBLISHED
Project: Homebrew
Project home page: https://github.com/Homebrew/brew
Language: Ruby
Tags: #strategy #template-method

Help Code Catalog grow: suggest your favorite code or weight in on open article proposals.

Table of contents

Context
Problem
Overview
Implementation details
Testing
Observations
References
Copyright notice

Context

Homebrew is The Missing Package Manager for macOS (or Linux). It offers a command line interface with commands like brew install <package>.

Available packages are described with Ruby formulae that include download URLs and source code repositories, e.g.

class Python3 < Formula
  homepage "https://www.python.org/"
  url "https://www.python.org/ftp/python/3.4.3/Python-3.4.3.tar.xz"
  sha256 "b5b3963533768d5fc325a4d7a6bd6f666726002d696f1d399ec06b043ea996b8"
  head "https://hg.python.org/cpython", :using => :hg

Formulae can also include URLs of their dependencies.

Problem

Homebrew needs to download and unpack the resource from the specified location, but how to do it depends on the exact URL. For instance, some formulae require downloading archives with curl; some other require cloning repositories from version control systems such as Git or Mercurial. Some resources need to be unpacked, some don’t.

Homebrew is expected to figure out how to download the resource automatically, but formulae authors should be allowed to explicitly specify the right mechanism to download and unpack the resource.

Overview

Homebrew employs the Strategy pattern to select a downloading algorithm at runtime. Various strategies, such as GitHubGitDownloadStrategy, GitDownloadStrategy, SubversionDownloadStrategy, CurlDownloadStrategy or LocalBottleDownloadStrategy implement fetch(timeout) and a few other methods.

Strategies inherit from AbstractDownloadStrategy which provides common functionality, e.g. unpacking the download, caching, turning logging on and off.

The exact strategy is picked based on the URL, with a way to choose a strategy explicitly.

A similar approach is applied to unpacking resourced with UnpackStrategy. The details are out of the scope of this article.

There are numerous intricacies around the usage of cookies, user agents, HTTP redirects, git submodules and so on.

Implementation details

Choosing the download strategy. It’s inferred from the URL unless it’s overridden with using. If you are new to Ruby, if using.is_a?(Class) && using < AbstractDownloadStrategy must look very strange. It means that using is a class and is a subclass of AbstractDownloadStrategy.

# Helper class for detecting a download strategy from a URL.
#
# @api private
class DownloadStrategyDetector
  def self.detect(url, using = nil)
    if using.nil?
      detect_from_url(url)
    elsif using.is_a?(Class) && using < AbstractDownloadStrategy
      using
    elsif using.is_a?(Symbol)
      detect_from_symbol(using)
    else
      raise TypeError,
            "Unknown download strategy specification #{using.inspect}"
    end
  end

  def self.detect_from_url(url)
    case url
    when GitHubPackages::URL_REGEX
      CurlGitHubPackagesDownloadStrategy
    when %r{^https?://github\.com/[^/]+/[^/]+\.git$}
      GitHubGitDownloadStrategy
    when %r{^https?://.+\.git$},
         %r{^git://},
         %r{^https?://git\.sr\.ht/[^/]+/[^/]+$}
      GitDownloadStrategy
    when %r{^https?://www\.apache\.org/dyn/closer\.cgi},
         %r{^https?://www\.apache\.org/dyn/closer\.lua}
      CurlApacheMirrorDownloadStrategy
    when %r{^https?://(.+?\.)?googlecode\.com/svn},
         %r{^https?://svn\.},
         %r{^svn://},
         %r{^svn\+http://},
         %r{^http://svn\.apache\.org/repos/},
         %r{^https?://(.+?\.)?sourceforge\.net/svnroot/}
      SubversionDownloadStrategy
    when %r{^cvs://}
      CVSDownloadStrategy
    when %r{^hg://},
         %r{^https?://(.+?\.)?googlecode\.com/hg},
         %r{^https?://(.+?\.)?sourceforge\.net/hgweb/}
      MercurialDownloadStrategy
    when %r{^bzr://}
      BazaarDownloadStrategy
    when %r{^fossil://}
      FossilDownloadStrategy
    else
      CurlDownloadStrategy
    end
  end

  def self.detect_from_symbol(symbol)
    case symbol
    when :hg                     then MercurialDownloadStrategy
    when :nounzip                then NoUnzipCurlDownloadStrategy
    when :git                    then GitDownloadStrategy
    when :bzr                    then BazaarDownloadStrategy
    when :svn                    then SubversionDownloadStrategy
    when :curl                   then CurlDownloadStrategy
    when :cvs                    then CVSDownloadStrategy
    when :post                   then CurlPostDownloadStrategy
    when :fossil                 then FossilDownloadStrategy
    else
      raise TypeError, "Unknown download strategy #{symbol} was requested."
    end
  end
end

AbstractStrategy (base class for all strategies):

# @abstract Abstract superclass for all download strategies.
#
# @api private
class AbstractDownloadStrategy

  # ...

  # The download URL.
  #
  # @api public
  sig { returns(String) }
  attr_reader :url

  def initialize(url, name, version, **meta)
    @url = url
    @name = name
    @version = version
    @cache = meta.fetch(:cache, HOMEBREW_CACHE)
    @meta = meta
    @quiet = false
    extend Pourable if meta[:bottle]
  end

  # Download and cache the resource at {#cached_location}.
  #
  # @api public
  def fetch(timeout: nil); end

  # Unpack {#cached_location} into the current working directory.
  #
  # Additionally, if a block is given, the working directory was previously empty
  # and a single directory is extracted from the archive, the block will be called
  # with the working directory changed to that directory. Otherwise this method
  # will return, or the block will be called, without changing the current working
  # directory.
  #
  # @api public
  def stage(&block)
    UnpackStrategy.detect(cached_location,
                          prioritise_extension: true,
                          ref_type: @ref_type, ref: @ref)
                  .extract_nestedly(basename:             basename,
                                    prioritise_extension: true,
                                    verbose:              verbose? && !quiet?)
    chdir(&block) if block
  end

   # ...

CurlDownloadStrategy, as its name suggests, ultimately calls curl with the right set of arguments.

Strategies for various version control systems (Git, Mercurial, Subversion, etc.) inherit from VCSDownloadStrategy. It follows the Template Method pattern: fetch is implemented in VCSDownloadStrategy, but it relies on child strategies to implement methods such as clone_repo.

# @abstract Abstract superclass for all download strategies downloading from a version control system.
#
# @api private
class VCSDownloadStrategy < AbstractDownloadStrategy
  REF_TYPES = [:tag, :branch, :revisions, :revision].freeze

  def initialize(url, name, version, **meta)
    super
    @ref_type, @ref = extract_ref(meta)
    @revision = meta[:revision]
    @cached_location = @cache/"#{name}--#{cache_tag}"
  end

  # Download and cache the repository at {#cached_location}.
  #
  # @api public
  def fetch(timeout: nil)
    end_time = Time.now + timeout if timeout

    ohai "Cloning #{url}"

    if cached_location.exist? && repo_valid?
      puts "Updating #{cached_location}"
      update(timeout: timeout)
    elsif cached_location.exist?
      puts "Removing invalid repository from cache"
      clear_cache
      clone_repo(timeout: end_time)
    else
      clone_repo(timeout: end_time)
    end

    version.update_commit(last_commit) if head?

    return if @ref_type != :tag || @revision.blank? || current_revision.blank? || current_revision == @revision

    raise <<~EOS
      #{@ref} tag should be #{@revision}
      but is actually #{current_revision}
    EOS
  end

  def fetch_last_commit
    fetch
    last_commit
  end

  def commit_outdated?(commit)
    @last_commit ||= fetch_last_commit
    commit != @last_commit
  end

  def head?
    version.respond_to?(:head?) && version.head?
  end

  # @!attribute [r] last_commit
  # Return last commit's unique identifier for the repository.
  # Return most recent modified timestamp unless overridden.
  #
  # @api public
  sig { returns(String) }
  def last_commit
    source_modified_time.to_i.to_s
  end

  private

  def cache_tag
    raise NotImplementedError
  end

  def repo_valid?
    raise NotImplementedError
  end

  sig { params(timeout: T.nilable(Time)).void }
  def clone_repo(timeout: nil); end

  sig { params(timeout: T.nilable(Time)).void }
  def update(timeout: nil); end

  def current_revision; end

  def extract_ref(specs)
    key = REF_TYPES.find { |type| specs.key?(type) }
    [key, specs[key]]
  end
end

GitDownloadStrategy. clone_repo is called by GitDownloadStrategy#fetch. To interact with git, it calls the git command installed on the user machine directly, rather than use a library like ruby-git.

# Strategy for downloading a Git repository.
#
# @api public
class GitDownloadStrategy < VCSDownloadStrategy
  def initialize(url, name, version, **meta)
    super
    @ref_type ||= :branch
    @ref ||= "master"
  end
  
  # ...

  sig { params(timeout: T.nilable(Time)).void }
  def clone_repo(timeout: nil)
    command! "git", args: clone_args, timeout: timeout&.remaining

    command! "git",
             args:    ["config", "homebrew.cacheversion", cache_version],
             chdir:   cached_location,
             timeout: timeout&.remaining
    checkout(timeout: timeout)
    update_submodules(timeout: timeout) if submodules?
  end
end

Code to resolve a URL (follow redirects), get file name, modification time and size:

  def resolve_url_basename_time_file_size(url, timeout: nil)
    @resolved_info_cache ||= {}
    return @resolved_info_cache[url] if @resolved_info_cache.include?(url)

    if (domain = Homebrew::EnvConfig.artifact_domain)
      url = url.sub(%r{^((ht|f)tps?://)?}, "#{domain.chomp("/")}/")
    end

    out, _, status= curl_output("--location", "--silent", "--head", "--request", "GET", url.to_s, timeout: timeout)

    lines = status.success? ? out.lines.map(&:chomp) : []

    locations = lines.map { |line| line[/^Location:\s*(.*)$/i, 1] }
                     .compact

    redirect_url = locations.reduce(url) do |current_url, location|
      if location.start_with?("//")
        uri = URI(current_url)
        "#{uri.scheme}:#{location}"
      elsif location.start_with?("/")
        uri = URI(current_url)
        "#{uri.scheme}://#{uri.host}#{location}"
      elsif location.start_with?("./")
        uri = URI(current_url)
        "#{uri.scheme}://#{uri.host}#{Pathname(uri.path).dirname/location}"
      else
        location
      end
    end

    content_disposition_parser = Mechanize::HTTP::ContentDispositionParser.new

    parse_content_disposition = lambda do |line|
      next unless (content_disposition = content_disposition_parser.parse(line.sub(/; *$/, ""), true))

      filename = nil

      if (filename_with_encoding = content_disposition.parameters["filename*"])
        encoding, encoded_filename = filename_with_encoding.split("''", 2)
        filename = URI.decode_www_form_component(encoded_filename).encode(encoding) if encoding && encoded_filename
      end

      # Servers may include '/' in their Content-Disposition filename header. Take only the basename of this, because:
      # - Unpacking code assumes this is a single file - not something living in a subdirectory.
      # - Directory traversal attacks are possible without limiting this to just the basename.
      File.basename(filename || content_disposition.filename)
    end

    filenames = lines.map(&parse_content_disposition).compact

    time =
      lines.map { |line| line[/^Last-Modified:\s*(.+)/i, 1] }
           .compact
           .map { |t| t.match?(/^\d+$/) ? Time.at(t.to_i) : Time.parse(t) }
           .last

    file_size =
      lines.map { |line| line[/^Content-Length:\s*(\d+)/i, 1] }
           .compact
           .map(&:to_i)
           .last

    basename = filenames.last || parse_basename(redirect_url)

    @resolved_info_cache[url] = [redirect_url, basename, time, file_size]
  end

Testing

Testing inferring strategies from URLs is rather straightforward. Note that it doesn’t bother to test some seemingly trivial scenarios like explicitly choosing a strategy, or exercising all URL patterns.

describe DownloadStrategyDetector do
  describe "::detect" do
    subject(:strategy_detector) { described_class.detect(url, strategy) }

    let(:url) { Object.new }
    let(:strategy) { nil }

    context "when given Git URL" do
      let(:url) { "git://example.com/foo.git" }

      it { is_expected.to eq(GitDownloadStrategy) }
    end

    context "when given a GitHub Git URL" do
      let(:url) { "https://github.com/homebrew/brew.git" }

      it { is_expected.to eq(GitHubGitDownloadStrategy) }
    end

    it "defaults to curl" do
      expect(strategy_detector).to eq(CurlDownloadStrategy)
    end

    it "raises an error when passed an unrecognized strategy" do
      expect {
        described_class.detect("foo", Class.new)
      }.to raise_error(TypeError)
    end
  end
end

Testing CurlDownloadStrategy boils down to checking that curl is called with the correct arguments.

describe CurlDownloadStrategy do
  subject(:strategy) { described_class.new(url, name, version, **specs) }

  let(:name) { "foo" }
  let(:url) { "https://example.com/foo.tar.gz" }
  let(:version) { "1.2.3" }
  let(:specs) { { user: "download:123456" } }

  it "parses the opts and sets the corresponding args" do
    expect(strategy.send(:_curl_args)).to eq(["--user", "download:123456"])
  end

  describe "#fetch" do
    before do
      strategy.temporary_path.dirname.mkpath
      FileUtils.touch strategy.temporary_path
    end

    it "calls curl with default arguments" do
      expect(strategy).to receive(:curl).with(
        # example.com supports partial requests.
        "--continue-at", "-",
        "--location",
        "--remote-time",
        "--output", an_instance_of(Pathname),
        url,
        an_instance_of(Hash)
      )

      strategy.fetch
    end

  # ...
end

Tests for the rest of DownloadStrategies.

Observations

It’s a little surprising that all strategies are located in the same file. Interestingly, test files for different strategies are different.
The name of the default Git branch is hard-coded to “master”. Since that, git has changed it to “main”.
Unit test coverage differs per strategy. CurlDownloadStrategy spec, for instance, appears rather comprehensive, while specs for git-based strategies - not so much.

References

Copyright notice

Homebrew is licensed under the BSD 2-Clause “Simplified” License.