2017-10-10
The term “Monorepo” gets thrown around a lot, especially in reference to Google’s (or Facebook’s, or Twitter’s) internal development practices, where all source lives in a single repository.
The lesser-known term “Manyrepo” (why this instead of “Polyrepo” is anyone’s guess) describes the more common setup, encouraged by git and GitHub, where each project has its own completely separate source repository.
This article discusses some of the relative merits of Monorepo and Manyrepo architectures, and then attempts to define a hybrid approach preserving as much benefit as possible from both architectures — a “Metarepo”.
What follows is an analysis of some major advantages and disadvantages of a Monorepo architecture with a specific focus on comparison against the more common Manyrepo architecture.
A Monorepo has all source organized in a single hierarchical tree. Related projects or components can be grouped together, allowing users to more easily discover relationships between projects.
Imagine three logically-distinct codebases, implementing:
.pkg
file for distribution.These three projects would typically be modified separately from each other, and in some cases, maybe even by completely different teams.
In a Manyrepo, these may be called <project>-app
,
<project>-cli
, and <project>
. The
link between them would have to be inferred from the shared prefix while
browsing a list of projects, or from documentation in the projects.
If the build project needs to refer to a specific revision of the other two, each project has to define ad-hoc solutions (sometimes git submodules, sometimes by cloning to a build directory and checking out a specified revision, sometimes something more creative).
Submodules may be used to implement a synchronous multi-project commit.
<department/etc.>
└── <sub-team>
└── <project>
├── app
├── build
└── cli
The relationship could then be discovered just by browsing the hierarchy.
(Incidentally, much of this is inspired by Advantages of monolithic version control by Dan Luu.)
In a Monorepo, it’s easy to depend on another project: just use code
from a different path in the hierarchy. At Google, it appears that this
dependency is tracked via build dependencies encoded in their Blaze
(publicly, Bazel) BUILD
files (interestingly, this
probably-accidentally-public-yet-Apache-licensed gist from a Googler
gives some insight into a lot of aspects of their workflow). When a
subtree’s dependencies are all extractable from a BUILD
file, and all of its dependents are also in-tree, it’s possible to
determine the full set of callers and run their tests on each
change.
This fact is important, so I’m going to restate it: with a Monorepo, library versioning is de-emphasized. Instead, a library is expected to maintain a stable API and migrate its callers when the API must change. This depends on being able to make atomic commits across the entire world-state, which is simply not possible in a Manyrepo. This also implies a need for very sophisticated dependency tracking or analysis, and excellent automated testing.
Note that at Google, the Monorepo includes all source used
by projects, not just their own first-party source. For example, if a
Rails project my-app
existed at Google and depended on
Nokogiri, one could assume that it would exist at a path in their
Monorepo along the lines of //third_party/rb/nokogiri
as a
single version, and the project would reference it via a
BUILD
file dependency, maybe something like:
rails_app(name = 'my-app',
...
deps = ['//third_party/rb/nokogiri', ...]
...
)
Then, when a new version of Nokogiri were imported to
//third_party/rb/nokogiri
, the CI run would include tests
for my-app
and any other callers (any subtree rooted with a
BUILD
file, presumably, whose transitive
dependencies include this subtree).
This makes it much less likely for projects to “rot” via rarely-updated libraries.
Knowing that all code available universally exists at a fixed path in a single shared hierarchy makes it easier to build tools to perform operations on multiple projects.
In a Manyrepo architecture, any operation spanning multiple projects first has to know which projects it’s working on, and decide some hierarchical structure for them (typically, a flat namespace for each GitHub account). It must hope that the revisions on the current master branches are compatible if it’s attempting to integrate the projects.
In some cases, it may not even be possible within a Manyrepo to
interact with both projects in source form without modifying one or the
other. Imagine a Rails app running a rubygem specified in
Gemfile
with a git source. In order to perform integration
testing using local copies of both the Rails app and the gem, one first
has to modify the Gemfile
to refer to the gem by path.
Monorepos eliminate this case: The Rails app would indicate a dependency
on the gem in the source tree. If the gem source changed, it would be
rebuilt as a step in the Rails application build, and the application
would refer to the new version.
Relatedly, certain problems like static analysis, global dependency analysis, code search, and so forth are simpler in a Monorepo just because of not having to juggle multiple repositories and hope that master branches represent a consistent state.
First and foremost, this is not the way the open-source world works. Bundler, rubygems, yarn — the list goes on — the common case for these tools is fetching a package from a server with a given version, unpacking it in some managed directory, and running it from there. Many CI systems and CD pipelines will assume they are testing or deploying a unit that maps onto the repository as a whole.
While building a Monorepo does provide an organization with the unique ability to build tooling precisely in alignment with what they need, the burden of having to do so is substantial. This is getting easier over time as Google releases more public versions of their own tools (protobuf, Bazel, GRPC, and Go, to name a few).
The requirement that there be a single version of source and that all dependent source be upgraded synchronously can make it more difficult to iterate on an API, and removes a common tool used to build confidence in less-safe changes. Presumably there are many ways to work around this, but developers coming from the open-source world will not be familiar with them.
Automatically running tests for the full set of transitive dependencies on each change gets expensive, especially because the most complex applications tend to be the ones with the most dependencies.
Care has to be taken to loop the right people in on every change, but at the same time, a downstream consumer of a library won’t always be interested in reviewing internal changes, even though they may break their app through some unexpected means. In a Manyrepo, the application developer will usually manually verify as they upgrade several libraries at once. In a Monorepo, it’s almost completely left up to automated testing to ensure that a library change is safe.
This can, of course, be viewed as a mixed blessing, in that it might encourage better test hygiene.
Again, a mixed blessing. While changing callsites becomes, at least to a degree, the domain of the library author, which could slow down library development, it also encourages more thoughtful interface design.
It is difficult to export fragments of a Monorepo to the internet at large. Google does this using copybara, which literally copies code out of their Monorepo and into public repositories.
Largely recapping what we’ve covered above, a Manyrepo is a collection of smaller repositories: most commonly git repositories all hosted under a single account on GitHub. All of the advantages and disadvantages of Monorepos listed above are in contrast to Manyrepos, so there’s no need to rehash them here.
A few summary assertions based on analysis above:
Many of the advantages of Monorepos are tantalizing, but it’s difficult to envision a way to migrate a multi-thousand-project Manyrepo into a Monorepo wholesale. The number of components that would have to change all at once would be overwhelming — just for starters:
What if we could find some pathway to progressively enhance a Manyrepo architecture into a future state that captured more of the advantages of a Monorepo (and maybe even eased a plausible future transition to a full Monorepo architecture?) Can we find a way to accomplish this without requiring a big-bang flip to a new set of processes, tools, and workflows?
The rest of this document describes one possible solution to this problem, styled a “Metarepo”.
The five-second pitch: A Manyrepo woven into a Monorepo by a suite of tools and a single coordinating repository containing:
Consider again the example given previously of three sub-projects:
foo-app
, A macOS app;foo-cli
, A command-line interface to the same
functionality provided by the app; andfoo
, A packaging repo to bundle the app and the CLI
into a single .pkg
file for distribution.In a Manyrepo architecture, these would be three separate
repositories (foo-app
, foo-cli
, and
foo
). In a Monorepo architecture, they might occupy
subtrees at //prod-eng/dev-infra/foo/{app,cli,build}
).
To translate this into a Metarepo, we would have mappings in a
special repo (say, github.com/<org>/_meta
),
mapping:
github.com/<org>/foo-app
to
//prod-eng/dev-infra/foo/app
);github.com/<org>/foo-app
to
29239d776a58a9e3538de2746b47f84cc16280b7
).The repository would be automatically updated by a job triggered on GitHub’s Organization Webhooks, providing a consistent view of the world at a given moment.
Changes spanning multiple repositories would be handled by registering the change with a higher-level coordination tool than GitHub. This tool could create stub PRs on GitHub, but when the change was merged from the tool, it would merge the PRs on GitHub, but change the revision of each repository in the Metarepo atomically.
Command-line tooling would have to be built to manage sparse-population of this tree and specifying cross-repository dependencies with automatic backfill when a dependency is required, as well as some interaction pathways to move around this tree, update all repositories to the latest head, revisit some previous state, synchronize multi-project changes, and so on.
The most appealing aspect of this strategy from the perspective of a
large established Manyrepo is that the adaptation can be done
progressively. While the Metarepo (<org>/_meta
) is
being developed, developers can continue completely unaware of its
existence. Their help can be enlisted briefly to help develop a
first-pass hierarchy for the existing repositories.
Once the hierarchical structure is ready to roll out, developers would have to adjust to new paths to their existing codebases, but if done correctly by providing some related and compelling tooling at the same time, the benefits of a browsable hierarchical structure alone should prevent major complaints. Once in their project directory, developers would continue to interact with their codebases precisely as they had before, and GitHub webhooks would passively update the Metarepo as each change happened.
Once developers are all adjusted to the hierarchical structure, the
next step is to ship tooling to discourage developers from running
git pull
in a single repo, replacing that workflow with a
command to update the master branch of all checked-out
repositories according to the current state of the Metarepo. At this
point, developers are all working from a consistent snapshot world-view
across projects. They would slowly adapt to the idea that all of the
repositories under the Metarepo-managed directory would typically be
updated for them.
Once developers are adjusted to this consistent worldview, we can
ship tools to make cross-project changes. For this particular workflow,
we could employ Gerrit
instead of GitHub Pull Requests. A user would run some command on the
command line (probably highly analogous to start
,
diff
, upload
, from Android’s repo tool) which
would upload the cross-project diff to Gerrit, while creating “stub” PRs
on GitHub for each project, which would point at the diff on Gerrit.
Gerrit would somehow embed the CI statuses from GitHub and share
reviewer assignments. When the Gerrit issue was approved and “merged”,
the GitHub PRs would be automatically merged, and the Metarepo would
atomically update all component projects. In this way, no developer
could update at the wrong time and get an inconsistent state.
From here, it becomes a game of building more and better tooling around this multi-project workflow until developers prefer it to the existing single-project workflow. Once a majority of developers are using the multi-project workflow, we can start to consider the multi-project workflow primary and begin to discourage the single-project workflow, eventually building tooling that only works with the new workflows.
In parallel with this push away from single-project workflows, we can
start bringing third-party dependencies in-tree and keeping a single
endorsed version of each. Dependencies on these third-party modules
would be made understandable by tooling (and probably indexed in the
<org>/_meta
repo), and CI for changes to these
projects would include all projects whose transitive dependencies
include it.
Once the organization is fully adapted to workflows that are
indifferent to the underlying number of unit repositories, it becomes
tractable to grow the Metarepo into a true Monorepo (whether or not it’s
even desirable at that point is hard to assess from this vantage point).
Migrating all code into the <org>/_meta
repo using
something like GVFS
would be one way to pull this off.
A Metarepo can provide precisely the same discoverability as a Monorepo, since the hierarchical structure appears exactly the same as if it were in fact a Monorepo. The underlying composition of thousands of individual repositories actually provides a natural unit for sparse-checkout: the first time a directory is entered, the backing repository can be fetched from the git server.
Since, like in a Monorepo, all unit repositories are held at a consistent snapshot world-view, we can derive the same benefits as a Monorepo, though in a true Monorepo like at Google, all third-party code is held in the same Monorepo. This is entirely possible with a Metarepo, and not really a stretch of the technology, but it’s a big cultural adjustment that would take some time and finesse to implement.
Monorepos make tooling very simple to write. Metarepos can make it a bit simpler, since you benefit from the fixed path and consistent worldview, but the fact of having many underlying repositories means we lose a bit of the benefit on this one. It’s unclear how much exactly.
Almost all of the Monorepo Disadvantages we discussed previously are nearly-identical in breadth and depth in a Metarepo. Simulating the advantages of a Monorepo brings its disadvantages along for the ride, namely:
One specific disadvantage though — Managing a Public/Private Split — preserves more of the characteristics of a Manyrepo. Since the underlying storage of a public component would still be a separate git repository, it could simply be backed by a public repository on GitHub.
A Metarepo is probably the most reasonable way for a Manyrepo organization to progressively transition toward a Monorepo architecture. Obviously this is a huge project, but this document has outlined some broad steps that can accomplish the change progressively (building the plane while flying it).