This document collects definitions of terms and background information about the Open Source Insights project’s view of the open source software development ecosystem.
The documentation is necessary because the open source world is more complex than many may realize, and because the conventions, culture, terminology and technology vary widely among the many packaging systems in use (npm, Cargo, and so on). Open Source Insights has chosen a lexicon that works well across the whole ecosystem, but someone familiar with only one system, say npm, may find some terms incorrect or misleading without context. So we must define and explain our terms to avoid confusion.
For ease of navigation, the entries are listed in alphabetical order. If you are reading this for the first time, it’s probably best to start with the definition of System and proceed from there.
The authors of a package, in Insights terminology, are the contributors to the package: the developers who wrote the software. Depending on local custom about how the list is maintained, the set of authors may only approximate the full set of contributors. See also the entry for Owner.
Information about crates is fetched using the crates.io API (at https://github.com/hcpl/crates.io-http-api-reference), with an occasional full sync to guarantee completeness.
A dependency of a package or version is a separate piece of software that is imported by the package for the build. For example, if a package imports a library to process JSON artifacts, that library is a dependency of the package.
There are two related types of dependencies. The first is a direct dependency, which is one that is listed by the package being built as a necessary component. In our example above, the JSON library would be a direct dependency of the package importing it.
The other type is an indirect or transitive dependency. This is a dependency that is needed by the package because one of its own dependencies needs it but the package itself does not. In our example, if the JSON library in turn imports a formatted print library that is not used by the top-level package, that is a transitive dependency:
package ⇐ JSON library ⇐ print library
The transitive dependency may of course import even more dependencies. The transitive closure of all such imports defines the full set of dependencies needed to build the package. Depending on the system, that set usually forms a directed acyclic graph, which is called the dependency graph. In Insights, a package’s dependency graph can be seen using the graph viewer linked from the upper right of the Dependencies page.
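As a minimal sketch of the transitive closure described above, the following Python walks a map of direct dependencies to collect the full set. The package names and the `DIRECT_DEPS` map are hypothetical and simply mirror the JSON example in the text; this is not how Insights itself computes graphs.

```python
from collections import deque

# Hypothetical direct-dependency map mirroring the example in the text:
# each key imports the packages in its list.
DIRECT_DEPS = {
    "package": ["json-library"],
    "json-library": ["print-library"],
    "print-library": [],
}


def transitive_dependencies(root, direct_deps):
    """Return every package reachable from root, excluding root itself."""
    seen = {root}
    queue = deque(direct_deps.get(root, []))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already visited; this also guards against cycles
        seen.add(name)
        queue.extend(direct_deps.get(name, []))
    return seen - {root}


def indirect_dependencies(root, direct_deps):
    """Dependencies needed only because another dependency needs them."""
    direct = set(direct_deps.get(root, []))
    return transitive_dependencies(root, direct_deps) - direct
```

Here `transitive_dependencies("package", DIRECT_DEPS)` yields both libraries, while `indirect_dependencies` yields only the print library.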
As well as dependencies that are compiled into the package, there are other classes, such as dependencies that are needed only by tests, or needed to build the package but not by the compiled package. These are called test dependencies and development dependencies, and there may be other classes as well.
An important detail to understand about the dependency graph is that its contents depend on who is asking, and for what purpose. A developer thinking of using the package probably only needs to know which dependencies will be added to the software being built if the package is installed. On the other hand, a developer who maintains the package would also be interested in dependencies needed to build and test the package. The resulting dependency graph might include code generators, test libraries, even potentially the compiler’s own packages.
Even a developer just using the package might want to test it. If the package has a small dependency graph in isolation, but uses a large and complex testing package with many dependencies, the dependency graph can change dramatically as a result.
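To illustrate how the graph depends on who is asking, here is a rough sketch in Python. The edge list, the class labels ("runtime", "test", "development") and the rule that test and development edges are followed only from the root package are all assumptions made for illustration; real systems differ in how they classify and propagate dependency classes.

```python
# Hypothetical edges: (importer, dependency, class). All names are invented.
EDGES = [
    ("package", "json-library", "runtime"),
    ("json-library", "print-library", "runtime"),
    ("package", "test-framework", "test"),
    ("test-framework", "mock-library", "runtime"),
    ("package", "code-generator", "development"),
]


def dependency_graph(root, edges, root_classes):
    """Collect the packages reachable from root.

    Edges leaving root are followed if their class is in root_classes.
    Edges leaving any other package are followed only if they are
    runtime edges (an assumption of this sketch).
    """
    reached = set()
    stack = [root]
    while stack:
        current = stack.pop()
        allowed = root_classes if current == root else {"runtime"}
        for importer, dep, dep_class in edges:
            if importer != current or dep_class not in allowed:
                continue
            if dep != root and dep not in reached:
                reached.add(dep)
                stack.append(dep)
    return reached


# A developer who only installs the package.
user_view = dependency_graph("package", EDGES, {"runtime"})
# A developer who also builds and tests the package.
maintainer_view = dependency_graph(
    "package", EDGES, {"runtime", "test", "development"}
)
```

In this toy data the user view contains two packages and the maintainer view five, because the test framework pulls in a dependency of its own.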
The Insights project provides the package owner view, the complete dependency graph someone maintaining the package would need. At the moment there is no ability to control which classes of dependencies are included.
A dependent is the inverse of a dependency. If package P has package Q as a dependency then P is a dependent of Q. In the section on dependencies, we show an example dependency graph where a package imports a JSON library that imports a printing package:
package ⇐ JSON library ⇐ print library
In this example, the top-level package is a dependent of the JSON library, and both are dependents of the print library.
As with dependencies, a package may be a direct or indirect dependent of another package. Direct dependents are packages that explicitly import this package to employ its functionality, and their number serves as a measure of the visibility and popularity of the package. Indirect dependents arise from transitive imports, and their number serves as a measure of how critical the package is to an ecosystem. A package may have only a few direct dependents, but if those packages are in turn widely used, the package may have large numbers of other packages that depend on it silently, making the package a vital component of the ecosystem.
Packages with many transitive dependents are therefore critical to their system. When such a package breaks or is compromised, that can affect all the packages that depend upon it. Migrating all dependents away from the package or to a repaired version can be expensive and a difficult, distributed task to complete.
The dependent graph can be constructed by inverting the dependency graph.
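The inversion can be sketched in a few lines of Python. The `DEPENDENCIES` map is a hypothetical stand-in for real data and reuses the JSON example; the second function collects direct and indirect dependents by walking the inverted map.

```python
from collections import defaultdict

# Hypothetical package -> direct dependencies map from the example above.
DEPENDENCIES = {
    "package": ["json-library"],
    "json-library": ["print-library"],
    "print-library": [],
}


def invert(dependencies):
    """Turn a package -> dependencies map into package -> direct dependents."""
    dependents = defaultdict(set)
    for package, deps in dependencies.items():
        dependents[package]  # make sure every package gets an entry
        for dep in deps:
            dependents[dep].add(package)
    return dict(dependents)


def all_dependents(package, dependents):
    """Return the direct and indirect dependents of package."""
    found = set()
    stack = [package]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    found.discard(package)  # only relevant if the graph has a cycle
    return found
```

With this data, the print library has one direct dependent (the JSON library) and two dependents overall, matching the example in the text.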
The Go system refers to modules tracked by the Go language’s module system. Go modules are the unit of versioning for the Go software ecosystem; each module version contains a bundle of one or more Go packages (distinct from Insights packages) that are used to build Go programs. Insights treats a repository that contains Go packages but doesn’t have an explicit module declaration as an implicit module.
Unlike most other systems, Go users do not download from a central packaging authority analogous to npmjs.org for npm. The true home of a Go module is the source code on its hosting site, often GitHub, but most users fetch modules through the secure public module mirror at proxy.golang.org. Insights discovers new modules by reading the module mirror’s index at index.golang.org/index, as well as periodically scanning GitHub.
The Go package discovery site pkg.go.dev provides an index to modules and their repositories.
The term hosting site refers to a usually public website that stores copies of source code repositories for access by developers. The biggest example is of course GitHub, which stores Git repositories, but others are also important. Insights tracks repositories stored at GitHub (Git), GitLab (Git), BitBucket (Git, Mercurial), and more. These sites, and the packages they host, are discovered by following dependencies and other package information as source code and metadata are analyzed.
Maven has several significant hosting sites. For now, the Insights project defines the Maven system to hold packages managed by the Maven Central service. (Insights actually tracks the packages through a separate, Google-internal mirror of the site.)
For npm, Insights uses the public API, tracking https://replicate.npmjs.com/_all_docs to enumerate packages and https://registry.npmjs.org to extract information about each one. Insights also periodically fetches the full list of packages to guarantee completeness.
Information about NuGet packages is fetched from the NuGet Server API.
The owners of a package are the set of people or institutions identified by the packaging authority or hosting site to have the right to change the package’s software or metadata. They are usually a small set, whereas the list of authors, which includes all who have contributed, may number in the thousands.
A package-based system is one in which a central website, also known as a packaging authority, holds the source code and metadata, such as versioning, for all the packages in the system. The npm system works this way: to install or examine a Node package, one uses a tool or web browser to fetch information about the package, or the package itself, from npmjs.org. In contrast to a repository-based system, a package-based system therefore provides a central, maintained authority for information about all packages in the system.
Even though the packages can all be accessed through the central website, many packages also have a copy stored on a hosting site so development can be tracked using code versioning software such as Git. In such cases, metadata on the system’s website for the package will usually identify the location of the corresponding Git repository. In practice, it is possible for there to be skew between the separate repository and the source stored on the packaging site, but as a rule users will fetch the package from the packaging site, in effect grabbing a published snapshot, and not from the repository.
A packaging authority is the entity that manages the state of the packages within a system, including defining the rules of versioning and requirement specification, as well as storing released versions of the packages and making them available for download. For package-based systems, it is typically accessed by a website. For instance, the npm system’s packaging authority is at npmjs.org, while Cargo’s is at crates.io. Through these sites, developers can not only download the package software, they can search for appropriate packages, see social signals about the packages, and so on.
For repository-based systems, on the other hand, the software is held external to the packaging authority, such as in a Git repository. The central authority in such systems does still provide all the other services, such as search and per-package social signals. An example is pkg.go.dev for the Go module system. However, the Go project also provides a cryptographically secured download mirror at proxy.golang.org, analogous to but different in nature from the download services provided by other packaging authorities.
A vital role of packaging authorities is curating and reporting on vulnerabilities.
Information about PyPI packages is fetched using the JSON API. Release archives are then downloaded and cached to have their metadata analyzed. Insights learns about new and updated packages from PyPI’s RSS Feeds, and occasionally syncs the list of packages with the Simple Repository API.
A repository is a body of code, typically of a software package or set of packages, held inside a version control system such as Git, possibly hosted at a hosting site such as github.com. In Insights, only public, network-accessible repositories are supported, at least for now. In the future, the service may provide tools to scan privately held repositories.
(The meaning of version in that definition is that of the version control system, not the definition used by Insights. Although these terms correspond to some extent, they are not the same.)
A repository is named by a URL-like string (without a protocol component) such as github.com/google/licensecheck. Inside the repository will be any number of commits known to the version control software, some of which will be used by outside users.
A repository-based system is one in which the reference copy of the source code for each package in the system is stored on a hosting site such as GitHub. The Go language’s packaging system works this way: Each package is identified by a “package path” that is, in effect, a URL to the Git repository for the package stored on a hosting site. In this model, the system’s information is distributed across all the hosting sites that store code for the system, with no central location for managing the packages, in contrast to a package-based system, which has a central authority for all packages.
Because it has no single central location for package information, a repository-based system must use features of the source code management system to maintain the versioning model for its packages.
A requirement, also known as a constraint, is used by a package or a version of a package to select which version or set of versions of a dependency is to be used when building the package. (In Go, this is mostly done at the module level.) A requirement can be a plain version specifier such as 1.2.3, or some notation that defines a range of versions, such as >=1.2.3 to mean any version from 1.2.3 onwards.
Requirements are usually interpreted within the rules of semantic versioning (semver), as described at the Semver 2.0 web page. However, although the semver definition proposes a standard for version specification, it offers no guidance on how to specify version requirements.
As a result, each system has developed its own rules about how requirements are specified and interpreted. Most systems use an algebraic notation to specify requirements, but the operators and their interpretation are highly system-dependent. NuGet and Maven use a set-like notation using closed and open limits defined by brackets and parentheses: [1.2.3,2.3.4) means anything from 1.2.3 up to, but not including, 2.3.4. Most of the others use limit operators such as >, >=, <, <=, and so on. Some implement special operators limiting the range of choices. For example, the ^ operator fixes the major version number, so ^1.2.3 in Cargo notation is equivalent to [1.2.3,2.0.0). Many systems also admit asterisks to act as wildcards, as in 1.2.*.
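The set-like notation can be made concrete with a small Python sketch. It handles only the simplest form, with both limits present and plain three-part numeric versions; the function names are invented for illustration, and real NuGet and Maven parsers accept more forms (open-ended ranges, for example) than this does.

```python
def parse_version(text):
    """Parse a plain dotted version such as '1.2.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def parse_interval(text):
    """Parse '[1.2.3,2.3.4)' into (low, low_inclusive, high, high_inclusive)."""
    text = text.strip()
    if text[0] not in "[(" or text[-1] not in "])":
        raise ValueError(f"not an interval: {text!r}")
    low_text, high_text = text[1:-1].split(",")
    return (
        parse_version(low_text),
        text[0] == "[",   # square bracket: the limit itself is included
        parse_version(high_text),
        text[-1] == "]",  # parenthesis: the limit itself is excluded
    )


def in_interval(version, interval):
    """Report whether version lies within a parsed interval."""
    low, low_inclusive, high, high_inclusive = interval
    candidate = parse_version(version)
    above = candidate >= low if low_inclusive else candidate > low
    below = candidate <= high if high_inclusive else candidate < high
    return above and below
```

For the interval `[1.2.3,2.3.4)`, version 1.2.3 is accepted and 2.3.4 is rejected, because the lower limit is closed and the upper limit is open.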
In the Go system, a requirement is specified by a simple version specifier with no operator, indicating the lowest version of the dependency that is compatible with the version of the module being built.
Cargo provides an algebraic notation for specifying requirements, but also interprets a plain version string as a requirement that, as in Go, specifies the minimum version but unlike Go requires that the dependency match the major version number. That is, the requirement 1.2.3 with no operator means the same as ^1.2.3.
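The equivalence between a plain version and a caret requirement can be sketched as follows. This is an illustration rather than Cargo's implementation: it covers only full three-part versions whose major number is at least 1, and it deliberately rejects major version 0, where the caret rules are different. The function names are hypothetical.

```python
def caret_range(requirement):
    """Expand '^1.2.3' or a plain '1.2.3' into (low, high), low <= v < high."""
    text = requirement.lstrip("^")  # a plain version is treated like ^version
    major, minor, patch = (int(part) for part in text.split("."))
    if major == 0:
        raise ValueError("major version 0 is outside the scope of this sketch")
    return (major, minor, patch), (major + 1, 0, 0)


def matches(version, requirement):
    """Report whether a three-part version satisfies the requirement."""
    low, high = caret_range(requirement)
    candidate = tuple(int(part) for part in version.split("."))
    return low <= candidate < high
```

Both `caret_range("1.2.3")` and `caret_range("^1.2.3")` produce the half-open range from 1.2.3 up to, but not including, 2.0.0.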
Requirement algebras are not always well documented but an excellent empirical resource exists for npm. The website semver.npmjs.com is an interactive calculator that allows one to explore how a requirement will be evaluated for a given npm package.
Resolution is the process of evaluating the full set of requirements of a version of a package, transitively across all its dependencies, to compute the exact versions of each dependency that the version depends on. In other words, for each dependency, resolution narrows the range defined by the requirements in the packages that require it, resulting in a “resolved” version or versions of each dependency to be installed. This process must be done when a package/version is installed or updated, so that the installed software is built from the components specified by the package maintainer.
In general, the problem is NP-complete, but systems usually adjust the algorithm to simplify its behavior, to reduce its computational complexity, and to give the resolution the desired properties. Go, for instance, uses minimal version selection, which requires that, when multiple versions match a requirement, the lowest-numbered version is the one to take, since that is arguably the most stable. Most other systems would instead choose the most recent, since that has arguably had more bugs fixed. Some, like Maven, even allow build metadata to influence this selection.
Correct resolution requires satisfying multiple requirements simultaneously. If a package P is required by several dependencies D₁, D₂, … of the top-level version, each Dᵢ may specify a different requirement for P, and the version of P selected by the resolution algorithm must satisfy all of them. For example, if D₁ requires P >= 1.2 and D₂ requires P >= 1.3 and P <= 1.9, the version of P chosen must be in the range 1.3 up to 1.9.
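The contrast between choosing the lowest and the newest matching version can be shown with a deliberately simplified Python sketch. It considers a single minimum-version requirement against a hypothetical list of available versions, and it assumes that only versions sharing the requirement's major number match. Real resolvers, including Go's minimal version selection, operate over the whole dependency graph, which this does not attempt.

```python
def parse(version):
    """Parse a plain three-part version into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def matching(available, minimum):
    """Versions at or above minimum that share its major number, in order."""
    low = parse(minimum)
    hits = [v for v in available if parse(v) >= low and parse(v)[0] == low[0]]
    return sorted(hits, key=parse)


def select_lowest(available, minimum):
    """Pick the lowest matching version, in the spirit of Go's approach."""
    candidates = matching(available, minimum)
    return candidates[0] if candidates else None


def select_newest(available, minimum):
    """Pick the most recent matching version, as most other systems would."""
    candidates = matching(available, minimum)
    return candidates[-1] if candidates else None


# Hypothetical published versions of one dependency.
AVAILABLE = ["1.2.0", "1.3.0", "1.4.1", "1.10.0", "2.0.0"]
```

Given the requirement 1.3.0, `select_lowest` returns 1.3.0 and `select_newest` returns 1.10.0. Note that versions are compared numerically, so 1.10.0 sorts after 1.4.1.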
Resolution depends not only on the requirements, but also on which versions actually exist. To install a version of P in the range 1.3 through 1.9 requires that there be a version of P available in that range.
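The worked example can be checked with a short Python sketch that intersects the requirements and then filters a list of available versions. The representation of a requirement as an inclusive (low, high) pair, the padding of "1.9" to 1.9.0, and the list of available versions are all assumptions made for illustration; systems differ on how a bound such as <= 1.9 treats 1.9.1.

```python
def parse(version):
    """Parse '1.3' or '1.3.0' into a three-part tuple, padding with zeros."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def intersect(ranges):
    """Intersect inclusive (low, high) ranges; high is None when unbounded."""
    low = max(parse(lo) for lo, _ in ranges)
    highs = [parse(hi) for _, hi in ranges if hi is not None]
    high = min(highs) if highs else None
    return low, high


def candidates(available, ranges):
    """Return the available versions that satisfy every range."""
    low, high = intersect(ranges)
    return [
        v for v in available
        if parse(v) >= low and (high is None or parse(v) <= high)
    ]


# D1 requires P >= 1.2; D2 requires P >= 1.3 and P <= 1.9.
REQUIREMENTS = [("1.2", None), ("1.3", "1.9")]
```

If the available versions are 1.2.0, 1.3.0, 1.5.2, 1.9.0, 1.9.1 and 2.0.0, only 1.3.0, 1.5.2 and 1.9.0 survive. If none of the available versions falls in the range, the list is empty and resolution cannot succeed on these requirements alone.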
In general, there may be no single version that satisfies all the conditions during resolution: the requirements may be incompatible, they may specify a range of versions that do not exist, or they may conflict with external restrictions. What happens in that case is system-dependent. Many systems flag this situation as an error, but not all. Maven, for instance, has the notion of a soft requirement, which if present allows the resolution algorithm to choose another version of P that is available, while npm on the other hand allows the resolution algorithm to choose two or more distinct versions of P in order to satisfy conflicting requirements.
In most systems, a lock file can be supplied as part of the build. This captures the result of a complete precomputed resolution of the version’s dependencies. However, re-evaluating the resolution from the dependency requirements may give a result different from the lock file if, since the lock file was generated, new versions have been introduced or the dependencies and their transitive requirements have changed.
Semantic Versioning (semver)
Semantic versioning is the name applied to the method of numerically incrementing a multipart version number as each version is released, in a way that supports backward compatibility and identifies incompatible versions. An emerging standard method of semantic versioning, called Semver 2.0, often just called “semver”, is defined at semver.org. In that variant, a version is identified by a three-part numerical string such as 1.2.3, where the first number identifies a “major” version, the second a “minor” update, and the third a “patch”. A compatible change that leaves the package interface unchanged requires updating only the patch number; a version that adds symbols to the package interface should update the minor number, and a change that introduces incompatibilities with the previous version should increment the major number. Semver also specifies annotations for versions such as prerelease and build tags; see the full specification for more.
Semver also defines the order of package versions, in order to specify exactly which versions match a given requirement.
To simplify development, Semver states that version 0 has weaker compatibility guarantees. For a given package, version 1.0.0 is the first version that is required to honor compatibility.
In practice, some systems follow the Semver model well, but many do not, and none enforce it. For instance, no packaging system verifies that an incompatible change is identified by a new major version. Compliance can be checked only by testing, even in systems such as Go that depend strongly on it.
Here is a summary of how the various systems approach versioning:
- Cargo: Cargo follows Semver 2.0 fairly closely.
- Go: The Go system follows Semver 2.0, and its Minimal Version Selection algorithm for choosing packages relies on package owners using semver correctly.
- Maven: Maven predates Semver 2.0 and older versions of Maven have an idiosyncratic and complex version identification method. Since Maven 3.0, however, package owners are encouraged to follow semver. Compliance varies.
- npm: The Node ecosystem supports the use of Semver 2.0, but again there are no guarantees and significant non-compliant history.
- PyPI: PyPI predates Semver 2.0. Its approach to versioning is specified by PEP 440, which supports a broad range of numbering schemes.
- NuGet: NuGet follows Semver 2.0 with a few minor exceptions. To be compatible with the System.Version .NET class, an optional fourth part is allowed for identifying the revision.
This list is only a summary. Systems have historical packages, support for variant notations, special ordering rules, and other complexities. The creation of Semver 2.0 was likely motivated by a desire to bring consistency to versioning across the open source environment, but there remains a long way to go. In current practice, though, all systems seem to be moving towards adopting Semver 2.0 as the standard. Despite this progress towards defining versioning rigorously, there has been little progress towards a standard for requirement specification.
A System defines a particular ecosystem developed over a public packaging system such as npm. For example, in Insights the term Cargo, shorthand for the Cargo Crates system, refers to the particular data set that Insights holds to represent the packages managed by the external Cargo website. Depending on matters such as freshness and completeness, Insights’s view of a system may not be in perfect alignment with the external view, but the aim is convergence over time.
There are two categories of system: repository-based, in which the “ground truth” copy of the package is stored in a (typically) Git repository on a hosting site such as GitHub, and package-based, in which the released copy of the package is managed by a separate packaging site such as npm’s npmjs.org. For some systems, such as Maven, Insights’s view may be only a subset of all packages in the corresponding language or ecosystem due to a multiplicity of hosting sites.
A version of a package or module is its source code and associated artifacts (.jar files, for example) frozen at some instant and identified by a version specifier, which may be explicitly defined such as by a semver string (1.2.3, for example), or by a commit hash or other VCS-derived identifier.
Once a version of a package is created and named, it is usually immutable, except perhaps for some metadata, although there are exceptions. Later versions may be created for the package, but the particular copy of the package identified by the version is set in stone. Similarly, a version identified by a Git commit hash is also fixed.
However, some versions are discovered indirectly by a reference name that must be evaluated to find the version. Version identifiers such as “latest” denote a specific version of the package at the time they are evaluated, but the actual contents of the package—the immutable version—may be different the next time that identifier is evaluated.
It can also happen that a semver-tagged version of a package breaks compatibility without updating the major version number. This is poor practice—one should bump the number when this happens—but if there is no guarantee that the number and contents are updated in synchrony, there can be drift. For well-maintained packages this situation should never occur.
A version specifier is a string that identifies a particular version of a package. The most common type of specifier is a Semver 2.0 string, which in its most common form has the familiar style of a dotted string of three numbers, such as 2.14.3. The first number is the major number, corresponding to a range of versions that are upward compatible with one another. The second number is the minor number, specifying a range of versions with the same major number that are upward compatible with one another. The third number is a patch, which signifies updates to the software that have no bearing on compatibility.
Looking at these numbers another way, the major number increases when incompatible features are introduced, the minor number increases when new features are added without compatibility issues, and the patch number increases when the software is updated but the feature set is unaffected.
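The incrementing rules can be written down as a small Python function. This is a sketch for plain three-part versions only, with a hypothetical function name; it follows the Semver 2.0 convention that lower-order numbers reset to zero when a higher-order number increases.

```python
def bump(version, change):
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # incompatible features were introduced
        return f"{major + 1}.0.0"
    if change == "minor":   # features were added compatibly
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # the feature set is unaffected
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")
```

Starting from 2.14.3, a patch gives 2.14.4, a minor update gives 2.15.0, and a major update gives 3.0.0.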
Semver also allows for tags called prerelease strings that indicate a version is being prepared for release.
The semver specification includes the rules for a total ordering of version specifiers, allowing an operator in a version requirement to identify a well-defined range of versions.
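The ordering rules can be sketched as a Python sort key. This follows the precedence rules as stated in the Semver 2.0 specification (numeric comparison of the three numbers, prereleases sorting before the corresponding release, build metadata ignored), but it is an illustration that assumes well-formed input and does no validation; the function name is hypothetical.

```python
def precedence_key(version):
    """Build a sort key that orders versions by Semver 2.0 precedence."""
    core, _, _build = version.partition("+")   # build metadata is ignored
    core, _, prerelease = core.partition("-")  # split at the first hyphen
    numbers = tuple(int(part) for part in core.split("."))
    if not prerelease:
        # A normal release sorts after all of its prereleases.
        return (numbers, 1, ())
    identifiers = tuple(
        # Numeric identifiers compare as numbers and sort before
        # alphanumeric ones, which compare as ASCII strings.
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)
        for ident in prerelease.split(".")
    )
    # Tuple comparison makes a longer identifier list sort after a
    # shorter one when all the shared identifiers are equal.
    return (numbers, 0, identifiers)


versions = ["1.0.0", "1.0.0-beta.11", "1.0.0-alpha", "1.0.0-rc.1",
            "1.0.0-beta.2", "1.0.0-alpha.beta", "1.0.0-beta", "1.0.0-alpha.1"]
ordered = sorted(versions, key=precedence_key)
```

Sorting with this key places 1.0.0-alpha first and the release 1.0.0 last, and it orders 1.0.0-beta.2 before 1.0.0-beta.11 because numeric identifiers are compared as numbers rather than as text.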
Note that semantic versioning is an ideal, not a guarantee. It is the package owner’s responsibility to update the version specifier appropriately as features are added or updated, but systems do not tend to verify compliance. Downstream software such as the resolution algorithm depends on compliance, however.
Not all systems honor semver, and even those that do tend to deviate a little. Earlier versions of Maven used a very different notation, although Maven documentation advocates for semver nowadays. All systems must however define the total ordering of versions so that requirement resolution can result in a correct build.
A version may also be specified by a symbolic reference such as “main” or “head” in a Git repository, indicating that the version is whatever commit the reference refers to at the time it is evaluated. This behavior is different from a specifier, where a version string identifies a fixed version.
A version may also be specified by a unique identifier such as a Git commit hash. Although such a specifier is not covered by semver, it is unambiguous, unlike a symbolic reference such as “main”.
A vulnerability, also called a security advisory, is a security-related defect such as is stored in the OSV database or other public repository of security issues. Non-security issues, such as those usually tracked in a project’s own issue tracker, are not tracked by Insights.
Insights tracks the OSV database, updating the information several times each day.