deps.dev BigQuery dataset

Overview

The deps.dev BigQuery dataset provides signals that both maintainers and consumers of open-source software (OSS) can use to improve the health of their software.

The dataset is provided as a BigQuery public dataset.

Getting started

Prerequisites

To use the BigQuery dataset, you will need a Google Cloud account with the BigQuery API enabled. If you already use BigQuery, feel free to skip this section.

  1. Sign in to your Google Cloud account. If you’re new to Google Cloud, create an account.

  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Ensure the BigQuery API is enabled. It will be enabled by default in new projects.

  4. Optional: Enable billing for the project. You will not necessarily be charged because BigQuery provides a free usage tier detailed in the pricing documentation. If you do not want to enable billing, you can still explore the data using the BigQuery sandbox, but some queries will fail if the sandbox cannot guarantee they will be below the free usage limits.

  5. Optional: Get familiar with BigQuery in the Google Cloud console, how to query public datasets and the fundamentals of BigQuery analytics.

Sample queries

This section contains sample queries that demonstrate some of the ways the dataset can be used.

Using the BigQuery Sandbox it is possible to run small queries and preview table contents without providing a credit card or setting up a billing account. See the prerequisites for more details.

Find advisories by package version

This example demonstrates searching for advisories that disclose vulnerabilities in a set of package versions, such as the dependencies of your application. Using one of the BigQuery client libraries this type of query could be integrated into an automated test for known vulnerabilities in your project’s dependencies.

First we will construct a temporary table that contains the set of package versions that we wish to query. Then to gather the advisories that affected them we join the PackageVersionsLatest view with our temporary table. We will also use the comma operator to unnest the list of advisories that may affect each of these package versions.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  p.System,
  p.Name,
  p.Version,
  Advisory
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Advisories AS Advisory
JOIN
  Packages
USING
  (System, Name, Version);

In addition to the source and id of each advisory we can fetch its severity and title. To do this we will join with the AdvisoriesLatest view, and aggregate the advisories that affect a single package version into an array of records.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  p.System,
  p.Name,
  p.Version,
  ARRAY_AGG(
    STRUCT(
      a.Source,
      a.SourceID,
      a.Title,
      a.Severity,
      a.GitHubSeverity)) AS Advisories
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Advisories AS Advisory
JOIN
  Packages
USING
  (System, Name, Version)
JOIN
  `bigquery-public-data.deps_dev_v1.AdvisoriesLatest` AS a
ON
  a.Source = Advisory.Source
  AND a.SourceID = Advisory.SourceID
GROUP BY
  System,
  Name,
  Version;

Find licenses by package version

This example demonstrates searching for a set of licenses used by a set of package versions. Like in the previous example this set of package versions may be you application’s dependencies, and you may wish to check the licenses are compatible with your products distribution.

The query has a very similar structure to that used in the previous example. The license information is taken from the PackageVersionsLatest view. But in this query there is no need to join with the Advisories table.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  License,
  ARRAY_AGG(
    STRUCT(
      p.System,
      p.Name,
      p.Version)) AS Packages
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Licenses AS License
JOIN
  Packages
USING
  (System, Name, Version)
GROUP BY
  License;

Finding common licenses

This example computes the three most common licenses per system. It is intended to demonstrate simple ecosystem-wide analysis with the deps.dev dataset.

Each version of each package can specify multiple licenses. This is found in the PackageVersions table which contains a column ‘Licenses’ containing an array of licenses. To count the number of distinct packages using each license within each system we unnest the array and count the distinct package names among the resulting rows:

SELECT
  p.System,
  License,
  COUNT(DISTINCT p.Name) AS NPackages
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Licenses AS License
GROUP BY
  System,
  License;

If we wish to see just the top 3 licenses per system we can make use of BigQuery’s window functions to compute a ranking:

WITH
  Counts AS (
    SELECT
      p.System,
      License,
      COUNT(DISTINCT p.Name) AS NPackages
    FROM
      `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
      p.Licenses AS License
    GROUP BY
      System,
      License),

  Ranked AS (
    SELECT
      System,
      License,
      NPackages,
      ROW_NUMBER()
        OVER (PARTITION BY System ORDER BY NPackages DESC) AS LicenseRank
    FROM
      Counts)

SELECT
  System,
  License,
  NPackages
FROM
  Ranked
WHERE
  LicenseRank <= 3
ORDER BY
  System,
  LicenseRank;

Direct importers of a package

Using the NuGetRequirementsLatest view we can determine which NuGet packages have specified a given package as a requirement.

For ecosystems where dependencies are supported, you can use the DependentsLatest view instead to find importers of a package.

DECLARE
  PackageName STRING DEFAULT 'castle.core';

SELECT
  R.Name AS Importer,
  R.Version As ImporterVersion,
  DG.TargetFramework,
  RS.Requirement
FROM
  `deps-dev-insights.v1.NuGetRequirementsLatest` AS R
CROSS JOIN
  UNNEST(R.DependencyGroups) AS DG
CROSS JOIN
  UNNEST(DG.Dependencies) AS RS
WHERE
  RS.Name = PackageName
ORDER BY
  Importer,
  ImporterVersion,
  RS.Requirement,
  DG.TargetFramework;

Dependent count

Using the dependency graphs we can determine which packages have the most dependents within their ecosystem.

In order to compute a count that reflects the current state of the ecosystem we only count the highest release of each package. Here ‘highest’ is determined by the order defined by semantic versioning or the system equivalent. So our query will first filter the set of all available package versions to just the highest release per version. This can be achieved by making use of the VersionInfo column in the PackageVersions table. This column contains information extracted from the version string, allowing us to know which versions are releases and the position of each version in the ordered list of all versions for the corresponding package.

DECLARE
  Sys STRING DEFAULT 'CARGO';

SELECT
  Name,
  Version,
FROM (
  SELECT
    Name,
    Version,
    ROW_NUMBER()
      OVER (PARTITION BY
              Name
            ORDER BY
              VersionInfo.Ordinal DESC) AS RowNumber
    FROM
      `bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
    WHERE
      System = Sys
      AND VersionInfo.IsRelease)
WHERE RowNumber = 1;

We will count how many of these ‘highest release’ versions depend on each version available. This can be achieved by using the Dependencies table which contains a relation between pairs of package versions, representing the transitive closure of each dependency graph.

DECLARE
  Sys STRING DEFAULT 'CARGO';

WITH
  HighestReleases AS (
    SELECT
      Name,
      Version,
    FROM (
      SELECT
        Name,
        Version,
        ROW_NUMBER()
          OVER (PARTITION BY
                  Name
                ORDER BY
                  VersionInfo.Ordinal DESC) AS RowNumber
        FROM
          `bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
        WHERE
          System = Sys
          AND VersionInfo.IsRelease)
    WHERE RowNumber = 1)

SELECT
  D.Dependency.Name,
  D.Dependency.Version,
  COUNT(*) AS NDependents
FROM
  `bigquery-public-data.deps_dev_v1.DependenciesLatest` AS D
JOIN
  HighestReleases AS H
USING
  (Name, Version)
WHERE
  D.System = Sys
GROUP BY
  D.Dependency.Name,
  D.Dependency.Version
ORDER BY
  NDependents DESC
LIMIT
  10;

Schema

This section describes the schema used for the tables and views in the dataset.

Advisories

This table stores information about security advisories related to packages from any of npm, Go, Maven, PyPI, Cargo and NuGet. Each row represents a single advisory and can be identified by Source and SourceID.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Source: STRING (REQUIRED)
The upstream source of this advisory, example values include 'GHSA', 'NSWG' and 'OSV'.
SourceID: STRING (REQUIRED)
The id of this advisory within its upstream source.
SourceURL: STRING (REQUIRED)
A link to this advisory in its upstream source.
Title: STRING
Brief human-readable description of the advisory.
Description: STRING
A description of the advisory.
ReferenceURLs: STRING (REPEATED)
URLs that elaborate on an advisory: longer descriptions, corroborating evidence, patches, issue trackers, release notes and so on.
CVSS3Score: NUMERIC
A numeric measure of severity according to the CVSS3 spec. The measure ranges from 0-10 and is to 1 decimal place.
Severity: STRING
A discrete scale of the severity, derived from the CVSS3Score. Example values include 'NONE', 'LOW', 'MEDIUM', 'HIGH', 'CRITICAL', 'UNKNOWN'.
GitHubSeverity: STRING
The severity of this advisory as determined by GitHub. Example values include 'LOW', 'MODERATE', 'HIGH', 'CRITICAL', 'UNKNOWN'.
Disclosed: TIMESTAMP
The time at which this advisory was publicly disclosed.
Aliases: STRING (REPEATED)
Other identifiers used for the advisory, including CVEs.
Packages: RECORD (REPEATED)
An array of the packages affected by this advisory.
PackagesSystem: STRING (REQUIRED)
The system of the affected package
PackagesName: STRING (REQUIRED)
The name of the affected package
PackagesAffectedVersions: STRING (REQUIRED)
The versions of the package affected by the advisory
PackagesUnaffectedVersions: STRING (REQUIRED)
The versions of the package unaffected by the advisory

AdvisoriesLatest

This view contains the most recent snapshot in the Advisories table.

Dependencies

This table stores information about the dependency relationship between versions of packages. Each row represents a dependency of one package-version importing (depending on) another. These dependencies include both direct and indirect dependencies.

The dependencies are determined by computing a resolution of the requirements (dependency constraints) specified by a version. Each dependency graph should be similar to one produced by installing the corresponding package version on a generic 64-bit Linux system, with no other dependencies present. The precise meaning of this varies from system to system.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the importing package-version in the relation.
Name: STRING (REQUIRED)
The name of the importing package-version in the relation.
Version: STRING (REQUIRED)
The version of the importing package-version in the relation.
Dependency: RECORD (REQUIRED)
The imported package-version.
DependencySystem: STRING (REQUIRED)
The system of the imported package-version.
DependencyName: STRING (REQUIRED)
The name of the imported package-version.
DependencyVersion: STRING (REQUIRED)
The version of the imported package-version.
MinimumDepth: INTEGER
The minimum depth of this dependency from the root.

DependenciesLatest

This view contains the most recent snapshot in the Dependencies table.

DependencyGraphEdges

This table stores the full dependency graph for each version of each package. Each row represents one edge in a dependency graph. Dependency graphs are only available for npm, Go, Maven, PyPI and Cargo.

The dependency graphs are determined by computing a resolution of the requirements (dependency constraints) specified by a version. Each dependency graph should be similar to one produced by installing the package version on a generic 64-bit Linux system, with no other dependencies present. The precise meaning of this varies from system to system.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version at the root of the dependency graph.
Name: STRING (REQUIRED)
The name of the package-version at the root of the dependency graph.
Version: STRING (REQUIRED)
The version of the package-version at the root of the dependency graph.
Requirement: STRING (REQUIRED)
A string to represent how this particular dependency was required. Its format is system specific.
From: RECORD (REQUIRED)
The importing package-version.
FromSystem: STRING (REQUIRED)
The system of the importing package-version.
FromName: STRING (REQUIRED)
The name of the importing package-version.
FromVersion: STRING (REQUIRED)
The version of the importing package-version.
To: RECORD (REQUIRED)
The imported package-version.
ToSystem: STRING (REQUIRED)
The system of the imported package-version.
ToName: STRING (REQUIRED)
The name of the imported package-version.
ToVersion: STRING (REQUIRED)
The version of the imported package-version.

DependencyGraphEdgesLatest

This view contains the most recent snapshot in the DependencyGraphEdges table.

Dependents

This table stores information about the dependency relationship between versions of packages. Each row represents a dependency of one package version importing (depending on) another. These dependencies include both direct and indirect dependencies.

The table is clustered on the System, Name and Version of the imported package, allowing for efficient lookup of all versions of all packages that import a specific version of a package. Dependents are only available for npm, Go, Maven, PyPI and Cargo.

The dependent relation is computed by inverting the data in the Dependencies table.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the imported package version in the relation.
Name: STRING (REQUIRED)
The name of the imported package version in the relation.
Version: STRING (REQUIRED)
The version of the imported package version in the relation.
Dependent: RECORD (REQUIRED)
The importing package version.
DependentSystem: STRING (REQUIRED)
The system of the importing package version.
DependentName: STRING (REQUIRED)
The name of the importing package version.
DependentVersion: STRING (REQUIRED)
The version of the importing package version.
MinimumDepth: INTEGER
The minimum depth of this dependency from the root.
DependentIsHighestReleaseWithResolution: BOOLEAN
This value is true if the version of the dependent was the highest release of the respective package with a dependency resolution available at the time of snapshot.

DependentsLatest

This view contains the most recent snapshot in the Dependents table.

NuGetRequirements

This table stores the requirements for NuGet package-versions. Note that the term ‘dependency’ is used within this table to mean ‘a single unresolved requirement’ to be consistent with how the term is used in the NuGet ecosystem. This is different to how it is used elsewhere in this schema.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.
DependencyGroups: RECORD (REPEATED)
The list of requirements grouped by target framework.
DependencyGroupsTargetFramework: STRING (REQUIRED)
The target framework that this dependency group is for.
DependencyGroupsDependencies: RECORD (REPEATED)
The requirements for this dependency group.
DependencyGroupsDependenciesName: STRING (REQUIRED)
The name of the package.
DependencyGroupsDependenciesRequirement: STRING (REQUIRED)
The requirement on the package.

NuGetRequirementsLatest

This view contains the most recent snapshot in the NuGetRequirements table.

PackageVersionHashes

This table can be used to find package-versions from a given hash. Each row represents a single (hash, package-version) pair.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Hash: STRING (REQUIRED)
A base64 encoded string of the hash.
HashType: STRING (REQUIRED)
The type of hash, example values include 'SHA512' and 'SHA1'.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.

PackageVersionHashesLatest

This view contains the most recent snapshot in the PackageVersionHashes table.

PackageVersionToProject

This table represents the relation between package-versions and project. Each row represents a single pair of package-version and project. This relation is not guaranteed to be authoritative as the owner of a package may list a link to any project and the resulting relationship is not verified.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.
ProjectType: STRING (REQUIRED)
The type of the project.
ProjectName: STRING (REQUIRED)
The name of the project.

PackageVersionToProjectLatest

This view contains the most recent snapshot in the PackageVersionToProject table.

PackageVersions

This table stores information about specific versions of packages. Each row represents a single version of a package.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.
Licenses: STRING (REPEATED)
An array of the licenses used by this package-version, presented as SPDX expressions.
Registries: STRING (REPEATED)
An array of the package management registries this package-version is available from.
Links: RECORD (REPEATED)
An array of links related to this package-version, each consisting of a Label and URL.
LinksLabel: STRING (REQUIRED)
A label for the corresponding URL value. Example values include 'DOCUMENTATION', 'ISSUE_TRACKER', 'HOMEPAGE' and 'SOURCE_REPO'.
LinksURL: STRING (REQUIRED)
The URL value.
Advisories: RECORD (REPEATED)
An array of Advisories that affect this package-version directly. This can be used to join against the Advisories table.
AdvisoriesSource: STRING (REQUIRED)
The upstream source of this advisory, example values include 'OSV', 'GHSA' and 'NSWG'.
AdvisoriesSourceID: STRING (REQUIRED)
The id of this advisory within its upstream source.
VersionInfo: RECORD
Information extracted from the string value in the Version column of this row.
VersionInfoIsRelease: BOOLEAN (REQUIRED)
The value is true if the version is considered a release, as opposed to a pre-release.
VersionInfoOrdinal: INTEGER (REQUIRED)
The value allows the versions of a package to be sorted in semantic order without parsing the version string. If a package has its versions sorted in ascending order by VersionInfo.Ordinal, the order will correspond to the order of the version strings as determined by the package management system.
Hashes: RECORD (REPEATED)
An array of hashes that have been observed from this package-version.
HashesType: STRING (REQUIRED)
The type of hash, example values include 'SHA512' and 'SHA1'.
HashesHash: STRING (REQUIRED)
A base64 encoded string of the hash.
UpstreamPublishedAt: TIMESTAMP
The time at which this package-version was published, as reported by the upstream source.
SLSAProvenance: RECORD
Provenance information for this package version. Extracted from a SLSA provenance attestation.
SLSAProvenanceSourceRepository: STRING (REQUIRED)
The source code repository used to build this version.
SLSAProvenanceCommit: STRING (REQUIRED)
The commit of the source code repository the version was built from.
SLSAProvenanceURL: STRING (REQUIRED)
The URL of the provenance statement.
SLSAProvenanceVerified: BOOLEAN (REQUIRED)
The Sigstore bundle containing this attestation was verified using the [sigstore-go](https://github.com/sigstore/sigstore-go) library.
DependenciesProcessed: BOOLEAN
True if this version's dependency requirements have been processed.
DependencyError: BOOLEAN
True if an error was encountered while processing this version's dependency requirements.

PackageVersionsLatest

This view contains the most recent snapshot in the PackageVersions table.

Projects

This table stores information about projects. Each row represents a single project.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Type: STRING (REQUIRED)
The type of the project, example values include 'APACHE_JIRA', 'BITBUCKET', 'GITHUB' and 'GITLAB'.
Name: STRING (REQUIRED)
The name of the project.
OpenIssuesCount: INTEGER
The number of open issues.
StarsCount: INTEGER
The number of stars in the upstream source.
ForksCount: INTEGER
The number of known forks.
Licenses: STRING (REPEATED)
A list of licenses for the project, presented as SPDX expressions.
Description: STRING
A human readable description of the project.
Homepage: STRING
A link to the homepage of the project.
OSSFuzz: RECORD
Details of this project’s testing by the OSS-Fuzz service. Only set if the project is tested by OSS-Fuzz.
OSSFuzzLineCount: INTEGER
The number of lines of code in the project.
OSSFuzzLineCoverCount: INTEGER
How many lines of code are covered by fuzzing.
OSSFuzzDate: TIMESTAMP
The date the fuzz test that produced the coverage information was run against this project. The time portion of this field is midnight UTC.
OSSFuzzConfigURL: STRING (REQUIRED)
The URL containing the configuration for the project in the OSS-Fuzz repository.

ProjectsLatest

This view contains the most recent snapshot in the Projects table.

Snapshots

This table contains the timestamps for which a full snapshot of all other tables exists.

Time: TIMESTAMP (REQUIRED)
The time at which a snapshot of all other tables was taken.

Data

deps.dev aggregates data from a number of sources:

For details on using the data from these sources, please consult their documentation.

As well as aggregating data, deps.dev generates additional data, including resolved dependencies, advisory statistics, associations between entities, etc. This generated data is available under a CC-BY 4.0 license.

Contact us

If you have questions about the dataset, or want to report a problem, please create an issue or contact us at depsdev@google.com.