deps.dev BigQuery dataset
Overview
The deps.dev BigQuery dataset provides signals that both maintainers and consumers of open-source software (OSS) can use to improve the health of their software.
The dataset is provided as a BigQuery public dataset.
- Get started with this dataset
- Detailed schema of the available tables
- Advisories
- AdvisoriesLatest
- Dependencies
- DependenciesLatest
- DependencyGraphEdges
- DependencyGraphEdgesLatest
- Dependents
- DependentsLatest
- NuGetRequirements
- NuGetRequirementsLatest
- PackageVersionHashes
- PackageVersionHashesLatest
- PackageVersionToProject
- PackageVersionToProjectLatest
- PackageVersions
- PackageVersionsLatest
- Projects
- ProjectsLatest
- Snapshots
- Data
- Contact us
Getting started
Prerequisites
To use the BigQuery dataset, you will need a Google Cloud account with the BigQuery API enabled. If you already use BigQuery, feel free to skip this section.
-
Sign in to your Google Cloud account. If you’re new to Google Cloud, create an account.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Ensure the BigQuery API is enabled. It will be enabled by default in new projects.
-
Optional: Enable billing for the project. You will not necessarily be charged because BigQuery provides a free usage tier detailed in the pricing documentation. If you do not want to enable billing, you can still explore the data using the BigQuery sandbox, but some queries will fail if the sandbox cannot guarantee they will be below the free usage limits.
-
Optional: Get familiar with BigQuery in the Google Cloud console, how to query public datasets and the fundamentals of BigQuery analytics.
Sample queries
This section contains sample queries that demonstrate some of the ways the dataset can be used.
Using the BigQuery Sandbox it is possible to run small queries and preview table contents without providing a credit card or setting up a billing account. See the prerequisites for more details.
Find advisories by package version
This example demonstrates searching for advisories that disclose vulnerabilities in a set of package versions, such as the dependencies of your application. Using one of the BigQuery client libraries this type of query could be integrated into an automated test for known vulnerabilities in your project’s dependencies.
First we will construct a temporary table that contains the set of package versions that we wish to query. Then to gather the advisories that affected them we join the PackageVersionsLatest view with our temporary table. We will also use the comma operator to unnest the list of advisories that may affect each of these package versions.
CREATE TEMP TABLE Packages (
System string,
Name string,
Version string)
CLUSTER BY
System,
Name,
Version;
INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
('NPM', 'react-dom', '16.0.0'),
('NPM', 'react-is', '16.13.1'),
('NPM', 'react-router', '5.2.1'),
('NPM', 'qs', '6.3.1'),
('NPM', 'send', '0.1.0');
SELECT
p.System,
p.Name,
p.Version,
Advisory
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
p.Advisories AS Advisory
JOIN
Packages
USING
(System, Name, Version);
In addition to the source and id of each advisory we can fetch its severity and title. To do this we will join with the AdvisoriesLatest view, and aggregate the advisories that affect a single package version into an array of records.
CREATE TEMP TABLE Packages (
System string,
Name string,
Version string)
CLUSTER BY
System,
Name,
Version;
INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
('NPM', 'react-dom', '16.0.0'),
('NPM', 'react-is', '16.13.1'),
('NPM', 'react-router', '5.2.1'),
('NPM', 'qs', '6.3.1'),
('NPM', 'send', '0.1.0');
SELECT
p.System,
p.Name,
p.Version,
ARRAY_AGG(
STRUCT(
a.Source,
a.SourceID,
a.Title,
a.Severity,
a.GitHubSeverity)) AS Advisories
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
p.Advisories AS Advisory
JOIN
Packages
USING
(System, Name, Version)
JOIN
`bigquery-public-data.deps_dev_v1.AdvisoriesLatest` AS a
ON
a.Source = Advisory.Source
AND a.SourceID = Advisory.SourceID
GROUP BY
System,
Name,
Version;
Find licenses by package version
This example demonstrates searching for a set of licenses used by a set of package versions. Like in the previous example this set of package versions may be you application’s dependencies, and you may wish to check the licenses are compatible with your products distribution.
The query has a very similar structure to that used in the previous example. The license information is taken from the PackageVersionsLatest view. But in this query there is no need to join with the Advisories table.
CREATE TEMP TABLE Packages (
System string,
Name string,
Version string)
CLUSTER BY
System,
Name,
Version;
INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
('NPM', 'react-dom', '16.0.0'),
('NPM', 'react-is', '16.13.1'),
('NPM', 'react-router', '5.2.1'),
('NPM', 'qs', '6.3.1'),
('NPM', 'send', '0.1.0');
SELECT
License,
ARRAY_AGG(
STRUCT(
p.System,
p.Name,
p.Version)) AS Packages
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
p.Licenses AS License
JOIN
Packages
USING
(System, Name, Version)
GROUP BY
License;
Finding common licenses
This example computes the three most common licenses per system. It is intended to demonstrate simple ecosystem-wide analysis with the deps.dev dataset.
Each version of each package can specify multiple licenses. This is found in the PackageVersions table which contains a column ‘Licenses’ containing an array of licenses. To count the number of distinct packages using each license within each system we unnest the array and count the distinct package names among the resulting rows:
SELECT
p.System,
License,
COUNT(DISTINCT p.Name) AS NPackages
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
p.Licenses AS License
GROUP BY
System,
License;
If we wish to see just the top 3 licenses per system we can make use of BigQuery’s window functions to compute a ranking:
WITH
Counts AS (
SELECT
p.System,
License,
COUNT(DISTINCT p.Name) AS NPackages
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
p.Licenses AS License
GROUP BY
System,
License),
Ranked AS (
SELECT
System,
License,
NPackages,
ROW_NUMBER()
OVER (PARTITION BY System ORDER BY NPackages DESC) AS LicenseRank
FROM
Counts)
SELECT
System,
License,
NPackages
FROM
Ranked
WHERE
LicenseRank <= 3
ORDER BY
System,
LicenseRank;
Direct importers of a package
Using the NuGetRequirementsLatest view we can determine which NuGet packages have specified a given package as a requirement.
For ecosystems where dependencies are supported, you can use the DependentsLatest view instead to find importers of a package.
DECLARE
PackageName STRING DEFAULT 'castle.core';
SELECT
R.Name AS Importer,
R.Version As ImporterVersion,
DG.TargetFramework,
RS.Requirement
FROM
`deps-dev-insights.v1.NuGetRequirementsLatest` AS R
CROSS JOIN
UNNEST(R.DependencyGroups) AS DG
CROSS JOIN
UNNEST(DG.Dependencies) AS RS
WHERE
RS.Name = PackageName
ORDER BY
Importer,
ImporterVersion,
RS.Requirement,
DG.TargetFramework;
Dependent count
Using the dependency graphs we can determine which packages have the most dependents within their ecosystem.
In order to compute a count that reflects the current state of the ecosystem we only count the highest release of each package. Here ‘highest’ is determined by the order defined by semantic versioning or the system equivalent. So our query will first filter the set of all available package versions to just the highest release per version. This can be achieved by making use of the VersionInfo column in the PackageVersions table. This column contains information extracted from the version string, allowing us to know which versions are releases and the position of each version in the ordered list of all versions for the corresponding package.
DECLARE
Sys STRING DEFAULT 'CARGO';
SELECT
Name,
Version,
FROM (
SELECT
Name,
Version,
ROW_NUMBER()
OVER (PARTITION BY
Name
ORDER BY
VersionInfo.Ordinal DESC) AS RowNumber
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
WHERE
System = Sys
AND VersionInfo.IsRelease)
WHERE RowNumber = 1;
We will count how many of these ‘highest release’ versions depend on each version available. This can be achieved by using the Dependencies table which contains a relation between pairs of package versions, representing the transitive closure of each dependency graph.
DECLARE
Sys STRING DEFAULT 'CARGO';
WITH
HighestReleases AS (
SELECT
Name,
Version,
FROM (
SELECT
Name,
Version,
ROW_NUMBER()
OVER (PARTITION BY
Name
ORDER BY
VersionInfo.Ordinal DESC) AS RowNumber
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
WHERE
System = Sys
AND VersionInfo.IsRelease)
WHERE RowNumber = 1)
SELECT
D.Dependency.Name,
D.Dependency.Version,
COUNT(*) AS NDependents
FROM
`bigquery-public-data.deps_dev_v1.DependenciesLatest` AS D
JOIN
HighestReleases AS H
USING
(Name, Version)
WHERE
D.System = Sys
GROUP BY
D.Dependency.Name,
D.Dependency.Version
ORDER BY
NDependents DESC
LIMIT
10;
Schema
This section describes the schema used for the tables and views in the dataset.
Advisories
This table stores information about security advisories related to packages from any of npm, Go, Maven, PyPI, Cargo and NuGet. Each row represents a single advisory and can be identified by Source and SourceID.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- Source: STRING (REQUIRED)
- The upstream source of this advisory, example values include 'GHSA', 'NSWG' and 'OSV'.
- SourceID: STRING (REQUIRED)
- The id of this advisory within its upstream source.
- SourceURL: STRING (REQUIRED)
- A link to this advisory in its upstream source.
- Title: STRING
- Brief human-readable description of the advisory.
- Description: STRING
- A description of the advisory.
- ReferenceURLs: STRING (REPEATED)
- URLs that elaborate on an advisory: longer descriptions, corroborating evidence, patches, issue trackers, release notes and so on.
- CVSS3Score: NUMERIC
- A numeric measure of severity according to the CVSS3 spec. The measure ranges from 0-10 and is to 1 decimal place.
- Severity: STRING
- A discrete scale of the severity, derived from the CVSS3Score. Example values include 'NONE', 'LOW', 'MEDIUM', 'HIGH', 'CRITICAL', 'UNKNOWN'.
- GitHubSeverity: STRING
- The severity of this advisory as determined by GitHub. Example values include 'LOW', 'MODERATE', 'HIGH', 'CRITICAL', 'UNKNOWN'.
- Disclosed: TIMESTAMP
- The time at which this advisory was publicly disclosed.
- Aliases: STRING (REPEATED)
- Other identifiers used for the advisory, including CVEs.
- Packages: RECORD (REPEATED)
- An array of the packages affected by this advisory.
- Packages.System: STRING (REQUIRED)
- The system of the affected package
- Packages.Name: STRING (REQUIRED)
- The name of the affected package
- Packages.AffectedVersions: STRING (REQUIRED)
- The versions of the package affected by the advisory
- Packages.UnaffectedVersions: STRING (REQUIRED)
- The versions of the package unaffected by the advisory
AdvisoriesLatest
This view contains the most recent snapshot in the Advisories table.
Dependencies
This table stores information about the dependency relationship between versions of packages. Each row represents a dependency of one package-version importing (depending on) another. These dependencies include both direct and indirect dependencies.
The dependencies are determined by computing a resolution of the requirements (dependency constraints) specified by a version. Each dependency graph should be similar to one produced by installing the corresponding package version on a generic 64-bit Linux system, with no other dependencies present. The precise meaning of this varies from system to system.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- System: STRING (REQUIRED)
- The dependency management system of the importing package-version in the relation.
- Name: STRING (REQUIRED)
- The name of the importing package-version in the relation.
- Version: STRING (REQUIRED)
- The version of the importing package-version in the relation.
- Dependency: RECORD (REQUIRED)
- The imported package-version.
- Dependency.System: STRING (REQUIRED)
- The system of the imported package-version.
- Dependency.Name: STRING (REQUIRED)
- The name of the imported package-version.
- Dependency.Version: STRING (REQUIRED)
- The version of the imported package-version.
- MinimumDepth: INTEGER
- The minimum depth of this dependency from the root.
DependenciesLatest
This view contains the most recent snapshot in the Dependencies table.
DependencyGraphEdges
This table stores the full dependency graph for each version of each package. Each row represents one edge in a dependency graph. Dependency graphs are only available for npm, Go, Maven, PyPI and Cargo.
The dependency graphs are determined by computing a resolution of the requirements (dependency constraints) specified by a version. Each dependency graph should be similar to one produced by installing the package version on a generic 64-bit Linux system, with no other dependencies present. The precise meaning of this varies from system to system.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- System: STRING (REQUIRED)
- The dependency management system of the package-version at the root of the dependency graph.
- Name: STRING (REQUIRED)
- The name of the package-version at the root of the dependency graph.
- Version: STRING (REQUIRED)
- The version of the package-version at the root of the dependency graph.
- Requirement: STRING (REQUIRED)
- A string to represent how this particular dependency was required. Its format is system specific.
- From: RECORD (REQUIRED)
- The importing package-version.
- From.System: STRING (REQUIRED)
- The system of the importing package-version.
- From.Name: STRING (REQUIRED)
- The name of the importing package-version.
- From.Version: STRING (REQUIRED)
- The version of the importing package-version.
- To: RECORD (REQUIRED)
- The imported package-version.
- To.System: STRING (REQUIRED)
- The system of the imported package-version.
- To.Name: STRING (REQUIRED)
- The name of the imported package-version.
- To.Version: STRING (REQUIRED)
- The version of the imported package-version.
DependencyGraphEdgesLatest
This view contains the most recent snapshot in the DependencyGraphEdges table.
Dependents
This table stores information about the dependency relationship between versions of packages. Each row represents a dependency of one package version importing (depending on) another. These dependencies include both direct and indirect dependencies.
The table is clustered on the System, Name and Version of the imported package, allowing for efficient lookup of all versions of all packages that import a specific version of a package. Dependents are only available for npm, Go, Maven, PyPI and Cargo.
The dependent relation is computed by inverting the data in the Dependencies table.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- System: STRING (REQUIRED)
- The dependency management system of the imported package version in the relation.
- Name: STRING (REQUIRED)
- The name of the imported package version in the relation.
- Version: STRING (REQUIRED)
- The version of the imported package version in the relation.
- Dependent: RECORD (REQUIRED)
- The importing package version.
- Dependent.System: STRING (REQUIRED)
- The system of the importing package version.
- Dependent.Name: STRING (REQUIRED)
- The name of the importing package version.
- Dependent.Version: STRING (REQUIRED)
- The version of the importing package version.
- MinimumDepth: INTEGER
- The minimum depth of this dependency from the root.
- DependentIsHighestReleaseWithResolution: BOOLEAN
- This value is true if the version of the dependent was the highest release of the respective package with a dependency resolution available at the time of snapshot.
DependentsLatest
This view contains the most recent snapshot in the Dependents table.
NuGetRequirements
This table stores the requirements for NuGet package-versions. Note that the term ‘dependency’ is used within this table to mean ‘a single unresolved requirement’ to be consistent with how the term is used in the NuGet ecosystem. This is different to how it is used elsewhere in this schema.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- Name: STRING (REQUIRED)
- The name of the package-version.
- Version: STRING (REQUIRED)
- The version of the package-version.
- DependencyGroups: RECORD (REPEATED)
- The list of requirements grouped by target framework.
- DependencyGroups.TargetFramework: STRING (REQUIRED)
- The target framework that this dependency group is for.
- DependencyGroups.Dependencies: RECORD (REPEATED)
- The requirements for this dependency group.
- DependencyGroups.Dependencies.Name: STRING (REQUIRED)
- The name of the package.
- DependencyGroups.Dependencies.Requirement: STRING (REQUIRED)
- The requirement on the package.
NuGetRequirementsLatest
This view contains the most recent snapshot in the NuGetRequirements table.
PackageVersionHashes
This table can be used to find package-versions from a given hash. Each row represents a single (hash, package-version) pair.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- Hash: STRING (REQUIRED)
- A base64 encoded string of the hash.
- HashType: STRING (REQUIRED)
- The type of hash, example values include 'SHA512' and 'SHA1'.
- System: STRING (REQUIRED)
- The dependency management system of the package-version.
- Name: STRING (REQUIRED)
- The name of the package-version.
- Version: STRING (REQUIRED)
- The version of the package-version.
PackageVersionHashesLatest
This view contains the most recent snapshot in the PackageVersionHashes table.
PackageVersionToProject
This table represents the relation between package-versions and project. Each row represents a single pair of package-version and project. This relation is not guaranteed to be authoritative as the owner of a package may list a link to any project and the resulting relationship is not verified.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- System: STRING (REQUIRED)
- The dependency management system of the package-version.
- Name: STRING (REQUIRED)
- The name of the package-version.
- Version: STRING (REQUIRED)
- The version of the package-version.
- ProjectType: STRING (REQUIRED)
- The type of the project.
- ProjectName: STRING (REQUIRED)
- The name of the project.
- RelationProvenance: STRING
- How the mapping between project and package version was discovered.
- RelationType: STRING
- What the relationship between the project and the package version is.
PackageVersionToProjectLatest
This view contains the most recent snapshot in the PackageVersionToProject table.
PackageVersions
This table stores information about specific versions of packages. Each row represents a single version of a package.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- System: STRING (REQUIRED)
- The dependency management system of the package-version.
- Name: STRING (REQUIRED)
- The name of the package-version.
- Version: STRING (REQUIRED)
- The version of the package-version.
- Purl: STRING
- The purl representing the package-version.
- Licenses: STRING (REPEATED)
- An array of the licenses used by this package-version, presented as SPDX expressions.
- Registries: STRING (REPEATED)
- An array of the package management registries this package-version is available from.
- Links: RECORD (REPEATED)
- An array of links related to this package-version, each consisting of a Label and URL.
- Links.Label: STRING (REQUIRED)
- A label for the corresponding URL value. Example values include 'DOCUMENTATION', 'ISSUE_TRACKER', 'HOMEPAGE' and 'SOURCE_REPO'.
- Links.URL: STRING (REQUIRED)
- The URL value.
- Advisories: RECORD (REPEATED)
- An array of Advisories that affect this package-version directly. This can be used to join against the Advisories table.
- Advisories.Source: STRING (REQUIRED)
- The upstream source of this advisory, example values include 'OSV', 'GHSA' and 'NSWG'.
- Advisories.SourceID: STRING (REQUIRED)
- The id of this advisory within its upstream source.
- VersionInfo: RECORD
- Information extracted from the string value in the Version column of this row.
- VersionInfo.IsRelease: BOOLEAN (REQUIRED)
- The value is true if the version is considered a release, as opposed to a pre-release.
- VersionInfo.Ordinal: INTEGER (REQUIRED)
- The value allows the versions of a package to be sorted in semantic order without parsing the version string. If a package has its versions sorted in ascending order by VersionInfo.Ordinal, the order will correspond to the order of the version strings as determined by the package management system.
- Hashes: RECORD (REPEATED)
- An array of hashes that have been observed from this package-version.
- Hashes.Type: STRING (REQUIRED)
- The type of hash, example values include 'SHA512' and 'SHA1'.
- Hashes.Hash: STRING (REQUIRED)
- A base64 encoded string of the hash.
- UpstreamPublishedAt: TIMESTAMP
- The time at which this package-version was published, as reported by the upstream source.
- UpstreamIdentifiers: RECORD (REPEATED)
- Non-normalized upstream identifiers encountered when syncing this package version.
- UpstreamIdentifiers.PackageName: STRING
- The non-normalized package name string.
- UpstreamIdentifiers.VersionString: STRING
- The non-normalized version string.
- UpstreamIdentifiers.Source: STRING (REQUIRED)
- The upstream source for this identifier.
- SLSAProvenance: RECORD
- Provenance information for this package version. Extracted from a SLSA provenance attestation.
- SLSAProvenance.SourceRepository: STRING (REQUIRED)
- The source code repository used to build this version.
- SLSAProvenance.Commit: STRING (REQUIRED)
- The commit of the source code repository the version was built from.
- SLSAProvenance.URL: STRING (REQUIRED)
- The URL of the provenance statement.
- SLSAProvenance.Verified: BOOLEAN (REQUIRED)
- The Sigstore bundle containing this attestation was verified using the [sigstore-go](https://github.com/sigstore/sigstore-go) library.
- DependenciesProcessed: BOOLEAN
- True if this version's dependency requirements have been processed.
- DependencyError: BOOLEAN
- True if an error was encountered while processing this version's dependency requirements.
PackageVersionsLatest
This view contains the most recent snapshot in the PackageVersions table.
Projects
This table stores information about projects. Each row represents a single project.
- SnapshotAt: TIMESTAMP (REQUIRED)
- When this row was exported.
- Type: STRING (REQUIRED)
- The type of the project, example values include 'APACHE_JIRA', 'BITBUCKET', 'GITHUB' and 'GITLAB'.
- Name: STRING (REQUIRED)
- The name of the project.
- OpenIssuesCount: INTEGER
- The number of open issues.
- StarsCount: INTEGER
- The number of stars in the upstream source.
- ForksCount: INTEGER
- The number of known forks.
- Licenses: STRING (REPEATED)
- A list of licenses for the project, presented as SPDX expressions.
- Description: STRING
- A human readable description of the project.
- Homepage: STRING
- A link to the homepage of the project.
- OSSFuzz: RECORD
- Details of this project’s testing by the OSS-Fuzz service. Only set if the project is tested by OSS-Fuzz.
- OSSFuzz.LineCount: INTEGER
- The number of lines of code in the project.
- OSSFuzz.LineCoverCount: INTEGER
- How many lines of code are covered by fuzzing.
- OSSFuzz.Date: TIMESTAMP
- The date the fuzz test that produced the coverage information was run against this project. The time portion of this field is midnight UTC.
- OSSFuzz.ConfigURL: STRING (REQUIRED)
- The URL containing the configuration for the project in the OSS-Fuzz repository.
ProjectsLatest
This view contains the most recent snapshot in the Projects table.
Snapshots
This table contains the timestamps for which a full snapshot of all other tables exists.
- Time: TIMESTAMP (REQUIRED)
- The time at which a snapshot of all other tables was taken.
Data
deps.dev aggregates data from a number of sources:
- Package data (including package and version names, descriptions, dependency requirements, etc)
- Project data (including project names, descriptions, forks and stars, etc)
- Security advisories
- Associated data
For details on using the data from these sources, please consult their documentation.
As well as aggregating data, deps.dev generates additional data, including resolved dependencies, advisory statistics, associations between entities, etc. This generated data is available under a CC-BY 4.0 license.
Contact us
If you have questions about the dataset, or want to report a problem, please create an issue or contact us at depsdev@google.com.