Open Source Insights’ BigQuery dataset

Overview

The Open Source Insights dataset provides signals that both maintainers and consumers of open-source software (OSS) can use to improve the health of their software.

The dataset is provided as a BigQuery public dataset.

Getting started

Prerequisites

To use the BigQuery dataset, you will need a Google Cloud account with the BigQuery API enabled. If you already use BigQuery, feel free to skip this section.

  1. Sign in to your Google Cloud account. If you’re new to Google Cloud, create an account.

  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Ensure the BigQuery API is enabled. It will be enabled by default in new projects.

  4. Optional: Enable billing for the project. You will not necessarily be charged because BigQuery provides a free usage tier detailed in the pricing documentation. If you do not want to enable billing, you can still explore the data using the BigQuery sandbox, but some queries will fail if the sandbox cannot guarantee they will be below the free usage limits.

  5. Optional: Get familiar with BigQuery in the Google Cloud console, how to query public datasets and the fundamentals of BigQuery analytics.

Sample queries

This section contains sample queries that demonstrate some of the ways the dataset can be used.

Using the BigQuery Sandbox it is possible to run small queries and preview table contents without providing a credit card or setting up a billing account. See the prerequisites for more details.

Find advisories by package version

This example demonstrates searching for advisories that disclose vulnerabilities in a set of package versions, such as the dependencies of your application. Using one of the BigQuery client libraries this type of query could be integrated into an automated test for known vulnerabilities in your project’s dependencies.

First we will construct a temporary table that contains the set of package versions that we wish to query. Then to gather the advisories that affected them we join the PackageVersionsLatest view with our temporary table. We will also use the comma operator to unnest the list of advisories that may affect each of these package versions.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  p.System,
  p.Name,
  p.Version,
  Advisory
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Advisories AS Advisory
JOIN
  Packages
USING
  (System, Name, Version);

In addition to the source and id of each advisory we can fetch its severity and title. To do this we will join with the AdvisoriesLatest view, and aggregate the advisories that affect a single package version into an array of records.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  p.System,
  p.Name,
  p.Version,
  ARRAY_AGG(
    STRUCT(
      a.Source,
      a.SourceID,
      a.Title,
      a.Severity,
      a.GitHubSeverity)) AS Advisories
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Advisories AS Advisory
JOIN
  Packages
USING
  (System, Name, Version)
JOIN
  `bigquery-public-data.deps_dev_v1.AdvisoriesLatest` AS a
ON
  a.Source = Advisory.Source
  AND a.SourceID = Advisory.SourceID
GROUP BY
  System,
  Name,
  Version;

Find licenses by package version

This example demonstrates searching for a set of licenses used by a set of package versions. Like in the previous example this set of package versions may be you application’s dependencies, and you may wish to check the licenses are compatible with your products distribution.

The query has a very similar structure to that used in the previous example. The license information is taken from the PackageVersionsLatest view. But in this query there is no need to join with the Advisories table.

CREATE TEMP TABLE Packages (
  System string,
  Name string,
  Version string)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
       ('NPM', 'react-dom', '16.0.0'),
       ('NPM', 'react-is', '16.13.1'),
       ('NPM', 'react-router', '5.2.1'),
       ('NPM', 'qs', '6.3.1'),
       ('NPM', 'send', '0.1.0');

SELECT
  License,
  ARRAY_AGG(
    STRUCT(
      p.System,
      p.Name,
      p.Version)) AS Packages
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Licenses AS License
JOIN
  Packages
USING
  (System, Name, Version)
GROUP BY
  License;

Finding common licenses

This example computes the three most common licenses per system. It is intended to demonstrate simple ecosystem-wide analysis with the Open Source Insights dataset.

Each version of each package can specify multiple licenses. This is found in the PackageVersions table which contains a column ‘Licenses’ containing an array of licenses. To count the number of distinct packages using each license within each system we unnest the array and count the distinct package names among the resulting rows:

SELECT
  p.System,
  License,
  COUNT(DISTINCT p.Name) AS NPackages
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Licenses AS License
GROUP BY
  System,
  License;

If we wish to see just the top 3 licenses per system we can make use of BigQuery’s window functions to compute a ranking:

WITH
  Counts AS (
    SELECT
      p.System,
      License,
      COUNT(DISTINCT p.Name) AS NPackages
    FROM
      `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
      p.Licenses AS License
    GROUP BY
      System,
      License),

  Ranked AS (
    SELECT
      System,
      License,
      NPackages,
      ROW_NUMBER()
        OVER (PARTITION BY System ORDER BY NPackages DESC) AS LicenseRank
    FROM
      Counts)

SELECT
  System,
  License,
  NPackages
FROM
  Ranked
WHERE
  LicenseRank <= 3
ORDER BY
  System,
  LicenseRank;

Dependent count

Using the dependency graphs we can determine which packages have the most dependents within their ecosystem.

In order to compute a count that reflects the current state of the ecosystem we only count the highest release of each package. Here ‘highest’ is determined by the order defined by semantic versioning or the system equivalent. So our query will first filter the set of all available package versions to just the highest release per version. This can be achieved by making use of the VersionInfo column in the PackageVersions table. This column contains information extracted from the version string, allowing us to know which versions are releases and the position of each version in the ordered list of all versions for the corresponding package.

DECLARE
  Sys STRING DEFAULT 'CARGO';

SELECT
  Name,
  Version,
FROM (
  SELECT
    Name,
    Version,
    ROW_NUMBER()
      OVER (PARTITION BY
              Name
            ORDER BY
              VersionInfo.Ordinal DESC) AS RowNumber
    FROM
      `bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
    WHERE
      System = Sys
      AND VersionInfo.IsRelease)
WHERE RowNumber = 1;

We will count how many of these ‘highest release’ versions depend on each version available. This can be achieved by using the Dependencies table which contains a relation between pairs of package versions, representing the transitive closure of each dependency graph.

DECLARE
  Sys STRING DEFAULT 'CARGO';

WITH
  HighestReleases AS (
    SELECT
      Name,
      Version,
    FROM (
      SELECT
        Name,
        Version,
        ROW_NUMBER()
          OVER (PARTITION BY
                  Name
                ORDER BY
                  VersionInfo.Ordinal DESC) AS RowNumber
        FROM
          `bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
        WHERE
          System = Sys
          AND VersionInfo.IsRelease)
    WHERE RowNumber = 1)

SELECT
  D.Dependency.Name,
  D.Dependency.Version,
  COUNT(*) AS NDependents
FROM
  `bigquery-public-data.deps_dev_v1.DependenciesLatest` AS D
JOIN
  HighestReleases AS H
USING
  (Name, Version)
WHERE
  D.System = Sys
GROUP BY
  D.Dependency.Name,
  D.Dependency.Version
ORDER BY
  NDependents DESC
LIMIT
  10;

Schema

This section describes the schema used for the tables and views in the dataset.

Advisories

This table stores information about security advisories related to packages from any of npm, Go, Maven, PyPI and Cargo. Each row represents a single advisory and can be identified by Source and SourceID.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Source: STRING (REQUIRED)
The upstream source of this advisory, example values include 'GHSA', 'NSWG' and 'OSV'.
SourceID: STRING (REQUIRED)
The id of this advisory within its upstream source.
SourceURL: STRING (REQUIRED)
A link to this advisory in its upstream source.
Title: STRING
Brief human-readable description of the advisory.
Description: STRING
A description of the advisory.
ReferenceURLs: STRING (REPEATED)
URLs that elaborate on an advisory: longer descriptions, corroborating evidence, patches, issue trackers, release notes and so on.
CVSS3Score: NUMERIC
A numeric measure of severity according to the CVSS3 spec. The measure ranges from 0-10 and is to 1 decimal place.
Severity: STRING
A discrete scale of the severity, derived from the CVSS3Score. Example values include 'NONE', 'LOW', 'MEDIUM', 'HIGH', 'CRITICAL', 'UNKNOWN'.
GitHubSeverity: STRING
The severity of this advisory as determined by GitHub. Example values include 'LOW', 'MODERATE', 'HIGH', 'CRITICAL', 'UNKNOWN'.
Disclosed: TIMESTAMP
The time at which this advisory was publicly disclosed.
Packages: RECORD (REPEATED)
An array of the packages affected by this advisory.
PackagesSystem: STRING (REQUIRED)
The system of the affected package
PackagesName: STRING (REQUIRED)
The name of the affected package
PackagesAffectedVersions: STRING (REQUIRED)
The versions of the package affected by the advisory
PackagesUnaffectedVersions: STRING (REQUIRED)
The versions of the package unaffected by the advisory

AdvisoriesLatest

This view contains the most recent snapshot in the Advisories table.

Dependencies

This table stores information about the dependency relationship between versions of packages. Each row represents a dependency of one package-version importing (depending on) another. These dependencies include both direct and indirect dependencies.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the importing package-version in the relation.
Name: STRING (REQUIRED)
The name of the importing package-version in the relation.
Version: STRING (REQUIRED)
The version of the importing package-version in the relation.
Dependency: RECORD (REQUIRED)
The imported package-version.
DependencySystem: STRING (REQUIRED)
The system of the imported package-version.
DependencyName: STRING (REQUIRED)
The name of the imported package-version.
DependencyVersion: STRING (REQUIRED)
The version of the imported package-version.
MinimumDepth: INTEGER
The minimum depth of this dependency from the root.

DependenciesLatest

This view contains the most recent snapshot in the Dependencies table.

DependencyGraphEdges

This table stores the full dependency graph for each version of each package. Each row represents one edge in a dependency graph.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version at the root of the dependency graph.
Name: STRING (REQUIRED)
The name of the package-version at the root of the dependency graph.
Version: STRING (REQUIRED)
The version of the package-version at the root of the dependency graph.
Requirement: STRING (REQUIRED)
A string to represent how this particular dependency was required. Its format is system specific.
From: RECORD (REQUIRED)
The importing package-version.
FromSystem: STRING (REQUIRED)
The system of the importing package-version.
FromName: STRING (REQUIRED)
The name of the importing package-version.
FromVersion: STRING (REQUIRED)
The version of the importing package-version.
To: RECORD (REQUIRED)
The imported package-version.
ToSystem: STRING (REQUIRED)
The system of the imported package-version.
ToName: STRING (REQUIRED)
The name of the imported package-version.
ToVersion: STRING (REQUIRED)
The version of the imported package-version.

DependencyGraphEdgesLatest

This view contains the most recent snapshot in the DependencyGraphEdges table.

PackageVersionHashes

This table can be used to find package-versions from a given hash. Each row represents a single (hash, package-version) pair.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Hash: STRING (REQUIRED)
A base64 encoded string of the hash.
HashType: STRING (REQUIRED)
The type of hash, example values include 'SHA512' and 'SHA1'.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.

PackageVersionHashesLatest

This view contains the most recent snapshot in the PackageVersionHashes table.

PackageVersionToProject

This table represents the relation between package-versions and project. Each row represents a single pair of package-version and project. This relation is not guaranteed to be authoritative as the owner of a package may list a link to any project and the resulting relationship is not verified.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.
ProjectType: STRING (REQUIRED)
The type of the project.
ProjectName: STRING (REQUIRED)
The name of the project.

PackageVersionToProjectLatest

This view contains the most recent snapshot in the PackageVersionToProject table.

PackageVersions

This table stores information about specific versions of packages. Each row represents a single version of a package.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
System: STRING (REQUIRED)
The dependency management system of the package-version.
Name: STRING (REQUIRED)
The name of the package-version.
Version: STRING (REQUIRED)
The version of the package-version.
Licenses: STRING (REPEATED)
An array of the licenses used by this package-version, presented as SPDX expressions.
Links: RECORD (REPEATED)
An array of links related to this package-version, each consisting of a Label and URL.
LinksLabel: STRING (REQUIRED)
A label for the corresponding URL value. Example values include 'DOCUMENTATION', 'ISSUE_TRACKER', 'HOMEPAGE' and 'SOURCE_REPO'.
LinksURL: STRING (REQUIRED)
The URL value.
Advisories: RECORD (REPEATED)
An array of Advisories that affect this package-version directly. This can be used to join against the Advisories table.
AdvisoriesSource: STRING (REQUIRED)
The upstream source of this advisory, example values include 'OSV', 'GHSA' and 'NSWG'.
AdvisoriesSourceID: STRING (REQUIRED)
The id of this advisory within its upstream source.
VersionInfo: RECORD
Information extracted from the string value in the Version column of this row.
VersionInfoIsRelease: BOOLEAN (REQUIRED)
The value is true if the version is considered a release, as opposed to a pre-release.
VersionInfoOrdinal: INTEGER (REQUIRED)
The value allows the versions of a package to be sorted in semantic order without parsing the version string. If a package has its versions sorted in ascending order by VersionInfo.Ordinal, the order will correspond to the order of the version strings as determined by the package management system.
Hashes: RECORD (REPEATED)
An array of hashes that have been observed from this package-version.
HashesType: STRING (REQUIRED)
The type of hash, example values include 'SHA512' and 'SHA1'.
HashesHash: STRING (REQUIRED)
A base64 encoded string of the hash.

PackageVersionsLatest

This view contains the most recent snapshot in the PackageVersions table.

Projects

This table stores information about projects. Each row represents a single project.

SnapshotAt: TIMESTAMP (REQUIRED)
When this row was exported.
Type: STRING (REQUIRED)
The type of the project, example values include 'APACHE_JIRA', 'BITBUCKET', 'GITHUB' and 'GITLAB'.
Name: STRING (REQUIRED)
The name of the project.
OpenIssuesCount: INTEGER
The number of open issues.
StarsCount: INTEGER
The number of stars in the upstream source.
ForksCount: INTEGER
The number of known forks.
Licenses: STRING (REPEATED)
A list of licenses for the project, presented as SPDX expressions.
Description: STRING
A human readable description of the project.
Homepage: STRING
A link to the homepage of the project.

ProjectsLatest

This view contains the most recent snapshot in the Projects table.

Snapshots

This table contains the timestamps for which a full snapshot of all other tables exists.

Time: TIMESTAMP (REQUIRED)
The time at which a snapshot of all other tables was taken.