DMTN-238: RSP DataLink service implementation strategy

  • Russ Allbery

Latest Revision: 2022-09-30

Abstract

The Rubin Science Platform provides an IVOA TAP service that returns structured information about Rubin Observatory data products. To allow easy user exploration of that data, particularly via the Portal Aspect, we’ve identified at least three types of information that we may want to associate with the returned records: how to download the image associated with an observation, how to retrieve a cutout of that image, and additional related TAP queries that the user may wish to explore. We expect to have additional similar use cases in the future. This tech note describes how we meet those requirements using the IVOA DataLink protocol and the datalinker web service.

Service descriptors

The second place we use DataLink is to provide easy access to additional queries related to the results of a TAP query. We add those service descriptors to the VOTable returned by the TAP query, pointing to a URL provided by the datalinker service with parameters filled in either by elements from the row of the query or by data entered by the user. That service, in turn, constructs a new TAP query and redirects the user back to the TAP service to perform that query.

For example, one of the additional queries we provide retrieves a light curve for an object. We attach that service descriptor to any results that include the objectId column of the Object table. The access URL of the DataLink entry points to <rsp-base-url>/datalink/timeseries. This URL takes the following parameters:

  1. id is set to the value of objectId

  2. table (the table containing time series data) is set to the ForcedSource table

  3. id_column (the foreign key in that table) is set to objectId

  4. join_time_column (the column containing the observation time) is set to the expMidptMJD column in the CcdVisit table

  5. band is chosen by the user: either all or any one of the ugrizy bands

  6. detail is up to the user to choose from full, principal, or minimal (discussed further in datalinker service)

This is just an example; the details are specific to this one schema and matter less than the general principle. Any result column can be linked to a service, specifying some mix of fixed parameters (based on the result row or on known details of the table to which the DataLink service descriptor is attached) and user-chosen parameters. The Portal Aspect and other IVOA clients can then prompt the user for the required parameters and construct an access URL for that service.
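As a rough sketch of what a client might do with the light-curve example above, the following Python builds an access URL from the fixed and user-chosen parameters. The base URL, the function name, and the exact string encoding of the table and column values are assumptions made for illustration; only the parameter names and the /datalink/timeseries path come from the description above.

```python
from urllib.parse import urlencode

# Hypothetical deployment base URL; substitute the real <rsp-base-url>.
BASE_URL = "https://rsp.example.org"

# Values the user may choose for the band and detail parameters.
BANDS = {"all", *"ugrizy"}
DETAILS = {"full", "principal", "minimal"}


def timeseries_url(object_id: int, band: str = "all", detail: str = "full") -> str:
    """Build an access URL for the light-curve service descriptor."""
    if band not in BANDS:
        raise ValueError(f"unknown band: {band}")
    if detail not in DETAILS:
        raise ValueError(f"unknown detail: {detail}")
    params = {
        # Fixed parameters, taken from the result row or the table definition.
        "id": object_id,
        "table": "ForcedSource",
        "id_column": "objectId",
        "join_time_column": "expMidptMJD",
        # Parameters chosen by the user.
        "band": band,
        "detail": detail,
    }
    return f"{BASE_URL}/datalink/timeseries?{urlencode(params)}"
```

A client would call something like timeseries_url(1234, band="r", detail="minimal") after prompting the user for the last two values.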

When that URL is visited, the effect will be to execute a new TAP query and return the result as a new VOTable. That VOTable can, in turn, have new service descriptors, allowing the user to follow these service prompts to explore the data if they choose.

Making this all work involves several moving parts.

TAP configuration

Each of these DataLink service descriptors is defined in an XML file in the sdm_schemas repository. This is a verbatim copy of the service descriptor to return, apart from an <INFO> tag that describes the column to which the service descriptor is attached.

<INFO name="$dp02_dc2_catalogs_Object_objectId$"
      ID="$dp02_dc2_catalogs_Object_objectId$"
      value="this will be dropped..." />

A parameter in the resulting DataLink service descriptor will then look like:

<PARAM name="id" datatype="long" ref="$dp02_dc2_catalogs_Object_objectId$"
       value="" ucd="meta.id" />

In the DataLink service descriptor returned to the client, that ref value will be replaced with a reference to the corresponding column in the result VOTable.
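The substitution can be pictured with a small Python sketch. This is not the TAP service's actual code; it is a plain string replacement under the assumption that the service knows which field ID in the result VOTable corresponds to each placeholder. The function name and the field ID col_0 are hypothetical.

```python
def resolve_refs(snippet: str, column_ids: dict[str, str]) -> str:
    """Replace each $placeholder$ ref with the ID of a result column.

    column_ids maps a placeholder name (without the dollar signs) to the
    ID of the matching FIELD in the result VOTable.
    """
    for placeholder, column_id in column_ids.items():
        snippet = snippet.replace(f'ref="${placeholder}$"', f'ref="{column_id}"')
    return snippet


param = (
    '<PARAM name="id" datatype="long" '
    'ref="$dp02_dc2_catalogs_Object_objectId$" value="" ucd="meta.id" />'
)
# col_0 is a hypothetical field ID assigned by the TAP service.
resolved = resolve_refs(param, {"dp02_dc2_catalogs_Object_objectId": "col_0"})
```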

All DataLink service descriptors are described in a JSON manifest file named datalink-manifest.json. This is a JSON object, the keys of which are the names of the XML files found in the same directory that define the DataLink service descriptor, and the values are lists of columns to which that service descriptor applies. Periods in the columns are replaced with underscores. For example, a manifest containing only an entry for the DataLink service descriptor described above would look like:

{
    "dp02_object_to_fs_timeseries": ["dp02_dc2_catalogs_Object_objectId"]
}
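To illustrate how such a manifest could be consulted, the sketch below finds the service descriptors that apply to a set of result columns, replacing periods with underscores as described above. The function name is hypothetical, and the sketch assumes result columns are identified by their fully qualified names.

```python
import json

MANIFEST = json.loads("""
{
    "dp02_object_to_fs_timeseries": ["dp02_dc2_catalogs_Object_objectId"]
}
""")


def descriptors_for(columns: list[str], manifest: dict[str, list[str]]) -> list[str]:
    """Return the names of the descriptors that apply to any given column."""
    # Manifest entries use underscores where qualified column names use periods.
    keys = {column.replace(".", "_") for column in columns}
    return [name for name, applies_to in manifest.items() if keys & set(applies_to)]
```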

Each time a release is made from the sdm_schemas repository, a GitHub release artifact is created (via GitHub Actions) named datalink-snippets.zip. The TAP service is then configured (using datalinkPayloadUrl in its values.yaml file in Phalanx) to point to that artifact.

When the TAP service is restarted, the artifact at that URL is downloaded and unpacked into a directory. The Rubin Science Platform TAP service then reads that directory and uses it to annotate results.
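The unpacking step might look roughly like the following Python sketch, which works on the already-downloaded bytes of the artifact. It assumes that datalink-manifest.json sits at the root of the archive and that every service descriptor is a file ending in .xml; the function name is hypothetical, and the real TAP service is not necessarily implemented this way.

```python
import io
import json
import zipfile
from pathlib import PurePosixPath


def unpack_payload(payload: bytes) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Return (manifest, snippets) from the bytes of datalink-snippets.zip.

    snippets maps the base name of each XML file to its contents.
    """
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        # Assumption: the manifest is stored at the root of the archive.
        manifest = json.loads(archive.read("datalink-manifest.json"))
        snippets = {
            PurePosixPath(name).stem: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if name.endswith(".xml")
        }
    return manifest, snippets
```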

datalinker service

The implementation of this service on the datalinker side is conceptually simple. It uses the parameters to construct a TAP query, and then returns a redirect to the TAP service sync API to perform that query. Because this service is effectively a front end to TAP queries, it is protected by the read:tap scope.
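A minimal sketch of that flow, assuming a hypothetical TAP sync URL and a simplified query with no join against the time column, could look like this in Python. LANG, REQUEST, and QUERY are standard TAP parameters; everything else here (the URL, the function name, the query shape) is an assumption for illustration rather than datalinker's actual code.

```python
from urllib.parse import urlencode

# Hypothetical location of the TAP sync endpoint in the same deployment.
TAP_SYNC_URL = "https://rsp.example.org/api/tap/sync"


def tap_redirect_url(
    table: str, id_column: str, object_id: int, columns: list[str]
) -> str:
    """Build the TAP sync URL to which the client would be redirected.

    An empty column list selects every column. A real service must validate
    table and column names before interpolating them into the query.
    """
    select = ", ".join(columns) if columns else "*"
    query = f"SELECT {select} FROM {table} WHERE {id_column} = {int(object_id)}"
    params = {"LANG": "ADQL", "REQUEST": "doQuery", "QUERY": query}
    return f"{TAP_SYNC_URL}?{urlencode(params)}"
```

The web service would then return this URL in the Location header of an HTTP redirect response.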

The main complexity comes from the detail parameter. The purpose of detail is to allow the user to decide what columns should be included in the query result. The three valid values are:

full

Include all columns of the table being queried.

principal

Include only the principal (most generally useful) columns. These are the columns for which the principal flag is true in the corresponding TAP schema entry.

minimal [2]

Include only the minimal columns required to follow service descriptor links to other tables of interest. This is intended for use in data exploration, where the next link in the exploration isn’t the final destination but is only being followed to reach other tables of interest.

To respond correctly to requests with a detail value other than full, the datalinker service has to be able to look up the list of columns to include for a given table.

For principal, since this is the same as the principal columns in the TAP schema, the canonical source of this data is the underlying schema. For Rubin Observatory, the schema definitions live in the sdm_schemas repository, and are maintained by a software package called Felis. The principal columns are tagged with tap:principal: 1 in the Felis definition.

Therefore, to provide this information to the datalinker service, the same GitHub Actions build process that generates datalink-snippets.zip above also generates the artifact datalink-columns.zip. This ZIP archive contains two files, columns-minimal.json and columns-principal.json, which list the minimal and principal columns for each table.

columns-principal.json is generated automatically from the Felis schema, and the columns are ordered according to the order defined by the tap:column_index attributes of each column.
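The selection and ordering rule can be sketched in a few lines of Python. The sketch assumes each column of a table has already been loaded from the Felis schema as a dictionary carrying name, tap:principal, and tap:column_index keys; the function name and the sample column data are hypothetical, and this is not the actual build script.

```python
def principal_columns(columns: list[dict]) -> list[str]:
    """Return the names of principal columns, ordered by tap:column_index."""
    principal = [c for c in columns if c.get("tap:principal") == 1]
    # Columns without an index sort after all indexed columns.
    principal.sort(key=lambda c: c.get("tap:column_index", float("inf")))
    return [c["name"] for c in principal]


# Hypothetical column definitions for one table.
sample = [
    {"name": "coord_ra", "tap:principal": 1, "tap:column_index": 20},
    {"name": "objectId", "tap:principal": 1, "tap:column_index": 10},
    {"name": "parentObjectId", "tap:principal": 0, "tap:column_index": 5},
]
```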

There is not yet a Felis attribute for marking minimal columns, so for the time being the columns-minimal.json file is hand-maintained.
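Putting the two files together, the column lookup for a given detail value might be sketched as follows. The layout of the JSON files is not specified here, so the sketch assumes each is a mapping from table name to a list of column names; the column names and the function name are hypothetical.

```python
from typing import Optional

# Hypothetical contents of columns-principal.json and columns-minimal.json,
# assumed to map a table name to a list of column names.
PRINCIPAL = {"ForcedSource": ["objectId", "ccdVisitId", "band", "psfFlux"]}
MINIMAL = {"ForcedSource": ["objectId", "ccdVisitId"]}


def columns_for(detail: str, table: str) -> Optional[list[str]]:
    """Return the columns to select, or None to select every column."""
    if detail == "full":
        return None
    if detail == "principal":
        return PRINCIPAL[table]
    if detail == "minimal":
        return MINIMAL[table]
    raise ValueError(f"unknown detail value: {detail}")
```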

Future work

  • Currently, datalinker calls Butler directly and therefore has to be configured with a Google service account private key and the password to the underlying PostgreSQL database used by Butler. It uses the same service account to sign the image URLs.

    Once client/server Butler (DMTN-176) is available, datalinker is expected to use it for the query and request a signed URL directly from it rather than signing its own.

  • Currently, each time there is a new release of the schemas, the Phalanx configuration of both datalinker and TAP has to be updated to pick up the new versions of the release artifacts they use for configuration. This creates an unwanted opportunity for version mismatches between components of a given Science Platform deployment.

    In the future, rather than directly retrieving artifacts from GitHub, we plan to have the tap-schema service, which already serves the TAP schema database, also deploy a microservice that provides this data. TAP and datalinker will then query that service within the same Science Platform deployment, and the data returned by the service will always match the deployed version of the TAP schema.

  • The minimal value of the detail parameter is currently not fully defined and has not been tested against the kinds of data exploration users may wish to perform.