DMTN-238: RSP DataLink service implementation strategy

  • Russ Allbery

Latest Revision: 2022-11-16

Abstract

The Rubin Science Platform provides an IVOA TAP service that returns structured information about Rubin Observatory data products. To allow easy user exploration of that data, particularly via the Portal Aspect, we’ve identified at least three types of information that we may want to associate with the returned records: how to download the image associated with an observation, how to retrieve a cutout of that image, and additional related TAP queries that the user may wish to explore. We expect to have additional similar use cases in the future. This tech note describes how we meet those requirements using the IVOA DataLink protocol and the datalinker web service.

Service descriptors

The second place we use DataLink is to provide easy access to additional queries related to the results of a TAP query. We add those service descriptors to the VOTable returned by the TAP query, pointing to a URL provided by the datalinker service with parameters filled in either by elements from the row of the query or by data entered by the user. That service, in turn, constructs a new TAP query and redirects the user back to the TAP service to perform that query.

For example, one of the additional queries we provide retrieves a light curve for an object. We attach that service descriptor to any results that include the objectId column of the Object table. The access URL of the DataLink entry points to <rsp-base-url>/datalink/timeseries. This URL takes the following parameters:

  1. id is set to the value of objectId

  2. table (the table containing time series data) is set to the ForcedSource table

  3. id_column (the foreign key in that table) is set to objectId

  4. join_time_column (the column containing the observation time) is set to the expMidptMJD column in the CcdVisit table

  5. band is chosen by the user and is either all or any single one of u, g, r, i, z, or y

  6. detail is chosen by the user from full, principal, or minimal (discussed further in datalinker links configuration)

This is only an example; the details are specific to this one schema and are not important here. The general principle is that any result column can be linked to a service, specifying some mix of fixed parameters (based on the result row or on known details of the table to which the DataLink service descriptor is attached) and user-chosen parameters. The Portal Aspect and other IVOA clients can then prompt the user for the required parameters and construct an access URL for that service.
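As an illustration of how a client might assemble such an access URL from the six parameters listed above, here is a minimal Python sketch. The base URL, the function name build_timeseries_url, and the exact string forms of the table and join_time_column values are assumptions made for the example; the real service descriptor may use fully qualified names.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real deployment substitutes its own <rsp-base-url>.
RSP_BASE_URL = "https://rsp.example.org"


def build_timeseries_url(object_id: int, band: str = "all", detail: str = "full") -> str:
    """Build an access URL for the timeseries service (illustrative only)."""
    # Validate the two user-chosen parameters against the documented values.
    if band != "all" and band not in set("ugrizy"):
        raise ValueError(f"unknown band: {band!r}")
    if detail not in {"full", "principal", "minimal"}:
        raise ValueError(f"unknown detail level: {detail!r}")
    params = {
        # Fixed parameters: taken from the result row or from the descriptor.
        # The unqualified table and column names here are assumptions.
        "id": object_id,
        "table": "ForcedSource",
        "id_column": "objectId",
        "join_time_column": "expMidptMJD",
        # User-chosen parameters.
        "band": band,
        "detail": detail,
    }
    return f"{RSP_BASE_URL}/datalink/timeseries?{urlencode(params)}"


print(build_timeseries_url(1234, band="r"))
```

In practice the Portal Aspect or another IVOA client performs this step from the service descriptor itself, so nothing here needs to be hard-coded by the user.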

When that URL is visited, the effect will be to execute a new TAP query and return the result as a new VOTable. That VOTable can, in turn, have new service descriptors, allowing the user to follow these service prompts to explore the data further if they choose.

Making this all work involves several moving parts.

TAP configuration

Each of these DataLink service descriptors is defined in an XML file in the sdm_schemas repository. Each file is a verbatim copy of the service descriptor to return, apart from an <INFO> tag that identifies the column to which the service descriptor is attached.

<INFO name="$dp02_dc2_catalogs_Object_objectId$"
      ID="$dp02_dc2_catalogs_Object_objectId$"
      value="this will be dropped..." />

A parameter in the resulting DataLink service descriptor will then look like:

<PARAM name="id" datatype="long" ref="$dp02_dc2_catalogs_Object_objectId$"
       value="" ucd="meta.id">

In the DataLink service descriptor returned to the client, that ref value will be replaced with a reference to the corresponding column in the result VOTable.
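The substitution can be pictured with a short Python sketch. This is not the TAP service's actual code: the function name resolve_refs and the VOTable field ID col_0 are hypothetical, and the sketch only shows the idea of replacing a $...$ placeholder with the ID of the matching result column.

```python
import re

# The PARAM line from the example above, joined onto one line.
TEMPLATE = (
    '<PARAM name="id" datatype="long" '
    'ref="$dp02_dc2_catalogs_Object_objectId$" value="" ucd="meta.id">'
)


def resolve_refs(template: str, field_ids: dict[str, str]) -> str:
    """Replace each $placeholder$ with the ID of the matching result column."""

    def substitute(match: re.Match) -> str:
        # Raises KeyError if the result has no column for this placeholder.
        return field_ids[match.group(1)]

    return re.sub(r"\$(\w+)\$", substitute, template)


# "col_0" is a made-up field ID standing in for whatever the VOTable uses.
print(resolve_refs(TEMPLATE, {"dp02_dc2_catalogs_Object_objectId": "col_0"}))
```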

All DataLink service descriptors are described in a JSON manifest file named datalink-manifest.json. This is a JSON object whose keys are the names of the XML files (found in the same directory) that define the DataLink service descriptors, and whose values are lists of the columns to which each service descriptor applies. Periods in the column names are replaced with underscores. For example, a manifest containing only an entry for the DataLink service descriptor described above would look like:

{
    "dp02_object_to_fs_timeseries": ["dp02_dc2_catalogs_Object_objectId"]
}
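To make the lookup concrete, the following Python sketch finds the service descriptors that apply to a given result column. The helper names manifest_key and descriptors_for are illustrative and are not taken from the TAP service; only the manifest contents and the period-to-underscore rule come from the description above.

```python
import json

MANIFEST_JSON = """
{
    "dp02_object_to_fs_timeseries": ["dp02_dc2_catalogs_Object_objectId"]
}
"""


def manifest_key(column: str) -> str:
    """Convert a qualified column name to the form used in the manifest."""
    return column.replace(".", "_")


def descriptors_for(manifest: dict[str, list[str]], column: str) -> list[str]:
    """Return the names of all descriptors attached to the given column."""
    key = manifest_key(column)
    return [name for name, columns in manifest.items() if key in columns]


manifest = json.loads(MANIFEST_JSON)
# prints ['dp02_object_to_fs_timeseries']
print(descriptors_for(manifest, "dp02_dc2_catalogs.Object.objectId"))
```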

Each time a release is made from the sdm_schemas repository, a GitHub release artifact is created (via GitHub Actions) named datalink-snippets.zip. The TAP service is then configured (using datalinkPayloadUrl in its values.yaml file in Phalanx) to point to that artifact.

When the TAP service is restarted, the artifact at that URL is retrieved and unpacked into a directory. The Rubin Science Platform TAP service then reads that directory and uses its contents to annotate results.
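The retrieve-and-unpack step might look roughly like the Python sketch below. It is not the TAP service's implementation; the function names are hypothetical, and the sketch assumes that datalink-manifest.json sits at the top level of the zip archive.

```python
import io
import json
import zipfile
from pathlib import Path
from urllib.request import urlopen


def fetch_payload(url: str) -> bytes:
    """Download the release artifact (requires network access)."""
    with urlopen(url) as response:
        return response.read()


def unpack_snippets(payload: bytes, dest: Path) -> dict[str, list[str]]:
    """Unpack the zip archive into dest and return the parsed manifest."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
    # Assumption: the manifest is at the top level of the archive.
    manifest_path = dest / "datalink-manifest.json"
    return json.loads(manifest_path.read_text())
```

A caller would pass the value of datalinkPayloadUrl to fetch_payload and hand the resulting bytes to unpack_snippets, after which the XML files named in the manifest can be read from dest.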

datalinker Butler configuration

Currently, datalinker calls Butler directly and therefore has to be configured with the storage used by Butler, as well as the PostgreSQL database that Butler uses for the metadata needed to find the files. In order for the Butler to connect to the PostgreSQL server (known as a Butler repository), the PGUSER and PGPASSFILE environment variables must be set. These are typically loaded from the vault containing the secrets for the environment.

Depending on the environment, the storage for the Butler may be hosted on Google Cloud Storage (GCS) or on a privately hosted S3 service. To configure this, first set STORAGE_BACKEND to either GCS or S3. If GCS is chosen, set S3_ENDPOINT_URL to https://storage.googleapis.com (the default). If S3 is chosen, set S3_ENDPOINT_URL to the base HTTP URL of the private S3 service. The storage credentials are secrets, many of which come from the vault containing the secrets for the environment.

Regardless of whether GCS or S3 is chosen, these credentials are used to sign the image URLs.
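The storage settings can be summarized in a small Python sketch that reads them from an environment mapping. This is not datalinker's actual configuration code; the StorageConfig class and load_storage_config function are hypothetical, and only the variable names, the permitted values, and the default endpoint come from the description above.

```python
from dataclasses import dataclass
from typing import Mapping

# Default endpoint, as described above for the GCS backend.
GCS_ENDPOINT = "https://storage.googleapis.com"


@dataclass(frozen=True)
class StorageConfig:
    backend: str
    endpoint_url: str


def load_storage_config(env: Mapping[str, str]) -> StorageConfig:
    """Read the storage backend settings from an environment mapping."""
    backend = env.get("STORAGE_BACKEND", "")
    if backend not in {"GCS", "S3"}:
        raise ValueError("STORAGE_BACKEND must be either GCS or S3")
    # The GCS endpoint is used when S3_ENDPOINT_URL is not set; a private
    # S3 deployment is expected to override it with its own base URL.
    endpoint_url = env.get("S3_ENDPOINT_URL", GCS_ENDPOINT)
    return StorageConfig(backend=backend, endpoint_url=endpoint_url)
```

A service would typically call load_storage_config(os.environ) once at startup, so that a missing or misspelled backend is reported before any URL is signed.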

Future work

  • Once client/server Butler (DMTN-176) is available, datalinker is expected to use it for the query and request a signed URL directly from it rather than signing its own.

  • Currently, each time there is a new release of the schemas, the Phalanx configuration of both datalinker and TAP has to be updated to pick up the new versions of the release artifacts they use for configuration. This creates an unwanted opportunity for version mismatches between components of a given Science Platform deployment.

    In the future, rather than use the current mechanism of directly retrieving artifacts from GitHub, we plan to have the tap-schema service, which already serves the TAP schema database, also deploy a microservice that provides this data. TAP and datalinker will then query that service within the same Science Platform deployment, and the data returned by the service will always match the deployed version of the TAP schema.

  • The minimal value of the detail parameter is currently not fully defined and has not been tested against the types of data exploration that users may wish to perform.