
Lineage in Geoscience Objects

Geoscience object versions can contain data lineage information - that is, information about the datasets and transformations that led to that version being created. This allows the relationships between objects to be explored, including provenance ("what was the ultimate source of this data?") and impact analysis ("what are the ultimate outputs derived from this data?").

Lineage can be included when publishing a geoscience object by using the lineage schema property. This property is available on all recent schema versions for all geoscience object types. The lineage schema definition can be found in the geoscience object schema repository.

The structure consists of a list of events, plus a self_link property.

Events

A lineage event represents a state transition that has happened within the data transformation(s) that resulted in this object version, such as the completion of a job run. Events follow the OpenLineage RunEvent format. Each lineage event includes the following elements:

  • job: an abstract process definition that takes input datasets and produces output datasets. It is identified by a unique name within a namespace.
  • run: the particular run of a job that this event occurred on. It is identified by a runId - a client-generated UUID, which must be unique to the run.
  • eventType: the state transition within the run that this event describes. For example, COMPLETE represents the event of a job run completing.
  • inputs: the set of input datasets.
  • outputs: the set of output datasets.
  • producer: a URL identifying the code that emitted the event data structure.

The example below shows a simple lineage event recording the completion of a compute task run, which produced the object update being published. The job run has two inputs, both of which are geoscience object versions.

{
  "eventType": "COMPLETE",
  "job": {
    "namespace": "evocompute",
    "name": "parsers.dxf:v1.0.0"
  },
  "run": {
    "runId": "123e4567-e89b-12d3-a456-426614174000"
  },
  "inputs": [
    {
      "namespace": "evogeoscienceobject://hub1/b3bfd062-820d-4aa4-9751-bda2f1cd8946/c2be55db-53f0-4032-b57f-dde3ddd678d8",
      "name": "65e24708-6ce3-4d85-b8a3-abcdb1ced751:1720050462228330567"
    },
    {
      "namespace": "evogeoscienceobject://hub1/b3bfd062-820d-4aa4-9751-bda2f1cd8946/c2be55db-53f0-4032-b57f-dde3ddd678d8",
      "name": "5b10a44d-75ee-4a3b-b0ae-61967a35fdc5:1720050462228330272"
    }
  ],
  "outputs": [
    {
      "namespace": "evogeoscienceobject://hub1/b3bfd062-820d-4aa4-9751-bda2f1cd8946/c2be55db-53f0-4032-b57f-dde3ddd678d8",
      "name": "$self"
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
  "schemaURL": "https://github.com/SeequentEvo/evo-schemas/blob/main/schema/components/lineage/1.0.0/lineage.schema.json#/$defs/RunEvent"
}
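
An event with this shape can be assembled programmatically before publishing. The following is a minimal Python sketch that builds the same structure as a plain dictionary. The function name make_complete_event and its parameters are hypothetical and not part of any Evo SDK; only the field names and example values are taken from the event shown above.

```python
import uuid

# Schema URL copied from the example event above.
RUN_EVENT_SCHEMA_URL = (
    "https://github.com/SeequentEvo/evo-schemas/blob/main/schema/components/"
    "lineage/1.0.0/lineage.schema.json#/$defs/RunEvent"
)


def make_complete_event(job_namespace, job_name, inputs, output_namespace,
                        producer, run_id=None):
    """Build a COMPLETE lineage event as a plain dict (hypothetical helper).

    inputs is a list of (namespace, name) tuples. A new UUID is generated
    for the run when run_id is not supplied.
    """
    return {
        "eventType": "COMPLETE",
        "job": {"namespace": job_namespace, "name": job_name},
        "run": {"runId": run_id or str(uuid.uuid4())},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        # The object version being published is referred to as $self.
        "outputs": [{"namespace": output_namespace, "name": "$self"}],
        "producer": producer,
        "schemaURL": RUN_EVENT_SCHEMA_URL,
    }


NAMESPACE = (
    "evogeoscienceobject://hub1/b3bfd062-820d-4aa4-9751-bda2f1cd8946/"
    "c2be55db-53f0-4032-b57f-dde3ddd678d8"
)

event = make_complete_event(
    job_namespace="evocompute",
    job_name="parsers.dxf:v1.0.0",
    inputs=[
        (NAMESPACE, "65e24708-6ce3-4d85-b8a3-abcdb1ced751:1720050462228330567"),
        (NAMESPACE, "5b10a44d-75ee-4a3b-b0ae-61967a35fdc5:1720050462228330272"),
    ],
    output_namespace=NAMESPACE,
    producer="https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
)
```

Generating the runId on the client with uuid.uuid4() matches the description of runId as a client-generated UUID; a caller that emits several events for one run would pass the same run_id to each.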

Which events to include?

When a version is published, its lineage should include all events available for the run that produced that version. Events for preceding runs that produced intermediate datasets outside Evo may also be included.

Where an input is an existing geoscience object, events related to the creation of that input should not be included, because that lineage should already be recorded on the input object version. This also applies when a previous version of the object being updated is used as an input, as the previous version should have its own lineage.

[Diagram showing events and datasets]

Above: an example scenario where the new version being published is derived from an external dataset and a previous object version. The events to include are those for the run doing the publishing and those for the run that created the external dataset.
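
The selection rule described in this section can be sketched in Python. This is one possible reading rather than a definitive implementation: the function select_events is hypothetical, it assumes the caller has already narrowed candidate_events to runs upstream of the publish, and it treats any dataset whose namespace starts with evogeoscienceobject:// as being inside Evo.

```python
EVO_PREFIX = "evogeoscienceobject://"


def select_events(candidate_events, publishing_run_id):
    """Pick the events to send with a publish (hypothetical helper).

    Keeps every event of the publishing run, plus events of preceding runs
    whose outputs are all outside Evo. Events of runs that produced existing
    geoscience objects are skipped, since those objects carry their own lineage.
    """
    selected = []
    for event in candidate_events:
        if event["run"]["runId"] == publishing_run_id:
            selected.append(event)
            continue
        outputs = event.get("outputs", [])
        # Assumption: a preceding run qualifies only if everything it
        # produced is an external (non-Evo) dataset.
        if outputs and all(
            not dataset["namespace"].startswith(EVO_PREFIX) for dataset in outputs
        ):
            selected.append(event)
    return selected


workspace = EVO_PREFIX + "hub1/instance/workspace"
external_run = {
    "run": {"runId": "run-external"},
    "outputs": [{"namespace": "file://shared", "name": "survey.csv"}],
}
earlier_object_run = {
    "run": {"runId": "run-earlier"},
    "outputs": [{"namespace": workspace, "name": "object-id:1"}],
}
publishing_run = {
    "run": {"runId": "run-publish"},
    "outputs": [{"namespace": workspace, "name": "$self"}],
}

chosen = select_events(
    [external_run, earlier_object_run, publishing_run], "run-publish"
)
```

In this scenario chosen contains the external run's event and the publishing run's event, mirroring the diagram: the run that created the earlier object version is left out.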

Datasets

Each dataset is identified by a unique name within a namespace.

The namespace for a dataset is the unique name for its datasource, and is structured as a URI. The namespace for geoscience objects follows the structure evogeoscienceobject://<hub>/<instance_id>/<workspace_id>.

The name for a dataset is the unique name for the dataset within the namespace. For geoscience objects, the standard name format is <geoscience_object_id>:<version>. The inclusion of the version identifier enables the tracing of relationships between specific versions of objects.
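
The two formats above are simple string templates, so building and parsing them takes only a few lines. The sketch below is illustrative: the function names are hypothetical, and the parser assumes that none of the identifiers contain "/" and that the object ID contains no ":".

```python
EVO_SCHEME = "evogeoscienceobject://"


def object_namespace(hub, instance_id, workspace_id):
    """Format a geoscience object namespace URI."""
    return f"{EVO_SCHEME}{hub}/{instance_id}/{workspace_id}"


def object_dataset_name(object_id, version):
    """Format a dataset name as <geoscience_object_id>:<version>."""
    return f"{object_id}:{version}"


def parse_object_dataset(namespace, name):
    """Split a geoscience object dataset back into its identifiers.

    Raises ValueError if the namespace or name does not match the formats
    above (for example, a name with no version part).
    """
    if not namespace.startswith(EVO_SCHEME):
        raise ValueError(f"not a geoscience object namespace: {namespace!r}")
    parts = namespace[len(EVO_SCHEME):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <hub>/<instance_id>/<workspace_id>: {namespace!r}")
    object_id, separator, version = name.partition(":")
    if not separator or not object_id or not version:
        raise ValueError(f"expected <geoscience_object_id>:<version>: {name!r}")
    hub, instance_id, workspace_id = parts
    return {
        "hub": hub,
        "instance_id": instance_id,
        "workspace_id": workspace_id,
        "object_id": object_id,
        "version": version,
    }


namespace = object_namespace(
    "hub1",
    "b3bfd062-820d-4aa4-9751-bda2f1cd8946",
    "c2be55db-53f0-4032-b57f-dde3ddd678d8",
)
name = object_dataset_name(
    "65e24708-6ce3-4d85-b8a3-abcdb1ced751", "1720050462228330567"
)
parsed = parse_object_dataset(namespace, name)
```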

Output object IDs and $self

Geoscience object IDs and version IDs are not known until the object is published (for an update, only the version ID is unknown). The output dataset that represents the object version currently being published, that is, the version the lineage is being sent with, should therefore use the special name $self.

Note that if the event has multiple outputs that are geoscience objects, the lineage sent to each should not include output datasets for the other created geoscience objects, as that would result in ambiguous $self identifiers. The future lineage querying API will automatically recombine these into a single event record.
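
One way to prepare per-object lineage from a multi-output event is sketched below in Python. The helper lineage_event_for_output is hypothetical, as are the placeholder output names in the usage example. It assumes the client holds a single combined event and treats any namespace starting with evogeoscienceobject:// as a geoscience object; outputs outside Evo are kept unchanged.

```python
import copy

EVO_PREFIX = "evogeoscienceobject://"


def lineage_event_for_output(event, target_index):
    """Return a copy of event tailored to one published object.

    The output at target_index is renamed to $self, other geoscience object
    outputs are dropped, and non-Evo outputs are kept as they are.
    The original event is not modified.
    """
    result = copy.deepcopy(event)
    outputs = []
    for index, dataset in enumerate(result["outputs"]):
        if index == target_index:
            outputs.append({**dataset, "name": "$self"})
        elif not dataset["namespace"].startswith(EVO_PREFIX):
            outputs.append(dataset)
    result["outputs"] = outputs
    return result


workspace = EVO_PREFIX + "hub1/instance/workspace"
combined_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "123e4567-e89b-12d3-a456-426614174000"},
    "outputs": [
        # Placeholder names: real IDs are unknown until publish.
        {"namespace": workspace, "name": "pending-object-a"},
        {"namespace": workspace, "name": "pending-object-b"},
        {"namespace": "file://shared", "name": "report.pdf"},
    ],
}

event_for_b = lineage_event_for_output(combined_event, 1)
```

Each resulting event then contains exactly one $self output, so the identifier stays unambiguous for the object it is sent with.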

The self_link value should be a JSON pointer to the output dataset that represents the object version being published (i.e. with name $self). If there are multiple events with this output, it should point at the latest.

The following, for example, points at the first event's first output dataset:

{
  ...
  "lineage": {
    "self_link": "/events/0/outputs/0",
    ...
  }
}
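
The self_link value can be computed from the event list rather than written by hand. The Python sketch below is illustrative only: both function names are hypothetical, and it assumes that events are listed oldest first, so that "the latest" event with a $self output is the last match in the list. The small resolver follows JSON Pointer (RFC 6901) escaping and is included only to check the result.

```python
def find_self_link(events):
    """Return a JSON pointer to the $self output of the latest event.

    Assumes events are ordered oldest first. Raises ValueError if no
    output dataset is named $self.
    """
    for event_index in range(len(events) - 1, -1, -1):
        outputs = events[event_index].get("outputs", [])
        for output_index, dataset in enumerate(outputs):
            if dataset.get("name") == "$self":
                return f"/events/{event_index}/outputs/{output_index}"
    raise ValueError("no output dataset named $self")


def resolve_pointer(document, pointer):
    """Resolve a JSON pointer against nested dicts and lists."""
    node = document
    for token in pointer.split("/")[1:]:
        # Undo JSON Pointer escaping: ~1 is "/" and ~0 is "~".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


events = [
    {"outputs": [{"namespace": "file://shared", "name": "survey.csv"}]},
    {"outputs": [{"namespace": "evogeoscienceobject://hub1/i/w", "name": "$self"}]},
]
lineage = {"events": events}
lineage["self_link"] = find_self_link(events)
target = resolve_pointer(lineage, lineage["self_link"])
```

Here the pointer is resolved relative to the lineage object, which is consistent with the /events/0/outputs/0 form shown in the example.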

Facets

Facets allow events to attach additional information to their jobs, runs, inputs, and outputs. There are a number of standard OpenLineage facets defined. Additionally, custom facets can be defined as part of the event, allowing any desired extra information to be included.

Note that the Geoscience Object Service does not validate the contents of facets against any schema.
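
As an illustration, a custom facet could be attached to an event's run as sketched below in Python. This assumes the OpenLineage convention that run facets live under run.facets and that each facet carries _producer and _schemaURL fields. The helper with_run_facet, the facet name example_operator, and the example.com URLs are hypothetical.

```python
import copy


def with_run_facet(event, facet_name, facet_body, producer, schema_url):
    """Return a copy of event with a facet added under run.facets.

    _producer and _schemaURL follow the assumed OpenLineage facet
    convention; facet_body holds the custom fields.
    """
    result = copy.deepcopy(event)
    facets = result["run"].setdefault("facets", {})
    facets[facet_name] = {
        "_producer": producer,
        "_schemaURL": schema_url,
        **facet_body,
    }
    return result


base_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "123e4567-e89b-12d3-a456-426614174000"},
}

event_with_facet = with_run_facet(
    base_event,
    facet_name="example_operator",  # hypothetical custom facet name
    facet_body={"operator": "jane.doe"},
    producer="https://example.com/my-pipeline",
    schema_url="https://example.com/schemas/example_operator.json",
)
```

Because the service does not validate facet contents, any checking of custom facet fields would need to happen on the client before publishing.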