dataset: Model for Data Container

Rationale

A data processing task produces data products that are meant to be shared by other people. When someone receives a data ‘product’ besides datasets s/he woud expect explanation informaion associated with the product.

Many people tend to store data with no note of meaning attached to them. Without attach meaning of the collection of numbers, it is difficult for other people to fully understand or use the data. It could be difficult for even the data producer to recall the exact meaning of the numbers after a while.

This package implements a data product modeled after Herschel Common Software System (v15) products, taking other requirements of scientific observation and data processing into account. The APIs are kept as compatible with HCSS (written in Java, and in Jython for scripting) as possible.

Definitions

Product

A product has
  • zero or more datasets: defining well described data entities (say images, tables, spectra etc…).

  • history of this product: how was this data created,

  • accompanying meta data – required information such as who created this product, what does the data reflect (say instrument) and so on; possible additional meta data specific to that particular product type.

Dataset

Three types of datasets are implemented to store potentially any data as a dataset. Like a product, all datasets may have meta data, with the distinction that the meta data of a dataset is related to that particular dataset only.

array dataset

a dataset containing array data (say a data vector, array, cube etc…) and may have a unit.

table dataset

a dataset containing a collection of columns. Each column contains array data (say a data vector, array, cube etc…) and may have a unit. All columns have the same number of rows. Together they make up the table.

composite dataset

a dataset containing a collection of datasets. This allows arbitrary complex structures, as a child dataset within a composite dataset may be a composite dataset itself and so on…

Metadata and Parameters

Meta data

data about data. Defined as a collection of parameters.

Parameter
named scalar variables.

This package provides the following parameter types:

  • _Parameter_: Contains value (classes whitelisted in ParameterTypes)

  • _NumericParameter_: Contains a number with a unit.

Apart from the value of a parameter you can ask it for its description and -if it is a numeric parameter- for its unit as well.

History

The history is a lightweight mechanism to record the origin of this product or changes made to this product. Lightweight means, that the Product data itself does not records changes, but external parties can attach additional information to the Product which reflects the changes.

The sole purpose of the history interface of a Product is to allow notably pipeline tasks (as defined by the pipeline framework) to record what they have done to generate and/or modify a Product.

Serializability

In order to transfer data across the network between heterogeneous nodes data needs to be serializable. JSON format is used considering to transfer serialized data for its wide adoption, availability of tools, ease to use with Python, and simplicity.

run tests

In the install directory:

make test

You can only test sub-package dataset, pal, pns or pns server self-test only, by changing test above to test1, test2, test3, test4, respectively. To pass command-line arguments to pytest do

make test T='-k Bas'

to test BaseProduct in sub-package dataset.

Design

Packages

../_images/packages_dataset.png

Classes

../_images/classes_dataset.png