Data Containers: Datasets and Metadata

Rationale

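Data products generated by data-processing procedures are meant to be read, understood, and used by others. Many people store data without any note of what the data mean. Without such an explanation, it is difficult for others to fully understand or correctly use a collection of numbers, and even the data producer may struggle to recall their exact meaning after a while. Someone who receives a data "product" therefore expects, besides the dataset itself, explanatory information associated with the product.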

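FDI implements a data product container scheme so that descriptions and other metadata (data about data) are always attached to the "payload" data, and so that your data can carry its context data as lightweight references. Data of scalar, vector, array, and table types can be organized as sequences, mappings, nested structures, and references.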

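FDI is meant to be a small open-source package. Data stored in FDI objects are easily accessible through a Python API and are exported in cross-platform, human-readable JSON (the default format for serialization and storage). Heavier formats (e.g. HDF5) and packages (e.g. iRODS) exist for similar goals. FDI's data model was originally inspired by the products of the Herschel Common Software System (v15), while taking other requirements of scientific observation and data processing into account. The API is kept as compatible as possible with HCSS (written in Java, scripted in Jython).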
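As a rough illustration of the container idea, the sketch below attaches metadata and a context reference to a payload in a plain dictionary and round-trips it through JSON with the standard library. It does not use FDI, and the key names `meta`, `data`, and `context` are assumptions made for this example, not FDI's serialization schema.

```python
import json

# Hypothetical container layout; FDI's real JSON schema differs.
product = {
    "meta": {
        "description": "5 elements",
        "unit": "eV",
    },
    "data": [1, 4.4, 5400.0, -22, 162],
    # a lightweight reference to context data instead of an embedded copy
    "context": {"ref": "calibration-table-1"},
}

# Serialize to human-readable text and read it back.
text = json.dumps(product, indent=2, sort_keys=True)
restored = json.loads(text)
```

Because the metadata travels in the same object as the payload, whoever reads `restored` gets the numbers together with their description and unit.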

Data Containers

Datasets

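Three types of dataset are implemented so that potentially any hierarchical data can be stored as a dataset. Like a product, every dataset may have metadata; the difference is that a dataset's metadata relates only to that particular dataset.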

Array dataset

A dataset containing array data (for example a data vector, array, or cube) that may have a unit and a typecode for efficient storage.
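The typecode `'f'` used in the Quick Start example is also the single-precision float code of Python's standard `array` module. Assuming FDI typecodes follow that convention (an assumption, although the metadata table calls typecode the "Python internal storage code"), the sketch below shows what such a code implies for storage. It uses only the standard library, not FDI.

```python
from array import array

values = [1, 4.4, 5.4e3, -22, 0xa2]

single = array('f', values)   # 'f': C float, typically 4 bytes per element
double = array('d', values)   # 'd': C double, typically 8 bytes per element

bytes_single = single.itemsize * len(single)
bytes_double = double.itemsize * len(double)

# Single precision stores only an approximation of 4.4,
# so the value read back differs slightly from the Python float.
error_single = abs(single[1] - 4.4)
error_double = abs(double[1] - 4.4)
```

A smaller typecode saves space at the cost of precision, which is why it is worth recording alongside the data.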

Example (from the Quick Start page)


>>> # Creation with an array of data quickly
... a1 = [1, 4.4, 5.4E3, -22, 0xa2]
... v = ArrayDataset(a1)
... # Show it. This is the same as print(v) in a non-interactive environment.
... # "Default Meta." means the metadata settings are all default values.
... v
ArrayDataset(shape=(5,). data= [1, 4.4, 5400.0, -22, 162])
>>> # Create an ArrayDataset with some built-in properties set.
... v = ArrayDataset(data=a1, unit='ev', description='5 elements', typecode='f')
... #
... # add some metadata (see more about metadata below)
... v.meta['greeting'] = StringParameter('Hi there.')
... v.meta['year'] = NumericParameter(2020)
... v
ArrayDataset(shape=(5,), description=5 elements, unit=ev, typecode=f, greeting=Hi there., year=2020. data= [1, 4.4, 5400.0, -22, 162])
>>> # data access: read the 3rd array element (index 2)
... v[2]       # 5400
5400.0
>>> # built-in properties
... v.unit
'ev'
>>> # change it
... v.unit = 'm'
... v.unit
'm'
>>> # iteration
... for m in v:
...     print(m + 1)
2
5.4
5401.0
-21
163
>>> # a filter example
... [m**3 for m in v if m > 0 and m < 40]
[1, 85.18400000000003]
>>> # slice the ArrayDataset and only get part of its data
... v[2:-1]
[5400.0, -22]
>>> # set data to be a 2D array
... v.data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
... # slicing happens on the slowest dimension.
... v[0:2]
[[1, 2, 3], [4, 5, 6]]
>>> # Run this to see a demo of the ``toString()`` function:
... # make a 4-D array: a list of 2 lists of 3 lists of 4 lists of 5 elements.
... s = [[[[i + j + k + l for i in range(5)] for j in range(4)]
...       for k in range(3)] for l in range(2)]
... v.data = s
... print(v.toString())
=== ArrayDataset (5 elements) ===
meta= {
===========  ============  ======  =======  =======  =========  ======  =====================
name         value         unit    type     valid    default    code    description
===========  ============  ======  =======  =======  =========  ======  =====================
shape        (2, 3, 4, 5)          tuple    None     ()                 Number of elements in
                                                                         each dimension. Quic
                                                                        k changers to the rig
                                                                        ht.
description  5 elements            string   None     UNKNOWN    B       Description of this d
                                                                        ataset
unit         m                     string   None     None       B       Unit of every element
                                                                        .
typecode     f                     string   None     UNKNOWN    B       Python internal stora
                                                                        ge code.
version      0.1                   string   None     0.1        B       Version of dataset
FORMATV      1.6.0.1               string   None     1.6.0.1    B       Version of dataset sc
                                                                        hema and revision
greeting     Hi there.             string   None                B       UNKNOWN
year         2020          None    integer  None     None       None    UNKNOWN
===========  ============  ======  =======  =======  =========  ======  =====================
MetaData-listeners = ListnerSet{}}
ArrayDataset-dataset =
0  1  2  3  4
1  2  3  4  5
2  3  4  5  6
3  4  5  6  7


1  2  3  4  5
2  3  4  5  6
3  4  5  6  7
4  5  6  7  8


2  3  4  5  6
3  4  5  6  7
4  5  6  7  8
5  6  7  8  9


#=== dimension 4

1  2  3  4  5
2  3  4  5  6
3  4  5  6  7
4  5  6  7  8


2  3  4  5  6
3  4  5  6  7
4  5  6  7  8
5  6  7  8  9


3  4  5  6   7
4  5  6  7   8
5  6  7  8   9
6  7  8  9  10


#=== dimension 4

Table dataset

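A dataset containing a collection of columns keyed by column name. Each column contains an array dataset. All columns have the same number of rows.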
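The two structural rules stated above (columns keyed by name, equal row counts) can be sketched in a few lines of plain Python. `MiniTable` and its method names are hypothetical stand-ins written for this illustration; they are not FDI's `TableDataset` API.

```python
class MiniTable:
    """Hypothetical column-oriented table; not FDI's TableDataset."""

    def __init__(self):
        # column name -> list of values (dicts keep insertion order)
        self.columns = {}

    @property
    def row_count(self):
        # every column has the same length, so any column will do
        return len(next(iter(self.columns.values()))) if self.columns else 0

    def add_column(self, name, data):
        data = list(data)
        if self.columns and len(data) != self.row_count:
            raise ValueError(
                f"column {name!r} has {len(data)} rows, expected {self.row_count}")
        self.columns[name] = data

    def get_row(self, index):
        # a row is assembled by reading the same index from every column
        return tuple(col[index] for col in self.columns.values())


t = MiniTable()
t.add_column('time', [1, 4])
t.add_column('money', [2, 3])
row = t.get_row(1)
```

Storing data by column keeps each column homogeneous, so a unit or typecode can be attached once per column, while rows are built on demand.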

Example (from the Quick Start page)


A TableDataset is mainly a dictionary of named columns and their metadata. Columns are essentially ArrayDatasets under a different name.

>>> # Create an empty TableDataset then add columns one by one
... v = TableDataset()
... v['col1'] = Column(data=[1, 4.4, 5.4E3], unit='eV')
... v['col2'] = Column(data=[0, 43.2, 2E3], unit='cnt')
... v
TableDataset(Default Meta.data= {"col1": Column(shape=(3,), unit=eV. data= [1, 4.4, 5400.0]), "col2": Column(shape=(3,), unit=cnt. data= [0, 43.2, 2000.0])})
>>> # Do it with another syntax, with a list of tuples and no Column()
... a1 = [('col1', [1, 4.4, 5.4E3], 'eV'),
...       ('col2', [0, 43.2, 2E3], 'cnt')]
... v1 = TableDataset(data=a1)
... v == v1
True
>>> # Make a quick tabledataset -- data are list of lists without names or units
... a5 = [[1, 4.4, 5.4E3], [0, 43.2, 2E3], [True, True, False], ['A', 'BB', 'CCC']]
... v5 = TableDataset(data=a5)
... print(v5.toString())
=== TableDataset (UNKNOWN) ===
meta= {
===========  =======  ======  ======  =======  =========  ======  =====================
name         value    unit    type    valid    default    code    description
===========  =======  ======  ======  =======  =========  ======  =====================
description  UNKNOWN          string  None     UNKNOWN    B       Description of this d
                                                                  ataset
version      0.1              string  None     0.1        B       Version of dataset
FORMATV      1.6.0.1          string  None     1.6.0.1    B       Version of dataset sc
                                                                  hema and revision
===========  =======  ======  ======  =======  =========  ======  =====================
MetaData-listeners = ListnerSet{}}
TableDataset-dataset =
  column1    column2  column3    column4
   (None)     (None)  (None)     (None)
---------  ---------  ---------  ---------
      1          0    True       A
      4.4       43.2  True       BB
   5400       2000    False      CCC
>>> # access
... # get names of all columns (automatically given here)
... v5.getColumnNames()
['column1', 'column2', 'column3', 'column4']
>>> # get column by name
... my_column = v5['column1']       # [1, 4.4, 5.4E3]
... my_column.data
[1, 4.4, 5400.0]
>>> # by index
... v5[0].data       # [1, 4.4, 5.4E3]
[1, 4.4, 5400.0]
>>> # get a list of all columns' data.
... # Note the slice "v5[:]" and syntax ``in``
... [c.data for c in v5[:]]   # == a5
[[1, 4.4, 5400.0], [0, 43.2, 2000.0], [True, True, False], ['A', 'BB', 'CCC']]
>>> #  indexOf by name
... v5.indexOf('column1')  # == v5.indexOf(my_column)
0
>>> #  indexOf by column object
... v5.indexOf(my_column)     # 0
0
>>> # set cell value
... v5['column2'][1] = 123
... v5['column2'][1]    # 123
123
>>> # row access by row index -- multiple and in custom order
... v5.getRow([2, 1])  # [(5400.0, 2000.0, False, 'CCC'), (4.4, 123, True, 'BB')]
[(5400.0, 2000.0, False, 'CCC'), (4.4, 123, True, 'BB')]
>>> # or with a slice
... v5.getRow(slice(0, -1))
[(1, 0, True, 'A'), (4.4, 123, True, 'BB')]
>>> # unit access
... v1['col1'].unit  # == 'eV'
'eV'
>>> # add, set, and replace columns and rows
... # column set / get
... u = TableDataset()
... c1 = Column([1, 4], 'sec')
... # add
... u.addColumn('time', c1)
... u.columnCount        # 1
1
>>> # for non-existing names, set is the same as addColumn.
... u['money'] = Column([2, 3], 'eu')
... u['money'][0]    # 2
... # column increases
... u.columnCount        # 2
2
>>> # addRow
... u.rowCount    # 2
2
>>> u.addRow({'money': 4.4, 'time': 3.3})
... u.rowCount    # 3
3
>>> # run this to see ``toString()``
... ELECTRON_VOLTS = 'eV'
... SECONDS = 'sec'
... t = [x * 1.0 for x in range(8)]
... e = [2.5 * x + 100 for x in t]
... d = [765 * x - 500 for x in t]
... # creating a table dataset to hold the quantified data
... x = TableDataset(description="Example table")
... x["Time"] = Column(data=t, unit=SECONDS)
... x["Energy"] = Column(data=e, unit=ELECTRON_VOLTS)
... x["Distance"] = Column(data=d, unit='m')
... # metadata is optional
... x.meta['temp'] = NumericParameter(42.6, description='Ambient', unit='C')
... print(x.toString())
=== TableDataset (Example table) ===
meta= {
===========  =============  ======  ======  =======  =========  ======  =====================
name         value          unit    type    valid    default    code    description
===========  =============  ======  ======  =======  =========  ======  =====================
description  Example table          string  None     UNKNOWN    B       Description of this d
                                                                        ataset
version      0.1                    string  None     0.1        B       Version of dataset
FORMATV      1.6.0.1                string  None     1.6.0.1    B       Version of dataset sc
                                                                        hema and revision
temp         42.6           C       float   None     None       None    Ambient
===========  =============  ======  ======  =======  =========  ======  =====================
MetaData-listeners = ListnerSet{}}
TableDataset-dataset =
   Time    Energy    Distance
  (sec)      (eV)         (m)
-------  --------  ----------
      0     100          -500
      1     102.5         265
      2     105          1030
      3     107.5        1795
      4     110          2560
      5     112.5        3325
      6     115          4090
      7     117.5        4855

Composite dataset

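A dataset containing a collection of datasets. This allows arbitrarily complex structures, because a child dataset within a composite dataset may itself be a composite dataset, and so on.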
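The recursive nesting described above can be pictured with ordinary Python containers. In this sketch, which does not use FDI, dictionaries play the role of composite datasets and lists play the role of array datasets; the function `walk` and all dataset names are made up for the illustration.

```python
def walk(node, path=()):
    """Yield (path, leaf) pairs for every non-dict leaf in a nested dict."""
    if isinstance(node, dict):
        for name, child in node.items():
            # descend into the child composite, extending the path
            yield from walk(child, path + (name,))
    else:
        yield path, node


# Hypothetical composite: a composite inside a composite inside a composite.
composite = {
    'spectrum': [1, 4.4, 5400.0],
    'housekeeping': {
        'temperature': [42.6, 42.7],
        'raw': {'counts': [0, 43, 2000]},
    },
}

leaves = dict(walk(composite))
```

Each leaf is addressed by the sequence of names leading to it, for example `('housekeeping', 'raw', 'counts')`.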

Metadata and Parameters

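FDI datasets and products contain not only data but also their metadata, that is, data about the "payload" data. Metadata is defined as a collection of named parameters.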

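A parameter usually describes one property. For this reason, a parameter in the metadata of a dataset or product is often called a property.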

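Parameter
A scalar or vector variable with attributes.

The following parameter types are available:

  • Parameter: types are defined in metadata.ParameterTypes. On request, a parameter can check its own value, or a given value, against a validity specification, which can be a combination of discrete values, ranges, and bit-masked values.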

  • NumericParameter

  • DateParameter

  • StringParameter

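================  =======================================  ==========================================
Parameter class   Parameter value                          Parameter attributes
================  =======================================  ==========================================
Parameter         typed objects                            description, type, validity descriptor,
                                                           and default value
NumericParameter  a number (scalar), a Vector2D (2D),      all of the above plus a unit and a
                  a Vector (3D), or a Quaternion (4D)      typecode
DateParameter     FineTime date-time                       same as Parameter; type is 'finetime';
                                                           default typecode is a Python datetime
                                                           format string
StringParameter   string                                   same as Parameter; type is 'string';
                                                           default typecode is 'B' (unsigned byte)
================  =======================================  ==========================================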
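A validity specification combines discrete values, ranges, and bit-masked values. The sketch below shows one way such a check could work in plain Python. It is not FDI's implementation: the function `check`, the tagged `(kind, spec, name)` rule layout, and the inclusive range bounds are all assumptions of this example. The rule values are taken from the Quick Start examples.

```python
INVALID = 'Invalid'


def check(value, rules):
    """Return the names of all rules matched by value, or [INVALID] if none.

    rules is a list of (kind, spec, name) tuples. This tagged layout is an
    assumption of the sketch, not FDI's own rule encoding.
    """
    matched = []
    for kind, spec, name in rules:
        if kind == 'value' and value == spec:
            matched.append(name)
        elif kind == 'range':
            low, high = spec
            if low <= value <= high:      # bounds assumed inclusive
                matched.append(name)
        elif kind == 'mask':
            mask, expected = spec
            # position of the lowest set bit of the mask
            shift = (mask & -mask).bit_length() - 1
            # shift the masked bits down before comparing
            if (value & mask) >> shift == expected:
                matched.append(name)
    return matched or [INVALID]


rules = [
    ('value', 0, 'OK1'),
    ('value', 1, 'OK2'),
    ('range', (100, 9900), 'Go!'),
]
status_rules = [
    ('mask', (0b11000000, 0b01), 'port_1'),
    ('mask', (0b11000000, 0b10), 'port_2'),
    ('mask', (0b00100000, 0b0), 'stand_by'),
    ('mask', (0b00100000, 0b1), 'main'),
    ('mask', (0b00010000, 0b0), 'error'),
    ('mask', (0b00010000, 0b1), 'normal'),
]
```

With these rules, `check(600, rules)` matches the range rule, `check(20, rules)` matches nothing, and `check(0b01010110, status_rules)` decodes three independent bit fields from a single status byte.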

Example (from the Quick Start page)


>>> # Creation
... # The standard way -- with keyword arguments
... v = Parameter(value=9000, description='Average age', typ_='integer')
... v.description   # 'Average age'
'Average age'
>>> v.value   # == 9000
9000
>>> v.type   # == 'integer'
'integer'
>>> # test equals.
... # FDI DeepEqual interface class recursively compares all components.
... v1 = Parameter(description='Average age', value=9000, typ_='integer')
... v.equals(v1)
True
>>> # more readable 'equals' syntax
... v == v1
True
>>> # make them not equal.
... v1.value = -4
... v.equals(v1)   # False
False
>>> # math syntax
... v != v1  # True
True
>>> # NumericParameter with two valid values and a valid range.
... v = NumericParameter(value=9000, valid={
...                      0: 'OK1', 1: 'OK2', (100, 9900): 'Go!'})
... # There are three valid conditions
... v
NumericParameter(description="UNKNOWN", type="integer", default=None, value=9000, valid=[[0, 'OK1'], [1, 'OK2'], [[100, 9900], 'Go!']], unit=None, typecode=None, _STID="NumericParameter")
>>> # The current value is valid
... v.isValid()
True
>>> # check if other values are valid according to specification of this parameter
... v.validate(600)  # valid
(600, 'Go!')
>>> v.validate(20)  # invalid
(Invalid, 'Invalid')

Metadata

The class that manages the parameters of datasets and products.

Example (from the Quick Start page)


A MetaData instance is mainly a dict-like container for named parameters.

>>> # Creation. Start with numeric parameter.
... a1 = 'weight'
... a2 = NumericParameter(description='How heavey is the robot.',
...                       value=60, unit='kg', typ_='float')
... # make an empty MetaData instance.
... v = MetaData()
... # place the parameter with a name
... v.set(a1, a2)
... # get the parameter with the name.
... v.get(a1)   # == a2
NumericParameter(description="How heavey is the robot.", type="float", default=None, value=60.0, valid=None, unit="kg", typecode=None, _STID="NumericParameter")
>>> # add more parameter. Try a string type.
... v.set(name='job', newParameter=StringParameter('pilot'))
... # get the value of the parameter
... v.get('job').value   # == 'pilot'
'pilot'
>>> # access parameters in metadata
... # a more readable way to set/get a parameter than "v.set(a1,a2)", "v.get(a1)"
... v['job'] = StringParameter('waitress')
... v['job']   # == waitress
StringParameter(description="UNKNOWN", default="", value="waitress", valid=None, typecode="B", _STID="StringParameter")
>>> # same result as...
... v.get('job')
StringParameter(description="UNKNOWN", default="", value="waitress", valid=None, typecode="B", _STID="StringParameter")
>>> # Date type parameters use International Atomic Time (TAI) to keep time,
... # with 1-microsecond precision
... v['birthday'] = Parameter(description='was born on',
...                           value=FineTime('1990-09-09T12:34:56.789098 UTC'))
... # FDI uses International Atomic Time (TAI) internally to record time.
... # The format is the integer number of microseconds since 1958-01-01 00:00:00 UTC.
... v['birthday'].value.tai
Time zone stripped for 1990-09-09T12:34:56.789098 UTC according to format.
1031574921789098
>>> # names of all parameters
... [n for n in v]   # == ['weight', 'job', 'birthday']
['weight', 'job', 'birthday']
>>> # remove parameter from metadata.   # function inherited from Composite class.
... v.remove(a1)
... v.size()  # == 2
2
>>> # The value of the next parameter is valid from 0 to 31, or can be 99
... valid_rule = {(0, 31): 'valid', 99: ''}
... v['a'] = NumericParameter(
...     3.4, 'rule name, if is "valid", "", or "default", is ommited in value string.', 'float', 2., valid=valid_rule)
... v['a'].isValid()    # True
True
>>> from datetime import datetime, timezone
>>> then = datetime(
...     2019, 2, 19, 1, 2, 3, 456789, tzinfo=timezone.utc)
... # The value of the next parameter is valid from TAI=0 to 9876543210123456
... valid_rule = {(0, 9876543210123456): 'alive'}
... v['b'] = DateParameter(FineTime(then), 'date param', default=99,
...                        valid=valid_rule)
... # display format set to '%Y-%M' (in strftime codes %M is the minute; the month is %m)
... v['b'].format = '%Y-%M'
... # The next parameter allows only the empty string, so its value 'Right' is invalid.
... v['c'] = StringParameter(
...     'Right', 'str parameter. but only "" is allowed.', valid={'': 'empty'}, default='cliche', typecode='B')
>>> # The value of the next parameter is for a detector status.
... # The information is packed in a byte, and is extractable with suitable binary masks:
... # Bit7~Bit6 port status [01: port 1; 10: port 2; 11: port closed];
... # Bit5 processing using the main processor or a stand-by one [0:  stand by; 1: main];
... # Bit4 PPS status [0: error; 1: normal];
... # Bit3~Bit0 reserved.
... valid_rule = {
...     (0b11000000, 0b01): 'port_1',
...     (0b11000000, 0b10): 'port_2',
...     (0b11000000, 0b11): 'port closed',
...     (0b00100000, 0b0): 'stand_by',
...     (0b00100000, 0b1): 'main',
...     (0b00010000, 0b0): 'error',
...     (0b00010000, 0b1): 'normal',
...     (0b00001111, 0b0): 'reserved'
... }
... v['d'] = NumericParameter(
...     0b01010110, 'valid rules described with binary masks', valid=valid_rule)
... # this returns the tested value, the rule name, the height and width of every mask.
... v['d'].validate(0b01010110)
[(1, 'port_1', 8, 2),
 (0, 'stand_by', 6, 1),
 (1, 'normal', 5, 1),
 (Invalid, 'Invalid')]
>>> # string representation. This is the same as v.toString(level=0), most detailed.
... print(v.toString())
========  ====================  ======  ========  ====================  =================  ======  =====================
name      value                 unit    type      valid                 default            code    description
========  ====================  ======  ========  ====================  =================  ======  =====================
job       waitress                      string    None                                     B       UNKNOWN
birthday  1990-09-09T12:34:56.          finetime  None                  None                       was born on
          789098
          1031574921789098
a         3.4                   None    float     (0, 31): valid        2.0                None    rule name, if is "val
                                                  99:                                              id", "", or "default"
                                                                                                   , is ommited in value
                                                                                                    string.
b         alive (2019-02-19T01          finetime  (0, 9876543210123456  1958-01-01T00:00:  Q       date param
          :02:03.456789                           ): alive              00.000099
          1929229360456789)                                             99
c         Invalid (Right)               string    '': empty             cliche             B       str parameter. but on
                                                                                                   ly "" is allowed.
d         port_1 (0b01)         None    integer   11000000 0b01: port_  None               None    valid rules described
          stand_by (0b0)                          1                                                 with binary masks
          normal (0b1)                            11000000 0b10: port_
          Invalid                                 2
                                                  11000000 0b11: port
                                                  closed
                                                  00100000 0b0: stand_
                                                  by
                                                  00100000 0b1: main
                                                  00010000 0b0: error
                                                  00010000 0b1: normal
                                                  00001111 0b0000: res
                                                  erved
========  ====================  ======  ========  ====================  =================  ======  =====================
MetaData-listeners = ListnerSet{}
>>> # simplified string representation, toString(level=1)
... v
job=waitress, birthday=1031574921789098, a=3.4, b=alive (1929229360456789), c=Invalid (Right), d=port_1 (0b01), stand_by (0b0), normal (0b1), Invalid.
>>> # simplest string representation, toString(level=2).
... print(v.toString(level=2))
job=waitress, birthday=FineTime(1990-09-09T12:34:56.789098), a=3.4, b=alive (FineTime(2019-02-19T01:02:03.456789)), c=Invalid (Right), d=port_1 (0b01), stand_by (0b0), normal (0b1), Invalid.
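The TAI values shown in the example above (`1031574921789098` and `1929229360456789`) can be reproduced by counting microseconds since 1958-01-01 and adding the TAI-UTC leap-second offset for the date in question. The sketch below uses only the standard library; the function name `tai_microseconds` is hypothetical, and passing the offset by hand is a simplification assumed here, not a description of how FDI's `FineTime` works internally.

```python
from datetime import datetime, timedelta, timezone

TAI_EPOCH = datetime(1958, 1, 1, tzinfo=timezone.utc)


def tai_microseconds(utc_time, tai_minus_utc_seconds):
    """Microseconds since 1958-01-01 for a UTC time, plus a TAI-UTC offset.

    The caller supplies the leap-second offset valid at utc_time; a full
    implementation would look it up in a leap-second table.
    """
    elapsed = utc_time - TAI_EPOCH + timedelta(seconds=tai_minus_utc_seconds)
    # integer division of timedeltas gives an exact integer count
    return elapsed // timedelta(microseconds=1)


birthday = datetime(1990, 9, 9, 12, 34, 56, 789098, tzinfo=timezone.utc)
then = datetime(2019, 2, 19, 1, 2, 3, 456789, tzinfo=timezone.utc)

birthday_tai = tai_microseconds(birthday, 25)   # TAI-UTC was 25 s during 1990
then_tai = tai_microseconds(then, 37)           # TAI-UTC has been 37 s since 2017
```

Both results agree with the values printed for the `birthday` and `b` parameters, which suggests that the stored integers include leap seconds.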

Running the tests

You can test the sub-packages dataset and utils with test1 and test5, respectively.

In the installation directory:

make test1
make test5

test3 is only for the pns server self-test.

Design

../_images/packages_dataset.png

../_images/classes_dataset.png