HDF5

create HDF5 files

Description

HDF5 is version 5 of a standardized hierarchical data format for storing and managing data. See the HDF Group home page and documentation pages for detailed information on the HDF5 specifications. spec implements a subset of the HDF5 functionality sufficient to archive the type of data that spec acquires.

The hierarchical part of the HDF5 format refers to the tree structure of named objects. Each object in the file belongs to an HDF5 group. Each group has a path in the HDF5 file starting from the root group which is named /. Path name components are separated by slash characters (/).

spec creates HDF5 objects called attributes and datasets. Attributes and datasets are similar in that both have names and both have associated data. The main difference is that attributes are intended to store a small amount of data. In addition, datasets may contain attributes, but attributes cannot contain additional attributes. spec can also create a dataset that contains a hard link (or pointer) to another named object.

Four built-in functions are provided: h5_file() to create or open an HDF5 file, h5_attr() to create attributes, h5_data() to create or add to datasets, and h5_link() to create soft links to objects in external files.

Although spec arrays have just one or two dimensions, it is possible to save a three-dimensional dataset one 2D array at a time. It is also possible to save a two-dimensional dataset one row at a time.
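For example, a stack of 2D detector images can be archived as a single 3D dataset using the "dims" and "frame=" options described later in this help file. The file name, group path and loop bounds below are arbitrary illustrations, not fixed conventions:

```
# Sketch: archive 2D detector images one frame at a time.
ulong array image[512][512]

h5_file("scan42.h5", "open")

# Reserve an unlimited number of 512x512 frames; the bare "chunk"
# keyword lets spec pick chunk sizes of one frame (1:512:512).
# This call also writes the current image as frame 0.
h5_data("/scan42", "frames", image, "dims=-1:512:512", "chunk")

for (i = 1; i < 100; i++) {
        # ... acquire the next image into image[][] ...
        h5_data("/scan42", "frames", image, sprintf("frame=%d", i))
}

h5_file("scan42.h5", "close")
```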

spec supports setting dataset maximum dimensions, setting the chunk sizes (important for efficient reading and writing of large datasets), setting the memory cache size (important for writing data efficiently to the file) and enabling dataset compression.

Linking spec With HDF5

spec needs to be linked with the HDF5 library to enable the capabilities described in this help file. The HDF5 library version should match the version used to build spec (see below).

The spec distribution includes a prebuilt static library archive named libhdf5.a that is the matching version and can be linked with spec during installation. Because spec supports chunk compression, the ZLIB data compression library libz.a is also needed and is also included in the spec distribution.

The static libraries are very large and nearly double the size of the spec executable image. Installing a shared library on the system would be better, but would need to be done by a local system administrator. Complete instructions for downloading and installing HDF5 are available by following the appropriate links on the HDF Group home page.

The spec Install script includes the following section for configuring the HDF5 libraries:

Choices for HDF5 libraries are:

1) no - not using HDF5
2) libhdf5.a libz.a - use static libraries
3) -lhdf5 -lz - use system libraries (possibly shared)
4) libhdf5.a -lz - use static hdf5 and system libz
5) other - enter library arguments

The location of the HDF5 libraries. Include to enable
use of the HDF (Hierarchical Data Format) data output
commands. Note, library version should match spec build.

Choose HDF5 libraries (no)?

In the current spec release, the default is not to link with the HDF5 libraries, as the HDF5 format is not yet widely used and linking with the static libraries greatly expands spec's memory footprint. Choice 2 enables linking with the included static libraries. Choice 3 will use the system libraries, as long as the libraries are in a standard place.

Sources corresponding to the libraries included in the current spec distribution can be downloaded at certif.com/downloads/extras/hdf5.tgz and certif.com/downloads/extras/zlib.tgz.

If spec is not linked with the matching HDF5 library version, one can set the environment variable HDF5_DISABLE_VERSION_CHECK to a value of 1 or greater to enable use of the functions. If the value is 1, the HDF5 libraries will produce a lengthy warning message on the first call of an HDF5 function. If the value is 2 or greater, the message from the libraries will be suppressed, although spec will print a one-time short warning message. CSS cannot predict the results of using a mismatched version. Note, the version matching is done on the major and minor version numbers. For example, versions 1.8.13 and 1.8.14 differ only in release number, so mixing them will not produce a warning message.
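As an illustration, the variable can be set on the command line when starting spec (the value 2 here suppresses the library's long warning while keeping spec's short notice):

```
# Run spec against a mismatched HDF5 library version:
HDF5_DISABLE_VERSION_CHECK=2 spec
```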

Mapping of spec Data Types to HDF5 Data Types

spec has three categories of data: scalars, associative arrays and data arrays. Values for scalar and associative array elements are stored in spec as either strings or double-precision floating point numbers. Data array values match the declared type of the data array.

By default, the h5_attr() and h5_data() functions will use an HDF5 data type that corresponds to the spec data type. An optional argument to the function can specify a particular HDF5 data type as follows:

"byte" H5T_NATIVE_INT8
"ubyte" H5T_NATIVE_UINT8
"short" H5T_NATIVE_INT16
"ushort" H5T_NATIVE_UINT16
"long" H5T_NATIVE_INT32
"ulong" H5T_NATIVE_UINT32
"long64" H5T_NATIVE_INT64
"ulong64" H5T_NATIVE_UINT64
"float" H5T_NATIVE_FLOAT
"double" H5T_NATIVE_DOUBLE
"string" H5T_C_S1 with H5T_VARIABLE

If the "string" option is used to convert numbers to strings, the numbers will be formatted using a "%.15g" specification. Each row of a string data array is written as a fixed-length string.
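For instance, a value can be stored with a specific HDF5 type by adding one of the keywords above as an extra argument. The file layout and names here are illustrative only:

```
# Store a scalar as a 16-bit unsigned integer rather than the
# default double-precision type:
h5_data("/scan", "npoints", 100, "ushort")

# Store a numeric attribute as a string, formatted with "%.15g":
h5_attr("/scan", "wavelength", 1.5418, "string")
```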

Writing spec Associative Arrays

In general, spec associative arrays can be one- or two-dimensional. The array index values are stored as strings, although the strings can be representations of numbers. spec can only save associative arrays with non-negative integer indexes to HDF5 files. spec will dimension the HDF5 array data space to the largest index in each dimension of the associative array. If an associative array has only one element with an index of 1000, the HDF5 array will have space allocated for 1001 elements. Missing values will be written to the HDF5 file as zeroes. The best practice is to save only associative arrays that have consecutive integer indexes starting at zero.

Before writing the data, spec scans the associative array to find the maximum index dimension and to determine if the elements are all numbers or if one or more elements is a string. If any element of the array is not a number, all elements will be saved as strings.

When saving values as numbers, the data type can be specified with an optional argument, as described above.
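A minimal sketch of the recommended usage, an associative array with consecutive integer indexes starting at zero (the names are arbitrary):

```
# Build an associative array indexed 0 through 4:
for (i = 0; i < 5; i++)
        counts[i] = 100 * i

# Saved as a five-element numeric dataset; the optional "long"
# keyword selects H5T_NATIVE_INT32 rather than the default double:
h5_data("/scan", "counts", counts, "long")
```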

Writing spec Data Arrays

spec data arrays are one- or two-dimensional storage of fixed size and data type, indexed by integers starting at zero. spec allocates data space in the HDF5 file to fit the entire array, unless options described in the next section are used to reserve additional space. By default, data is written using the declared array data type. The data type can be converted using one of the options described above.

For two-dimensional arrays, spec will allocate rows and columns in an HDF5 data space to match the array declaration. For a one-dimensional array (including a single row or single column subarray), the default is to write the data as a single row. However, if the "row_wise" option has been explicitly set using array_op() for the parent array (see the arrays help file), or as an optional argument to h5_attr() or h5_data(), the data will be saved as a column array. A one-dimensional array declared explicitly as a column array, for example, arr[20][1] will also be saved as a column array.

Space for spec string data arrays is allocated for fixed length strings matching the storage size of the arrays. Values for each row of the string array are written as strings to the HDF5 file.
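The row/column behavior for one-dimensional arrays can be sketched as follows (illustrative names):

```
float array onedim[20]

# A 1D data array is written as a single row by default:
h5_data("/scan", "as_row", onedim)

# With the "row_wise" option it is written as a column instead:
h5_data("/scan", "as_col", onedim, "row_wise")

# An array declared explicitly with one column is also saved
# as a column:
float array colarr[20][1]
h5_data("/scan", "also_col", colarr)
```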

Additional Options For Data Arrays with h5_data()

In addition to the data-type arguments and "row_wise" argument discussed above, the following additional arguments can be used with h5_data() when saving data arrays.

"dims=[frames:]rows:columns"
Sets the maximum dimensions of the array for arrays that are written one frame or one row at a time. The first dimension can be -1 to set the dimension as unlimited. The current dimensions of the dataset will be set to match the dimensions of the data array passed to the h5_data() function.
"chunk=[frames:]rows:columns"
Sets the dimensions of the contiguous blocks of data in the HDF5 file. The values have a significant impact on the efficiency for reading and writing data. Chunk values must be set in order to enable data compression or to be able to add frames or rows to existing datasets. If the keyword "chunk" is used with no values, spec will set the first chunk dimension to 1 and the other dimensions to match the size of a single frame or row of the data array. Consult HDF5 documentation for suggestions on chunk optimization.
"cache=bytes[[:slots]:policy]"
Sets the memory cache parameters for the raw data chunks. bytes is the total number of bytes in the cache. The default is 1 MB per dataset. slots is the number of chunk slots in the raw data cache. This value should be a prime number to optimize the hashing algorithm. The default value is 521. policy is the chunk preemption policy and can be any value from zero to one. The default is 0.75. See the HDF5 documentation for a full explanation.

Setting the cache parameters with h5_data() only applies to the current function call. To set the parameters for all datasets written to a file, set the option with h5_file().
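For example, a larger cache can be requested for every dataset written to a file by passing the option when the file is opened. The sizes below are illustrative, not recommendations:

```
# Use a 4 MB chunk cache with 2003 slots (a prime number) and the
# default 0.75 preemption policy for all datasets written while
# this file remains open:
h5_file("big_data.h5", "open", "cache=4194304:2003:0.75")
```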

"frame=#" or simply #
For adding to existing datasets that have the first dimension set as unlimited (see above). The value # specifies the frame number when adding a 2D array to a 3D dataset, the row number when adding a 1D array to a 2D dataset or the point number when adding a scalar value to a 1D dataset.
"gzip[=deflation_factor]"
Enables compression for the dataset. If the "chunk" option is missing, spec will enable chunking and set the dimensions automatically as described above. Valid deflation_factor values are from 1 to 9, with 9 being maximum compression. 9 is also the default if no value is specified.

If the "dims" or "gzip" options are used, but the "chunk" values are not set, spec will automatically set the chunk sizes as specified above.
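The options above can be combined to grow a compressed 2D dataset one row at a time. This is a sketch with arbitrary names and sizes:

```
# Build a 2D dataset row by row with gzip compression.
float array row[100]

# "dims=-1:100" makes the number of rows unlimited; since no
# "chunk" option is given, spec sets the chunk sizes to one row
# (1:100) automatically.  This call writes row 0.
h5_data("/scan", "mca", row, "dims=-1:100", "gzip=6")

for (i = 1; i < 50; i++) {
        # ... fill row[] with the next spectrum ...
        # A bare number is equivalent to "frame=#":
        h5_data("/scan", "mca", row, i)
}
```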

Group Paths

The first argument to the h5_attr() and h5_data() functions is the path to the object. For each open HDF5 file, spec maintains a current-path value. When the file is first opened the current path is /. After each call to h5_attr() or h5_data(), the current path is set to the first argument of the functions. An argument "." refers to the current path. The path argument can also use the string ".." to refer to the parent group.
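The current-path bookkeeping can be illustrated as follows (group and attribute names are arbitrary):

```
# After this call, the current path is /scan/detector:
h5_data("/scan/detector", "distance", 1.2)

# "." refers to the current path, here /scan/detector:
h5_attr(".", "units", "meters")

# ".." refers to the parent group, here /scan:
h5_attr("..", "title", "example scan")
```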

HDF5 Error Messages and Debugging

The built-in variable HDF5_ERROR_MODE can be set to tune the verbosity level of the errors generated by the HDF5 library. The HDF5 library generates an error stack, where each function in the library adds a message to the stack when any function below it returns an error. One can choose to display just the error description associated with the function that generated the error, or the entire stack with varying degrees of detail.

1 - Display error description from bottom of error stack
2 - Display descriptions from entire error stack
3 - Include source file and line number of error
4 - Include major and minor error code texts

A value less than the minimum or greater than the maximum will be treated as the minimum or maximum, respectively.

Adding 0x01000000 to the spec debug level enables spec-generated HDF5 debugging messages.
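For example, assuming the built-in DEBUG variable holds the current spec debug level, the two facilities can be enabled as follows:

```
# Show the full HDF5 error stack with source file and line numbers:
HDF5_ERROR_MODE = 3

# Add spec-generated HDF5 debugging messages to the debug level:
DEBUG = DEBUG | 0x01000000
```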

Built-in Functions

The following functions are available. Other than those that return the current file, current group or type of object, all functions return true (1) for success and false (0) otherwise.

h5_file(filename, "open" [,"cache=bytes[[:slots]:policy]"])
Opens filename if it exists or creates it if it doesn't. The HDF5 file remains open across calls, although all objects and resources created by h5_attr() and h5_data() are closed and released when those functions return. The optional argument sets the raw data chunk cache parameters for all datasets saved to the file as long as it remains open. The meaning of the parameters is explained above.
h5_file(filename, "flush")
Tells the HDF5 library to write buffered data for filename out to disk.
h5_file(filename, "close")
Closes filename.
h5_file(filename)
If more than one HDF5 file is open, makes filename the active one. If no HDF5 file is open, attempts to open filename. Returns true (1) for success.
h5_file()
Returns the name of the currently active HDF5 file, or the null string if no file is open.
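A typical file lifecycle, using arbitrary file and group names:

```
# Open (or create) the file and write something to it:
h5_file("data.h5", "open")
h5_data("/run1", "npoints", 100)

# Force buffered data out to disk, then close:
h5_file("data.h5", "flush")
h5_file("data.h5", "close")
```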
h5_attr(group_path, attribute_name, attribute_value [, option ...])
Creates a new attribute with the name attribute_name and the value attribute_value. The options can be a spec data type as in the table above.
h5_data(group_path, dataset_name, dataset_value [, option ...])
Creates a new dataset with the name dataset_name and the value dataset_value. The options can be a spec data type, as in the table above. Additional options include "dims=[frames:]rows:columns", "chunk=[frames:]rows:columns", "cache=bytes[[:slots]:policy]" and "gzip[=deflation_factor].
h5_data(group_path, dataset_name, dataset_value, "frame=#"|#)
Adds dataset_value to the existing dataset at the specified frame or row number.
h5_data(group_path, dataset_name, dataset_link, "link")
Creates a new object named dataset_name that is a hard link (or pointer) to the existing object dataset_link.
h5_attr() or h5_data()
With no arguments, either function returns the path of the current group.
h5_attr(group_path) or h5_data(group_path)
Sets the current group to group_path. A call such as h5_attr("..") moves the current group up one level.
h5_attr("name", "type") or h5_data("name", "type")
Returns the string "group", "dataset" or "attribute" to indicate whether the first argument exists in the current HDF5 file and what kind of thing it is. The argument can be a complete path to the object or an object in the current path. A null string is returned if the object "name" doesn't exist. For this usage, h5_attr() and h5_data() are interchangeable. This function does not change the current group or dataset.
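For instance, a macro can check what an existing name is before using it (the paths are illustrative):

```
# Query the type of an existing object:
if (h5_data("/scan/counts", "type") == "dataset")
        print "counts already exists as a dataset"

# A null string means no such object exists:
if (h5_attr("/scan/extra", "type") == "")
        print "no object named extra"
```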
h5_link(group_path, link_name, target_file, target_object)
Creates a soft link named link_name pointing to the specified target object in the target file.
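A sketch of linking to an object in another file (all names are arbitrary examples):

```
# In group /scan of the current file, create a soft link named
# "data" that points to /entry/detector/counts in ext_file.h5:
h5_link("/scan", "data", "ext_file.h5", "/entry/detector/counts")
```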

Built-in Symbols

The following built-in symbols are available:

HDF5_VERSION
Contains the HDF5 version used when compiling spec, as in "1.8.13". Note, its value will not be set until the first call of one of the above functions.
HDF5_ERROR_MODE
Sets the verbosity level of error messages. Valid values are 1 through 4. Higher levels are useful for debugging calls of the above functions in conjunction with the HDF5 library source code. The default value is 1.