Data storage with qudi

Qudi provides data storage objects that can be imported from qudi.util.datastorage for saving and loading (measurement) data. There is an object for each supported data storage format, which currently includes:

TextDataStorage for text files
CsvDataStorage for csv files (specialized text file)
NpyDataStorage for numpy binary files (.npy)

More storage formats may be supported in the future (e.g. SQL databases or HDF5 files), so check qudi.util.datastorage for any objects not listed in this documentation.
All these objects are derived from the abstract base class qudi.util.datastorage.DataStorageBase, which loosely defines a generalized API for all storage classes and handles global metadata.
Any new storage format class must inherit from this base class.

The most important API methods that each specialized sub-class must implement are:

def save_data(self, data, *, metadata=None, notes=None, nametag=None, timestamp=None, **kwargs):
    # Save data to appropriate format
    pass

def load_data(self, *args, **kwargs):
    # Load data and metadata and return it
    pass

The exact method signatures with additional keyword-only arguments can differ between storage classes and can be looked up individually.
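
To illustrate the shape of this API, here is a minimal sketch of a storage class that writes JSON files. It rests on several assumptions: StorageBaseStandIn is a hypothetical stand-in for DataStorageBase so that the snippet runs without qudi, and JsonDataStorage, its file naming and its return values are invented for this example and are not part of qudi.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from datetime import datetime


class StorageBaseStandIn(ABC):
    """Hypothetical stand-in for DataStorageBase (NOT the qudi class)."""

    @abstractmethod
    def save_data(self, data, *, metadata=None, notes=None, nametag=None,
                  timestamp=None, **kwargs):
        pass

    @abstractmethod
    def load_data(self, *args, **kwargs):
        pass


class JsonDataStorage(StorageBaseStandIn):
    """Illustrative storage class; name and behaviour are assumptions."""

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def save_data(self, data, *, metadata=None, notes=None, nametag=None,
                  timestamp=None, **kwargs):
        timestamp = datetime.now() if timestamp is None else timestamp
        name = timestamp.strftime('%Y%m%d-%H%M-%S')
        if nametag:
            name = f'{name}_{nametag}'
        file_path = os.path.join(self.root_dir, f'{name}.json')
        content = {'timestamp': timestamp.isoformat(),
                   'notes': notes,
                   'metadata': {} if metadata is None else metadata,
                   'data': data}
        with open(file_path, 'w') as file:
            json.dump(content, file)
        return file_path, timestamp

    def load_data(self, file_path):
        with open(file_path, 'r') as file:
            content = json.load(file)
        return content['data'], content['metadata']


# Usage: save a small nested list and read it back
storage = JsonDataStorage(root_dir=tempfile.mkdtemp())
path, _ = storage.save_data([[0.0, 1.0], [0.5, 2.0]],
                            metadata={'sample_number': 42},
                            nametag='demo')
loaded_data, loaded_metadata = storage.load_data(path)
```

The keyword-only arguments mirror the save_data signature shown above; a real subclass would additionally rely on whatever DataStorageBase provides for global metadata.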

Before you can start saving or loading data arrays with the methods mentioned above, you need to instantiate and configure the storage object appropriately. Each specialized storage object can provide an entirely different set of initialization parameters. You can look up the configuration options for a specific storage object in the docstring of the __init__ method of the respective class.

So the first step before loading and saving data arrays is always to create an instance of the desired storage object.

Here is an example for storing text files that uses a common subset of the available __init__ parameters to initialize the storage object:

from qudi.util.datastorage import TextDataStorage, ImageFormat

# Instantiate text storage object and configure it
data_storage = TextDataStorage(root_dir='C:\\Data\\MyMeasurementCategory',
                               comments='# ', 
                               delimiter='\t',
                               file_extension='.dat',
                               column_formats=('.8f', '.15e'),
                               include_global_metadata=True,
                               image_format=ImageFormat.PNG)

Let's go through the parameters one-by-one:

root_dir: The root or working directory for the storage class to work in. Files will be saved into this dir.
comments: String used at the start of lines in the text file to identify them as comment lines.
delimiter: Delimiter string used to separate data columns. Must be non-empty.
file_extension: The default file extension to use for new data files. Used if no explicit file name is provided.
column_formats: Sequence of format specifiers for each column or a single specifier for all columns. If None (default) the column format is derived from the first data row. See also the Python format specification mini-language.
include_global_metadata: Flag indicating if global metadata should be automatically included when saving data.
image_format: The image format used to save matplotlib figures to file using storage method save_thumbnail.
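
The effect of column_formats and delimiter can be tried out with Python's built-in format function. The following sketch only illustrates the format specification mini-language; format_row is a hypothetical helper and not necessarily how TextDataStorage writes rows internally.

```python
def format_row(row, column_formats, delimiter='\t'):
    """Format one data row using one format specifier per column."""
    if isinstance(column_formats, str):
        # A single specifier applies to all columns
        column_formats = [column_formats] * len(row)
    return delimiter.join(format(value, spec)
                          for value, spec in zip(row, column_formats))


line = format_row((0.001001001, 0.0125), ('.8f', '.15e'))
print(repr(line))  # '0.00100100\t1.250000000000000e-02'
```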

Storage location

Generally you have to set the root_dir parameter for (file-based) storage objects before saving or loading any data.

For your convenience each qudi module (GUI, logic or hardware) has an attribute module_default_data_dir containing a standardized generic data directory. This directory respects the global config options default_data_dir and daily_data_dirs and adds a module-specific sub-directory. If applicable, you should always use this attribute to set root_dir in storage objects used by a qudi logic module.
By default this path resolves to: <user home>/qudi/Data/<YYYY>/<MM>/<YYYYMMDD>/<configured module name>

If you really need to customize the storage location on a per-module basis, override module_default_data_dir in the module class definition so that the custom path remains accessible from outside the module. By default, all file-based data is stored in daily sub-directories of the qudi data directory (<user home>/qudi/Data/ unless changed via the global config parameter default_data_dir).

Standalone scripts that use the qudi data storage objects obviously do not need to follow any convention and can customize root_dir however they like.
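
For such a standalone script, the default directory layout described above can be reproduced with the standard library. This is a sketch under assumptions: daily_data_dir is a hypothetical helper, not a qudi function, and it ignores the daily_data_dirs config option.

```python
from datetime import datetime
from pathlib import Path


def daily_data_dir(module_name, timestamp=None, data_root=None):
    """Build <data root>/<YYYY>/<MM>/<YYYYMMDD>/<module name>."""
    timestamp = datetime.now() if timestamp is None else timestamp
    if data_root is None:
        data_root = Path.home() / 'qudi' / 'Data'
    return (Path(data_root)
            / timestamp.strftime('%Y')
            / timestamp.strftime('%m')
            / timestamp.strftime('%Y%m%d')
            / module_name)


path = daily_data_dir('my_logic', datetime(2021, 5, 6, 11, 11, 11))
print(path)
```

The resulting path could then be passed as root_dir when instantiating a storage object.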

Saving data

The method save_data is used to store data in the desired format once the storage object has been initialized.
In the text file example from above this could look like:

import numpy as np
from datetime import datetime

# Create example data
x = np.linspace(0, 1, 1000)  # 1 sec time interval
y = np.sin(2 * np.pi * 2 * x)  # 2 Hz sine wave
data = np.asarray([x, y]).transpose()  # Format data into a single 2D array with x being the first 
                                       # column and y being the second column
                                       
# Prepare a dict containing metadata to be saved in the file header
metadata = {'sample_number': 42,
            'batch'        : 'xyz-123'}

# Create an explicit timestamp.
timestamp = datetime(2021, 5, 6, 11, 11, 11)  # 06.05.2021 at 11h:11m:11s
# timestamp = datetime.now()  # Usually you would use this

# Create a nametag to include in the file name (optional)
nametag = 'amplitude_measurement'

# Create an iterable of data column header strings (optional)
column_headers = ('time (s)', 'amplitude (V)')

# Create an arbitrary string of informal "lab notes" that is included in the file header
notes = 'This measurement was performed under the influence of 10 mugs of coffee and no sleep.'

# Save data to file
file_path, timestamp, (rows, columns) = data_storage.save_data(data, 
                                                               timestamp=timestamp, 
                                                               metadata=metadata, 
                                                               notes=notes,
                                                               nametag=nametag,
                                                               column_headers=column_headers,
                                                               column_dtypes=(float, float))

This will save the data to a file with a generic filename constructed from nametag and timestamp, i.e. <default_data_dir>/2021/05/20210506/20210506-1111-11_amplitude_measurement.dat, with the following content:

# [General]
# timestamp=2021-05-06T11:11:11
# comments='# '
# delimiter='\t'
# column_dtypes=float;;float
# column_headers='time (s);;amplitude (V)'
# notes='This measurement was performed under the influence of 10 mugs of coffee and no sleep.'
# 
# [Metadata]
# sample_number=42
# batch='xyz-123'
# 
# ---- END HEADER ----
0.00000000	0.000000000000000e+00
0.00100100	1.257861783874106e-02
0.00200200	2.515524538937585e-02
⋮ 				⋮

NOTE: metadata keys must be of type str, must not contain leading or trailing whitespace and must avoid the pattern '[...]'.
NOTE: metadata values must be representable and reconstructable via repr and eval, i.e. value == eval(repr(value)).
NOTE: If column dtypes are explicitly given (as in the example), they must be one of int, float, complex or str. This will become important when loading back mixed data from disk. If column_dtypes is None (default) the dtypes will be automatically derived from the first data row.
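
The metadata rules from the notes above can be checked before saving. The following is a sketch only: check_metadata is a hypothetical helper, and interpreting the pattern '[...]' as "any text enclosed in square brackets" is an assumption made for this example.

```python
import re


def check_metadata(metadata):
    """Return a list of problems found in a metadata dict (empty if none)."""
    problems = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append(f'key {key!r} is not a str')
        elif key != key.strip():
            problems.append(f'key {key!r} has leading/trailing whitespace')
        elif re.search(r'\[.*\]', key):
            # Assumption: any bracketed text counts as the pattern '[...]'
            problems.append(f'key {key!r} contains the pattern [...]')
        try:
            round_trip_ok = eval(repr(value)) == value
        except Exception:
            round_trip_ok = False
        if not round_trip_ok:
            problems.append(f'value of {key!r} does not survive eval(repr(value))')
    return problems


good = check_metadata({'sample_number': 42, 'batch': 'xyz-123'})
bad = check_metadata({' user': 'Batman', 'handle': object()})
print(good)
print(bad)
```

The first dict passes all checks, while the second one is reported for a key with leading whitespace and for a value whose repr cannot be evaluated back.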

Alternatively it is also possible to specify the filename directly instead of relying on the generic construction from nametag and timestamp:

# Save data to file
file_path, timestamp, (rows, columns) = data_storage.save_data(data, 
                                                               timestamp=timestamp,
                                                               metadata=metadata,
                                                               notes=notes,
                                                               column_headers=column_headers, 
                                                               column_dtypes=(float, float),
                                                               filename='my_custom_filename.abc')

This would result in a file at <default_data_dir>/2021/05/20210506/my_custom_filename.abc.
Please note that you need to provide the file extension as well in this case.
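
For comparison, the generic file name from the first example can be reproduced with datetime.strftime. This is a sketch inferred from the example file name above; generic_filename is a hypothetical helper, and the actual construction inside qudi may differ in detail.

```python
from datetime import datetime


def generic_filename(timestamp, nametag=None, file_extension='.dat'):
    """Mimic the <YYYYMMDD>-<HHMM>-<SS>_<nametag><extension> pattern."""
    name = timestamp.strftime('%Y%m%d-%H%M-%S')
    if nametag:
        name = f'{name}_{nametag}'
    return name + file_extension


filename = generic_filename(datetime(2021, 5, 6, 11, 11, 11),
                            nametag='amplitude_measurement')
print(filename)  # 20210506-1111-11_amplitude_measurement.dat
```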

Saving a thumbnail

In order to save a thumbnail alongside the data file, you can create a matplotlib figure and pass it to the data storage method save_thumbnail.

save_thumbnail expects a full file path without file extension (this is automatically completed according to the configured image_format enum).
Usually you want your thumbnail file name to be the same as your data file name. An easy way to achieve that is to remove the file extension from the first return value of save_data and pass it to save_thumbnail.

To continue our example with text files, this could look like:

import matplotlib.pyplot as plt

# Create figure and plot data
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y)
ax.set_xlabel('time (s)')
ax.set_ylabel('amplitude (V)')

# Save figure as thumbnail with the same file name as the corresponding data file
# Split only at the last dot so that dots in directory names are preserved
figure_path = data_storage.save_thumbnail(fig, file_path.rsplit('.', 1)[0])

This example creates the file: <default_data_dir>/2021/05/20210506/20210506-1111-11_amplitude_measurement.png
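
When removing the file extension, care is needed if the path contains dots elsewhere, e.g. in a directory name. A small standard-library sketch (the example path is made up for illustration):

```python
import os

example_path = os.path.join('data', 'run.v2',
                            '20210506-1111-11_amplitude_measurement.dat')

# splitext only strips the final extension and leaves dots in directory
# names untouched
thumbnail_base = os.path.splitext(example_path)[0]
print(thumbnail_base)
```

The resulting string could be passed as the second argument to save_thumbnail.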

Loading data

All storage objects provide means to load data and the corresponding metadata back from disk.

ToDo: COMPLETE THIS SECTION
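
Independently of load_data, the text file layout shown in the saving example can also be read with plain Python. The following is a sketch only: read_text_data is a hypothetical helper, it does not interpret the header sections, and it is not the qudi loading implementation.

```python
def read_text_data(lines, comments='# ', delimiter='\t'):
    """Split lines into header lines and rows of floats."""
    marker = comments.strip()
    header, rows = [], []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        if line.startswith(marker):
            # Comment line: keep its content without the comment marker
            header.append(line[len(marker):].strip())
        else:
            rows.append([float(item) for item in line.split(delimiter)])
    return header, rows


example = ['# [General]\n',
           '# timestamp=2021-05-06T11:11:11\n',
           '# ---- END HEADER ----\n',
           '0.00000000\t0.000000000000000e+00\n',
           '0.00100100\t1.257861783874106e-02\n']
header, rows = read_text_data(example)
print(header)
print(rows)
```

For a real file, the lines argument could simply be an open file object.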

Global metadata

Global metadata is stored as a class attribute of DataStorageBase and is automatically included by all data storage objects until it is actively removed again. Modules that add global metadata are therefore responsible for robust and safe cleanup afterwards.

The global metadata is a dict and is handled exactly like the metadata keyword-only parameter of the save_data method, except that it does not need to be given each time data is saved and it applies to all data storage instances throughout the process. You can combine global metadata and locally provided metadata. The latter always takes precedence if a key is present in both dicts.
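
The precedence rule can be pictured as a plain dict merge. This sketch only illustrates the described behaviour and makes no claim about how qudi merges the dicts internally:

```python
global_metadata = {'user': 'Batman', 'sample_number': 1}
local_metadata = {'sample_number': 42, 'batch': 'xyz-123'}

# Later entries win, so local values override global ones
combined = {**global_metadata, **local_metadata}
print(combined)
# {'user': 'Batman', 'sample_number': 42, 'batch': 'xyz-123'}
```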

Adding global metadata

You can add global metadata key-value pairs by using the storage object class method <storage_class>.add_global_metadata. In our example from above this would look like:

# Create global metadata to ADD to the global metadata dict
global_meta = {'user': 'Batman'}

# Add metadata in a thread-safe way to ALL data storage objects 
data_storage.add_global_metadata(global_meta, overwrite=False)

# This would have the same effect
from qudi.util.datastorage import DataStorageBase
DataStorageBase.add_global_metadata(global_meta)
# ...or this
from qudi.util.datastorage import NpyDataStorage
NpyDataStorage.add_global_metadata(global_meta)

# You can also add a single key-value pair like this:
data_storage.add_global_metadata('frustration_level', 9000, overwrite=False)

Note the keyword-only overwrite parameter. If this flag is set to False (default) the method will raise a KeyError if any metadata keys to set are already present in the global metadata dict. If it is set to True this method will silently overwrite any key-value pairs.
It is highly recommended to use the default value (False) whenever possible in order to avoid hard-to-track bugs when two threads (i.e. qudi logic modules) use the same metadata keys.
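
The overwrite semantics described above can be modelled with a class-level dict guarded by a lock. GlobalMetadataStandIn is a hypothetical stand-in written for this illustration; it is not the qudi implementation and only mimics the documented behaviour.

```python
import threading


class GlobalMetadataStandIn:
    """Hypothetical stand-in, NOT the qudi implementation."""
    _lock = threading.RLock()
    _global_metadata = {}

    @classmethod
    def add_global_metadata(cls, name, value=None, *, overwrite=False):
        # Accept either a dict or a single key-value pair
        new_items = dict(name) if isinstance(name, dict) else {name: value}
        with cls._lock:
            if not overwrite:
                duplicates = set(new_items) & set(cls._global_metadata)
                if duplicates:
                    raise KeyError(f'Keys already present: {sorted(duplicates)}')
            cls._global_metadata.update(new_items)

    @classmethod
    def get_global_metadata(cls):
        with cls._lock:
            return cls._global_metadata.copy()


GlobalMetadataStandIn.add_global_metadata({'user': 'Batman'})
try:
    GlobalMetadataStandIn.add_global_metadata('user', 'Robin')
    raised = False
except KeyError:
    raised = True
GlobalMetadataStandIn.add_global_metadata('user', 'Robin', overwrite=True)
print(raised, GlobalMetadataStandIn.get_global_metadata())
```

Because the dict lives on the class, every instance and subclass sees the same entries, which is why a key collision between two modules is worth failing loudly on.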

Removing global metadata

Always make sure the entity that added the global metadata also removes it as soon as it is no longer relevant. For example, the on_deactivate method of a qudi logic module would be a good place to remove any global metadata that has been added by the same module.
You can remove metadata using the storage object class method <storage_class>.remove_global_metadata, e.g. like:

# to remove a single key-value pair
data_storage.remove_global_metadata('user')

# or if you want to remove multiple key-value pairs with one call
data_storage.remove_global_metadata(['user', 'frustration_level'])

Reading global metadata

You can get a shallow copy of the global metadata dict via:

metadata = data_storage.get_global_metadata()

Since the returned dict is only a shallow copy of the actual global metadata dict, avoid mutating any of its values unless you are very sure what you are doing.
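
The pitfall of a shallow copy is easy to demonstrate with plain dicts. This sketch does not use qudi; the dict contents are made up for illustration:

```python
import copy

global_metadata = {'user': 'Batman', 'tags': ['cold', 'dark']}

shallow = global_metadata.copy()
shallow['user'] = 'Robin'        # rebinding a key only affects the copy
shallow['tags'].append('noisy')  # mutating a value affects the original too

# A deep copy is fully independent of the original
independent = copy.deepcopy(global_metadata)
independent['tags'].append('only in the deep copy')

print(global_metadata)
# {'user': 'Batman', 'tags': ['cold', 'dark', 'noisy']}
```

If mutable values need to be modified, working on a deep copy of the returned dict is the safer choice.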

Logging Data

Another common use case, instead of dumping an entire data set at once, is saving one chunk of data (or a single entry) at a time by appending to an already created file or database. This could for example be useful for a data logger.

In order to do this, TextDataStorage and CsvDataStorage have additional API methods new_file and append_file.

new_file accepts the same keyword-only arguments as save_data and creates a new data file containing only the file header. The only difference is an additional keyword-only parameter dtype, for which you should provide a numpy dtype since it cannot be derived from a data array in this case (float is assumed by default).

The created file can then be extended with single or multiple rows of data using append_file (you can also append to files created by save_data).

An example:

# Create data file with the same variables as in the save_data example above
file_path, timestamp = data_storage.new_file(timestamp=timestamp,
                                             metadata=metadata,
                                             notes=notes,
                                             nametag=nametag,
                                             column_headers=column_headers,
                                             column_dtypes=(float, float))

# Append each row of the previously created data array one after the other
for data_row in data:
    data_storage.append_file(data_row, file_path)

# You can also append a chunk of multiple rows at once
data_storage.append_file(data[:10], file_path)

NOTE: Appending to files like this is far less efficient than writing a single chunk of data at once, because each call to append_file has the overhead of opening and closing a file handle.
If you need high-frequency data logging, consider buffering data for a while and writing it out in chunks, or implement a specialized data storage class that subclasses TextDataStorage or DataStorageBase.
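
One possible way to buffer rows before writing is sketched below. BufferedAppender is a hypothetical helper, not part of qudi; it wraps any callable that accepts a chunk of rows. Whether append_file accepts a plain list of rows or requires a numpy array is an assumption to verify against its docstring.

```python
class BufferedAppender:
    """Collect rows and hand them to append_func in chunks."""

    def __init__(self, append_func, chunk_size=100):
        self._append_func = append_func
        self._chunk_size = chunk_size
        self._buffer = []

    def add_row(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._chunk_size:
            self.flush()

    def flush(self):
        # Write out whatever is buffered and start a fresh buffer
        if self._buffer:
            self._append_func(self._buffer)
            self._buffer = []


# Stand-in for something like
# lambda rows: data_storage.append_file(rows, file_path)
written_chunks = []
appender = BufferedAppender(written_chunks.append, chunk_size=4)
for i in range(10):
    appender.add_row((i, i ** 2))
appender.flush()  # write the remaining rows
print([len(chunk) for chunk in written_chunks])  # [4, 4, 2]
```

The trade-off is that buffered rows are lost if the process crashes before the next flush, so the chunk size should reflect how much data loss is acceptable.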

Thread-Safety

Saving and loading data using the data storage objects is generally not thread-safe. In the intended use case of multiple threads reading and writing non-shared individual files, this should not pose a problem.
Every thread should create its own instance of a data storage object and read/write different separate files.
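
The intended pattern of one writer and one file per thread can be sketched with the standard library. Plain file writes stand in for storage objects here, so this illustrates the pattern only; in qudi each worker would create its own data storage instance instead.

```python
import os
import tempfile
import threading


def worker(root_dir, index):
    # Each thread writes to its own, non-shared file
    file_path = os.path.join(root_dir, f'thread_{index}.dat')
    with open(file_path, 'w') as file:
        for row in range(3):
            file.write(f'{row}\t{row * index}\n')


root_dir = tempfile.mkdtemp()
threads = [threading.Thread(target=worker, args=(root_dir, i)) for i in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

created_files = sorted(os.listdir(root_dir))
print(created_files)
```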

The handling of the global parameters (read/add/remove) can be considered thread-safe.
