Francesc Alted, Aleix Alcacer and Christian Steiner. SciPy Conference 2021.
Caterva is an open source C and Python library, and an associated format, that implements a compressed, multidimensional container.
It differs from existing solutions like HDF5 or Zarr mainly in its two-level partitioning scheme and its type-agnostic containers, both of which are covered below.
To understand Caterva it is important to know some terms that are directly related to it:
Data compression is the process of encoding data in order to reduce its size. Caterva usually works with compressed datasets, allowing for a more contained use of space in memory or on disk.
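To get a feel for what compression buys us, here is a minimal sketch of a lossless round trip. It assumes the python-blosc package is available (Caterva itself builds on the C-Blosc2 library, but python-blosc is convenient for a quick demonstration):
import blosc
import numpy as np
# Highly regular data compresses very well
arr = np.linspace(0, 1, 1_000_000)
raw = arr.tobytes()
compressed = blosc.compress(raw, typesize=arr.itemsize)
print(len(raw) / len(compressed))  # compression ratio (well above 1 here)
# Decompressing restores the data exactly
restored = np.frombuffer(blosc.decompress(compressed), dtype=arr.dtype)
assert np.array_equal(arr, restored)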
Data chunking is a technique that consists of dividing a dataset into partitions of a specific size (chunks). In addition, Caterva algorithms implement a second level of partitioning to achieve better performance.
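As a toy illustration of the two levels of partitioning, the following sketch (plain NumPy arithmetic, not a Caterva API) computes how a dataset is carved into chunks, and each chunk into blocks, using the same parameters as the benchmarks later in this document:
import numpy as np
shape = (8_000, 8_000)   # the whole dataset
chunks = (4_000, 100)    # first-level partitions
blocks = (500, 25)       # second-level partitions, inside each chunk
chunks_per_dim = np.array(shape) // np.array(chunks)    # dims divide evenly here
blocks_per_chunk = np.array(chunks) // np.array(blocks)
print(chunks_per_dim)    # [ 2 80] -> 160 chunks in the dataset
print(blocks_per_chunk)  # [8 4]   -> 32 blocks inside every chunk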
Caterva is a C library for handling multi-dimensional, chunked, compressed datasets in an easy and fast way.
Caterva can be used in a great variety of scenarios; however, where it really stands out is with multidimensional data. Specifically, Caterva is really useful for extracting slices of compressed data because, thanks to the partitioning schema it implements, it minimizes the amount of data that has to be decompressed to obtain the slice, making things faster.
Accordingly, for cases where the slicing performance is important, Caterva turns out to be a good alternative to other solutions like Zarr or HDF5.
In addition, Caterva also introduces a new level of partitioning. Within each chunk, the data is repartitioned into smaller multidimensional sets called blocks.
In this way, Caterva can read blocks individually (and also in parallel) instead of whole chunks, which speeds up slice extraction by decompressing only those blocks that contain the slice.
In this section, we are going to extract some hyperplanes from chunked arrays created with Caterva, Zarr, and HDF5. We will also analyze the performance differences between these libraries and how double partitioning affects Caterva.
In these three libraries, the data is stored using Blosc chunks (which are internally split into blocks). However, while Zarr and HDF5 only introduce multidimensionality for chunks, Caterva introduces it for both chunks and blocks.
The chunks have also been optimized for extracting hyperslices along the second dimension.
Figure: partition layouts for slicing along the first dimension vs the second dimension.
Thanks to the second level of partitioning, Caterva will decompress less data than the other formats.
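A back-of-the-envelope calculation, using the same parameters as the benchmark below, gives a feel for the size of the effect. To read one hyperplane along the non-optimized first dimension, a chunk-only scheme has to decompress every chunk that the row crosses, whereas a chunk-plus-block scheme only needs the blocks that actually contain it:
shape, chunks, blocks, itemsize = (8_000, 8_000), (4_000, 100), (500, 25), 8
# A row [i, :] crosses shape[1] // chunks[1] = 80 chunks...
chunk_bytes = (shape[1] // chunks[1]) * chunks[0] * chunks[1] * itemsize
# ...but only shape[1] // blocks[1] = 320 blocks
block_bytes = (shape[1] // blocks[1]) * blocks[0] * blocks[1] * itemsize
print(chunk_bytes // 2**20, "MiB vs", block_bytes // 2**20, "MiB")  # 244 MiB vs 30 MiB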
import zarr
import caterva as cat
import numpy as np
import h5py as h5
import hdf5plugin as h5plugin
%load_ext memprofiler
First, the shape, chunks and blocks parameters are defined. As we can see, the second dimension of the chunks and blocks is smaller, and hence optimized for extracting hyperslices along it.
shape = (8_000, 8_000)
chunks = (4_000, 100)
blocks = (500, 25)
dtype = np.dtype("f8")
itemsize = dtype.itemsize
Then, a Caterva array, a Zarr array, and an HDF5 array are created from a NumPy array using the parameters defined before.
data = np.arange(np.prod(shape), dtype=dtype).reshape(shape)
c_data = cat.asarray(data, chunks=chunks, blocks=blocks)
z_data = zarr.array(data, chunks=chunks)
f = h5.File('hdf5_file.h5', 'w', driver="core")
f.create_dataset("data", shape, chunks=chunks, data=data, **h5plugin.Blosc())
h_data = f["data"]
Finally, some hyperplanes from the chunked arrays are extracted, and the performance is measured using the memprofiler plugin for Jupyter.
planes_dim0 = np.random.randint(0, shape[0], 100)
%%mprof_run -q caterva::dim0
for i in planes_dim0:
    block = c_data[i, :]
%%mprof_run -q zarr::dim0
for i in planes_dim0:
    block = z_data[i, :]
%%mprof_run -q hdf5::dim0
for i in planes_dim0:
    block = h_data[i, :]
planes_dim1 = np.random.randint(0, shape[1], 100)
%%mprof_run -q caterva::dim1
for i in planes_dim1:
    block = c_data[:, i]
%%mprof_run -q zarr::dim1
for i in planes_dim1:
    block = z_data[:, i]
%%mprof_run -q hdf5::dim1
for i in planes_dim1:
    block = h_data[:, i]
f.close()
%mprof_barplot --title "Getting data (lower is better)" --variable time --groupby 1 .*
As we can see in the graph, the slicing times are similar in the optimized dimension. However, Caterva performs better (by far) in the non-optimized dimension. This is because, with double partitioning, you only have to decompress the blocks containing the slice instead of the whole chunk.
Now, we are going to update some hyperplanes from chunked arrays created with Caterva, Zarr, and HDF5. As before, we will also analyze the performance differences between these libraries and how double partitioning affects Caterva.
First, let's define some necessary parameters.
shape = (8_000, 8_000)
chunks = (4_000, 100)
blocks = (500, 25)
dtype = np.dtype("f8")
itemsize = dtype.itemsize
Then, an empty array for each library is created with the previous parameters.
c_data = cat.empty(shape, itemsize, chunks=chunks, blocks=blocks)
z_data = zarr.empty(shape, dtype=dtype, chunks=chunks)
f = h5.File('hdf5_file.h5', 'w', driver="core")
f.create_dataset("data", shape, chunks=chunks, **h5plugin.Blosc())
h_data = f["data"]
Finally, some hyperplanes from the chunked arrays are updated and the performance is measured as in the previous section.
planes_dim0 = np.random.randint(0, shape[0], 100)
block_dim0 = np.arange(shape[1], dtype=dtype)  # a row [i, :] has shape[1] elements
%%mprof_run -q caterva::dim0
for i in planes_dim0:
    c_data[i, :] = block_dim0
%%mprof_run -q zarr::dim0
for i in planes_dim0:
    z_data[i, :] = block_dim0
%%mprof_run -q hdf5::dim0
for i in planes_dim0:
    h_data[i, :] = block_dim0
planes_dim1 = np.random.randint(0, shape[1], 100)
block_dim1 = np.arange(shape[0], dtype=dtype)  # a column [:, i] has shape[0] elements
%%mprof_run -q caterva::dim1
for i in planes_dim1:
    c_data[:, i] = block_dim1
%%mprof_run -q zarr::dim1
for i in planes_dim1:
    z_data[:, i] = block_dim1
%%mprof_run -q hdf5::dim1
for i in planes_dim1:
    h_data[:, i] = block_dim1
f.close()
%mprof_barplot --title "Setting data (lower is better)" --variable time --groupby 1 .*
In this case, the performance is also similar in the optimized dimension. However, there are differences in the non-optimized dimension. This is because, while Zarr and HDF5 only have to reorganize the data into chunks, Caterva has more work to do: as explained before, it also has to perform a second reorganization of the data due to the additional repartitioning into blocks.
Caterva stores only the item size of the data, not its data type. This keeps the container format simple and type-agnostic: interpreting the data is delegated to higher-level layers (for example, ironArray uses a metalayer to add floating-point types on top of Caterva, as described later).
Despite not providing a specific data type, Caterva supports both the buffer and array protocols. Let's see how this works.
import caterva as cat
import numpy as np
shape = (1_000, 1_000)
chunks = (500, 20)
blocks = (200, 10)
dtype = np.dtype("f8")
itemsize = dtype.itemsize
a = cat.empty(shape, itemsize, chunks=chunks, blocks=blocks)
for i in range(shape[0]):
    a[i] = np.linspace(0, 1, shape[1], dtype=dtype)
When a slice is extracted from Caterva, the result is still another Caterva array. However, this new array is not based on Blosc but on a simple buffer.
b = a[5:7, 5:10]
b.info
Type | NDArray (Plainbuffer) |
---|---|
Itemsize | 8 |
Shape | (2, 5) |
In this way, the protocols mentioned above can be used to work with slices of Caterva arrays from other libraries.
But what happens if we create a NumPy array from a Caterva array based on a simple buffer?
c = np.asarray(b)
c
array([[b'\x02H\x01\xcd \x80t?', b'\x9c\x89\x01\xf6\xc0\x99x?', b'6\xcb\x01\x1fa\xb3|?', b'h\x06\x01\xa4\x80f\x80?', b"5'\x81\xb8Ps\x82?"], [b'\x02H\x01\xcd \x80t?', b'\x9c\x89\x01\xf6\xc0\x99x?', b'6\xcb\x01\x1fa\xb3|?', b'h\x06\x01\xa4\x80f\x80?', b"5'\x81\xb8Ps\x82?"]], dtype='|S8')
So, in this case, the inferred data type is a byte string. Caterva internally assigns this data type because one is needed to implement the protocols.
In order to obtain the original array, the data has to be cast to the right type. To do this cast in NumPy, the ndarray.view(dtype) method can be used. In this case, we want to view the NumPy array as an array of doubles.
c = np.asarray(b).view(dtype)
c
array([[0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901], [0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901]])
Finally, here is what happens when some elements of the Caterva array are updated.
b[0] = np.arange(5, dtype=dtype)
c
array([[0. , 1. , 2. , 3. , 4. ], [0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901]])
As can be seen, the updates also appear in the NumPy array. That is because the data buffer is shared between the Caterva array and the NumPy array. Therefore, supporting the buffer and array protocols allows Caterva to share data with other libraries without copies.
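We can double-check that no copy was made with np.shares_memory, a NumPy utility that detects whether two arrays overlap in memory:
# True: the NumPy view and the Caterva slice wrap the same buffer
np.shares_memory(c, np.asarray(b))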
Metalayers are small pieces of metadata that describe the kind of data stored in a Blosc2 container. Caterva specifies its own metalayer for storing multidimensional information; this metalayer can be modified so that, for example, the shapes can be updated.
In general, you can use metalayers for adapting Blosc2 containers (and in particular, Caterva arrays) to your own needs.
First, we define the shape and the chunks and blocks for the array. Then, we create an array with one metalayer storing a date.
import caterva as cat
from struct import pack
urlpath = "arr_with_meta.caterva"
shape = (1_000, 1_000)
chunks = (500, 500)
blocks = (10, 250)
meta = {
    b"date": b"01/01/2021"
}
a = cat.full(shape, fill_value=pack("f", 3.14), chunks=chunks, blocks=blocks, meta=meta,
urlpath=urlpath)
a = cat.open(urlpath)
Then we get the names of all the metalayers in the array:
a.meta.keys()
['caterva', 'date']
With that, we can get the information stored in e.g. the date metalayer:
assert a.meta.get("date") == a.meta["date"]
a.meta["date"]
b'01/01/2021'
Now, let's update the content of the date metalayer. It is important to remember that the length of a Caterva metalayer cannot change, so you must be careful when updating it.
a.meta["date"] = b"08/01/2021"
try:
    a.meta["date"] = b"8/1/2021"
except ValueError as err:
    print(err)
The length of the content in a metalayer cannot change.
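If you need content whose length may vary, a simple workaround (our own sketch, not a Caterva feature) is to reserve a fixed number of bytes and pad the actual payload. Here, fixed_width is a hypothetical helper, and the width matches the 10 bytes of the original date:
META_WIDTH = 10  # length of the original b"01/01/2021" payload

def fixed_width(payload: bytes, width: int = META_WIDTH) -> bytes:
    # Pad (or truncate) so the stored length never changes
    return payload[:width].ljust(width, b"\x00")

a.meta["date"] = fixed_width(b"8/1/2021")  # now exactly 10 bytes, so the update succeeds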
Finally, you should know that Caterva introduces a metalayer in the array storing the multidimensional information. You can inspect this metalayer easily:
import msgpack
caterva_meta = msgpack.unpackb(a.meta.get("caterva"))
print(f"Format version: {caterva_meta[0]}")
print(f"N. dimensions: {caterva_meta[1]}")
print(f"Shape: {caterva_meta[2]}")
print(f"Chunks: {caterva_meta[3]}")
print(f"Blocks: {caterva_meta[4]}")
cat.remove(urlpath)
Format version: 0
N. dimensions: 2
Shape: [1000, 1000]
Chunks: [500, 500]
Blocks: [10, 250]
ironArray is a library built on top of Caterva. It is a powerful, flexible and fast toolkit for managing and computing with floating-point datasets.
Its highlights include fast computations on compressed data and automatic tuning of compression parameters, as described below.
For more information about ironArray, see: https://ironarray.io
In order to better grasp what compression can bring to high performance computing, and in particular how it can contribute to breaking the memory wall, let's look at an example of computation with actual data (coming from a precipitation dataset): computing the mean of three datasets with ironArray (ia) and with NumPy (np).
ironArray will use artificial intelligence algorithms to achieve outstanding execution times. Choose between speed, compression ratio, or a balance between the two (the default), and ironArray will decide the best parameters for completing the computation.
ironArray Community is the open source version of ironArray. It has been developed to mimic the API of h5py and Zarr, and it implements support for single- and double-precision floating-point data using a metalayer. With the Community edition of ironArray, you can extract slices from floating-point datasets in a simple way!
For more information about ironArray Community, see: https://ironarray.io/products
import iarray_community as ia
import numpy as np
shape = (1_000, 1_000)
chunks = (500, 500)
blocks = (100, 100)
dtype = np.float64
data = ia.zeros(shape, dtype=dtype, chunks=chunks, blocks=blocks, codec=ia.Codecs.LZ4)
data.info
type | IArray |
---|---|
shape | (1000, 1000) |
chunks | (500, 500) |
blocks | (100, 100) |
cratio | 31250.00 |
data[0] = np.linspace(0, 1, shape[1], dtype=dtype)
s = data[0, 250:-740]
type(s)
numpy.ndarray
s
array([0.25025025, 0.25125125, 0.25225225, 0.25325325, 0.25425425, 0.25525526, 0.25625626, 0.25725726, 0.25825826, 0.25925926])
Resize array dimensions.
Improve slicing capabilities: currently Caterva only supports basic slicing based on start:stop ranges; we would like to extend this to start:stop:step, as well as selections based on an array of booleans (similar to NumPy; see the sketch after this list).
Support for variable-length metalayers: this would provide users a lot of flexibility to define their own metadata.
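For reference, here is the NumPy behavior that the slicing roadmap item aims for (a plain NumPy example, nothing Caterva-specific):
import numpy as np

arr = np.arange(10)
print(arr[1:9:2])         # start:stop:step slicing -> [1 3 5 7]
print(arr[arr % 3 == 0])  # boolean selection       -> [0 3 6 9]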
The Blosc team would like to thank our sponsors:
Huawei for making a generous donation that allowed us to get started with Caterva.
The NumFOCUS foundation for several small development grants (SDG).
ironArray SL for making a donation to finish outlining Caterva.
Last but not least, thanks to the SciPy Conference for enabling the opportunity to introduce Caterva to the community.