Caterva: A Compressed And Multidimensional Container For Medium/Big Data¶

Francesc Alted, Aleix Alcacer and Christian Steiner. SciPy Conference 2021.


What Is Caterva?¶

Caterva is an open source C and Python library and a format that implements a compressed and multidimensional container.

It differs from existing solutions like HDF5 or Zarr in:

  • Higher performance: support for double partitioning.
  • Improved compression capabilities based on the new Blosc2 meta-compressor:
    • 64-bit containers
    • Data can be in-memory or on disk (persistent).
    • Filter pipeline. Use different filters simultaneously.
    • Plugin capabilities for filters and codecs: tailor it to your needs.
    • SIMD support for Intel, ARM and PowerPC.

https://github.com/Blosc/caterva/

https://github.com/Blosc/python-caterva/

Poster Outline¶

  1. Background
  2. Why Caterva?
    1. Double partitioning. Getting and setting data.
    2. No data types. Buffer and array protocols.
    3. Metalayers.
  3. ironArray and ironArray Community
  4. Future Work

Background¶

To understand Caterva it is important to know some terms that are directly related to it:

  • Data compression is the process of encoding data in order to reduce its size. Caterva usually works with compressed datasets, allowing for a more contained use of space in memory or on disk.

  • Data chunking is a technique that consists of dividing a dataset into partitions of a specific size (chunks). In addition, Caterva implements a second level of partitioning to achieve better performance (see the sketch after this list).
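
For illustration, here is a minimal sketch (pure NumPy, not the actual Caterva API) of how such a two-level partitioning walks a 2-D array chunk by chunk and, within each chunk, block by block:

import numpy as np

shape, chunks, blocks = (8, 8), (4, 4), (2, 2)
data = np.arange(np.prod(shape)).reshape(shape)

# First level: split the array into chunks.
for i0 in range(0, shape[0], chunks[0]):
    for j0 in range(0, shape[1], chunks[1]):
        chunk = data[i0:i0 + chunks[0], j0:j0 + chunks[1]]
        # Second level: split each chunk into blocks.  In Caterva, each
        # block is compressed with Blosc2 and can be decompressed
        # independently of the others.
        for i1 in range(0, chunk.shape[0], blocks[0]):
            for j1 in range(0, chunk.shape[1], blocks[1]):
                block = chunk[i1:i1 + blocks[0], j1:j1 + blocks[1]]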

Why Caterva?¶


Caterva is a C library for handling multi-dimensional, chunked, compressed datasets in an easy and fast way.

  • Performant: leverage double partitioning for fast slicing.
  • Metalayers: add metadata to your arrays.
  • Type-less: flexibly define your own data types as metalayers.
  • Open source: https://github.com/Blosc/python-caterva.

Use cases¶

Caterva can be used in a great variety of scenarios; however, where it really stands out is in multidimensional ones. Specifically, Caterva is very useful for extracting slices of compressed data because, thanks to the partitioning schema it implements, it minimizes the amount of data that has to be decompressed to get the slice, making things faster.

Accordingly, for cases where the slicing performance is important, Caterva turns out to be a good alternative to other solutions like Zarr or HDF5.

Double partitioning¶

Other chunking libraries store data in multidimensional chunks, which makes slice extraction from compressed data more efficient, since only the chunks containing the slice are decompressed instead of the entire array.

In addition, Caterva also introduces a new level of partitioning. Within each chunk, the data is repartitioned into smaller multidimensional sets called blocks.

In this way, Caterva can read blocks individually (and also in parallel) instead of whole chunks, which speeds up slice extraction by decompressing only those blocks containing the slice.
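
A back-of-the-envelope sketch makes the benefit concrete. Using the same shapes as the benchmark below, it estimates how much data must be decompressed to read a single row at chunk granularity versus block granularity (a simplified model that ignores compression ratios and metadata):

import numpy as np

shape, chunks, blocks = (8_000, 8_000), (4_000, 100), (500, 25)
itemsize = 8  # float64

# One full row data[i, :] intersects every chunk along the second dim.
n_chunks = shape[1] // chunks[1]                      # 80 chunks
chunk_level = n_chunks * np.prod(chunks) * itemsize   # whole chunks read

# With the second partition level, only the blocks holding the row count.
n_blocks = shape[1] // blocks[1]                      # 320 blocks
block_level = n_blocks * np.prod(blocks) * itemsize   # only blocks read

print(f"chunk-level: {chunk_level / 2**20:.0f} MiB")  # -> 244 MiB
print(f"block-level: {block_level / 2**20:.0f} MiB")  # -> 31 MiB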

Getting data¶

In this section, we are going to extract some hyperplanes from chunked arrays created with Caterva, Zarr, and HDF5. We will also analyze the performance differences between these libraries and how double partitioning affects Caterva.

In these three libraries, the data is stored using Blosc chunks (which are internally split into blocks). However, while Zarr and HDF5 only introduce multidimensionality for chunks, Caterva introduces it for both chunks and blocks.

The chunk shapes have also been optimized to extract hyperslices along the second dimension.

[Figure: slice extraction along the first dimension vs. the second dimension]

Thanks to the second level of partitioning, Caterva will decompress less data than the other formats.

In [1]:
import zarr
import caterva as cat
import numpy as np
import h5py as h5
import hdf5plugin as h5plugin
%load_ext memprofiler

First, the shape, chunks and blocks parameters are defined. As we can see, the second dimension of the chunks and blocks is smaller, and hence optimized for extracting hyperslices along it.

In [2]:
shape = (8_000, 8_000)
chunks = (4_000, 100)
blocks = (500, 25)
dtype = np.dtype("f8")
itemsize = dtype.itemsize

Then, a Caterva array, a Zarr array, and an HDF5 array are created from a NumPy array using the parameters defined before.

In [3]:
data = np.arange(np.prod(shape), dtype=dtype).reshape(shape)
In [4]:
c_data = cat.asarray(data, chunks=chunks, blocks=blocks)
In [5]:
z_data = zarr.array(data, chunks=chunks)
In [6]:
f = h5.File('hdf5_file.h5', 'w', driver="core")
f.create_dataset("data", shape, chunks=chunks, data=data, **h5plugin.Blosc())
h_data = f["data"]

Finally, some hyperplanes are extracted from the chunked arrays, and the performance is measured using the memprofiler plugin for Jupyter.

In [7]:
planes_dim0 = np.random.randint(0, shape[0], 100)
In [8]:
%%mprof_run -q caterva::dim0

for i in planes_dim0:
    block = c_data[i, :]
In [9]:
%%mprof_run -q zarr::dim0

for i in planes_dim0:
    block = z_data[i, :]
In [10]:
%%mprof_run -q hdf5::dim0

for i in planes_dim0:
    block = h_data[i, :]
In [11]:
planes_dim1 = np.random.randint(0, shape[1], 100)
In [12]:
%%mprof_run -q caterva::dim1

for i in planes_dim1:
    block = c_data[:, i]
In [13]:
%%mprof_run -q zarr::dim1

for i in planes_dim1:
    block = z_data[:, i]
In [14]:
%%mprof_run -q hdf5::dim1

for i in planes_dim1:
    block = h_data[:, i]
In [15]:
f.close()
In [16]:
%mprof_barplot --title "Getting data (lower is better)" --variable time --groupby 1 .*

As we can see in the graph, the slicing times are similar in the optimized dimension. However, Caterva performs better (by far) in the non-optimized dimension. This is because with double partitioning you only have to decompress the blocks containing the slice instead of the whole chunk.

Setting data¶

Now, we are going to update some hyperplanes from chunked arrays created with Caterva, Zarr, and HDF5. As before, we will also analyze the performance differences between these libraries and how double partitioning affects Caterva.

First, let's define some necessary parameters.

In [17]:
shape = (8_000, 8_000)
chunks = (4_000, 100)
blocks = (500, 25)
dtype = np.dtype("f8")
itemsize = dtype.itemsize

Then, an empty array for each library is created with the previous parameters.

In [18]:
c_data = cat.empty(shape, itemsize, chunks=chunks, blocks=blocks)

z_data = zarr.empty(shape, dtype=dtype, chunks=chunks)

f = h5.File('hdf5_file.h5', 'w', driver="core")
f.create_dataset("data", shape, chunks=chunks, **h5plugin.Blosc())
h_data = f["data"]

Finally, some hyperplanes of the chunked arrays are updated, and the performance is measured as in the previous section.

In [19]:
planes_dim0 = np.random.randint(0, shape[0], 100)
block_dim0 = np.arange(shape[0], dtype=dtype)
In [20]:
%%mprof_run -q caterva::dim0

for i in planes_dim0:
    c_data[i, :] = block_dim0
In [21]:
%%mprof_run -q zarr::dim0

for i in planes_dim0:
    z_data[i, :] = block_dim0
In [22]:
%%mprof_run -q hdf5::dim0

for i in planes_dim0:
    h_data[i, :] = block_dim0
In [23]:
planes_dim1 = np.random.randint(0, shape[1], 100)
block_dim1 = np.arange(shape[1], dtype=dtype)
In [24]:
%%mprof_run -q caterva::dim1

for i in planes_dim1:
    c_data[:, i] = block_dim1
In [25]:
%%mprof_run -q zarr::dim1

for i in planes_dim1:
    z_data[:, i] = block_dim1
In [26]:
%%mprof_run -q hdf5::dim1

for i in planes_dim1:
    h_data[:, i] = block_dim1
In [27]:
f.close()
In [28]:
%mprof_barplot --title "Setting data (lower is better)" --variable time --groupby 1 .*

In this case, the performance is also similar in the optimized dimension. However, there are differences in the non-optimized dimension. This is because, while Zarr and HDF5 only have to reorganize the data into chunks, Caterva has more work to do: as explained above, it also has to perform a second reorganization of the data due to the additional repartitioning into blocks.

No data types¶

Caterva only stores item size instead of the data type. The reasons for doing this are:

  • It provides more flexibility, allowing users to define their own custom data types via metalayers (see the sketch after this list).
  • The Caterva library becomes much simpler, and hence more maintainable.
  • New libraries can be built on top of Caterva that define their own data types (e.g. ironArray).
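
As a sketch of the first point, a user-defined "dtype" metalayer can carry the type information that Caterva itself does not store. This uses the same cat.full and meta arguments shown in the Metalayers section below; the b"dtype" metalayer name and its encoding are our own convention, not something the Caterva format prescribes:

import caterva as cat
import numpy as np
from struct import pack

dtype = np.dtype("f8")
meta = {b"dtype": dtype.str.encode()}  # e.g. b'<f8'

# Caterva only needs the item size; the type travels in the metalayer.
a = cat.full((100, 100), fill_value=pack("d", 3.14), chunks=(50, 50),
             blocks=(10, 10), meta=meta)

# A consumer can recover the type from the metalayer and cast the data.
stored_dtype = np.dtype(a.meta["dtype"].decode())
values = np.asarray(a[:10, :10]).view(stored_dtype)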

Buffer and array protocol¶

Despite not providing a specific data type, Caterva supports both the buffer and array protocols. Let's see how this works.

In [29]:
import caterva as cat
import numpy as np

shape = (1_000, 1_000)
chunks = (500, 20)
blocks = (200, 10)
dtype = np.dtype("f8")
itemsize = dtype.itemsize

a = cat.empty(shape, itemsize, chunks=chunks, blocks=blocks)

for i in range(shape[0]):
    a[i] = np.linspace(0, 1, shape[1], dtype=dtype)

When a slice is extracted from Caterva, the result is still another Caterva array. However, this new array is not based on Blosc but on a simple buffer.

In [30]:
b = a[5:7, 5:10]

b.info
Out[30]:
Type      NDArray (Plainbuffer)
Itemsize  8
Shape     (2, 5)

In this way, the protocols mentioned above can be used to work with slices of Caterva arrays from other libraries.

But what happens if we create a NumPy array from a Caterva array based on a simple buffer?

In [31]:
c = np.asarray(b)

c
Out[31]:
array([[b'\x02H\x01\xcd \x80t?', b'\x9c\x89\x01\xf6\xc0\x99x?',
        b'6\xcb\x01\x1fa\xb3|?', b'h\x06\x01\xa4\x80f\x80?',
        b"5'\x81\xb8Ps\x82?"],
       [b'\x02H\x01\xcd \x80t?', b'\x9c\x89\x01\xf6\xc0\x99x?',
        b'6\xcb\x01\x1fa\xb3|?', b'h\x06\x01\xa4\x80f\x80?',
        b"5'\x81\xb8Ps\x82?"]], dtype='|S8')

So in this case the inferred data type is a byte string. Caterva assigns this data type internally because one is needed to implement the protocols.

In order to obtain the original values, the data has to be cast to the right type.

To do this cast in NumPy, the ndarray.view(dtype) method can be used. In this case we want to view the NumPy array as an array of doubles.

In [32]:
c = np.asarray(b).view(dtype)

c
Out[32]:
array([[0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901],
       [0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901]])

Finally, here is what happens when some elements of the Caterva array are updated.

In [33]:
b[0] = np.arange(5, dtype=dtype)

c
Out[33]:
array([[0.        , 1.        , 2.        , 3.        , 4.        ],
       [0.00500501, 0.00600601, 0.00700701, 0.00800801, 0.00900901]])

As can be seen, the updates also appear in the NumPy array. That is because the data buffer is shared between the Caterva array and the NumPy array. Therefore, having the buffer and array protocols allows Caterva to share data between different libraries without copies.
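
A quick way to double-check the zero-copy claim is np.shares_memory, which reports whether two arrays overlap in memory. Since both NumPy views were created from the same Caterva buffer, it should report True:

# Sketch: verify that no copy was made when exporting the buffer.
print(np.shares_memory(c, np.asarray(b)))  # expected: True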

Metalayers¶


Metalayers are small pieces of metadata that describe the kind of data stored in a Blosc2 container. Caterva specifies its own metalayer for storing multidimensional information; this metalayer can be modified so that the shapes can be updated.

In general, you can use metalayers for adapting Blosc2 containers (and in particular, Caterva arrays) to your own needs.

Caterva metalayer¶

First, we define the shape and the chunks and blocks for the arrays. Then, we create an array with one metalayer storing a date.

In [34]:
import caterva as cat
from struct import pack

urlpath = "arr_with_meta.caterva"

shape = (1_000, 1_000)
chunks = (500, 500)
blocks = (10, 250)

meta = {
    b"date": b"01/01/2021"
}

a = cat.full(shape, fill_value=pack("f", 3.14), chunks=chunks, blocks=blocks, meta=meta,
             urlpath=urlpath)
In [35]:
a = cat.open(urlpath)

Then we get the names of all metalayers on the array:

In [36]:
a.meta.keys()
Out[36]:
['caterva', 'date']

With that, we can get the information stored in e.g. the date metalayer:

In [37]:
assert a.meta.get("date") == a.meta["date"]

a.meta["date"]
Out[37]:
b'01/01/2021'

Now, let's update the content of the date metalayer. It is important to remember that the length of a Caterva metalayer cannot change, so you must be careful when updating it.

In [38]:
a.meta["date"] = b"08/01/2021"
try:
    a.meta["date"] = b"8/1/2021"
except ValueError as err:
    print(err)
The length of the content in a metalayer cannot change.

Finally, note that Caterva adds a metalayer to the array storing the multidimensional information. You can inspect this metalayer easily:

In [39]:
import msgpack

caterva_meta = msgpack.unpackb(a.meta.get("caterva"))

print(f"Format version: {caterva_meta[0]}")
print(f"N. dimensions: {caterva_meta[1]}")
print(f"Shape: {caterva_meta[2]}")
print(f"Chunks: {caterva_meta[3]}")
print(f"Blocks: {caterva_meta[4]}")

cat.remove(urlpath)
Format version: 0
N. dimensions: 2
Shape: [1000, 1000]
Chunks: [500, 500]
Blocks: [10, 250]

ironArray¶


ironArray is a library built on top of Caterva. It is a powerful, flexible and fast toolkit for managing and computing with floating-point datasets.

The highlights of ironArray are:

  • High performance array and vector computations.
  • Automatic data compression and decompression.
  • Contiguous or sparse storage.
  • Tunable performance optimizations that leverage your specific CPUs caches, memory and disks.

For more information about ironArray, see: https://ironarray.io

Computation performance in ironArray¶

In order to better grasp what compression can bring to high-performance computing, and in particular how it can help break the memory wall, let's look at an example computation on actual data (coming from a precipitation dataset). Below we can see the performance of ironArray (ia) and NumPy (np) when computing the mean of three datasets:

[Figure: mean computation times for ironArray (ia) vs NumPy (np) on three precipitation datasets]

ironArray uses artificial intelligence algorithms to achieve outstanding execution times. Choose between speed, compression ratio or a balance between the two (the default), and ironArray will decide the best parameters for completing the computation.

ironArray Community¶


ironArray Community is the open source version of ironArray. It has been developed to mimic the APIs of h5py and Zarr, and it implements support for single- and double-precision floating-point data using a metalayer. With the community edition of ironArray you can extract slices from floating-point datasets in a simple way!

For more information about ironArray Community, see: https://ironarray.io/products

In [40]:
import iarray_community as ia
import numpy as np

shape = (1_000, 1_000)

chunks = (500, 500)
blocks = (100, 100)
dtype = np.float64
In [41]:
data = ia.zeros(shape, dtype=dtype, chunks=chunks, blocks=blocks, codec=ia.Codecs.LZ4)

data.info
Out[41]:
type    IArray
shape   (1000, 1000)
chunks  (500, 500)
blocks  (100, 100)
cratio  31250.00
In [42]:
data[0] = np.linspace(0, 1, shape[1], dtype=dtype)

s = data[0, 250:-740]

type(s)
Out[42]:
numpy.ndarray
In [43]:
s
Out[43]:
array([0.25025025, 0.25125125, 0.25225225, 0.25325325, 0.25425425,
       0.25525526, 0.25625626, 0.25725726, 0.25825826, 0.25925926])

Future Work¶

  • Resize array dimensions.

  • Improve slicing capabilities: currently Caterva only supports basic slicing based on start:stop ranges; we would like to extend this to start:stop:step as well as selections based on an array of booleans (similar to NumPy).

  • Support for variable-length metalayers: this would provide users a lot of flexibility to define their own metadata.

Acknowledgments¶

The Blosc team would like to thank our sponsors:

  • Huawei for making a generous donation that allowed us to get started with Caterva.

  • The NumFOCUS foundation for several small development grants (SDG).

  • ironArray SL for making a donation to finish outlining Caterva.

Last but not least, thanks to the SciPy Conference for giving us the opportunity to introduce Caterva to the community.