ESR13 - Integration of storage and programming models for Exascale applications

Recruiting Institution/Company: Barcelona Supercomputing Center (Spain)

PhD awarded by: Universitat Politecnica de Catalunya (Spain)

Over the last two decades, the HDF5 file format [1] has become a de facto standard for storing the high-volume scientific data used in large-scale HPC computations.

While the user-friendly APIs provided by the HDF library, coupled with the simplicity of a POSIX-style abstract data model, guarantee a good user experience, at a lower level HDF5 presents a severe scalability problem that needs to be addressed: the underlying use of parallel file systems makes it hard to exploit locality, since data always need to be moved from the storage node(s) to the computational node(s). The objective of this project is to overcome this bottleneck in order to fully exploit the capabilities of next-generation storage-class memories in distributed environments.
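As a concrete reminder of the data model involved, the following minimal h5py sketch builds and reads back the usual hierarchical group/dataset layout; the file, group, and dataset names are illustrative only and not taken from the project.

```python
import h5py
import numpy as np

# Create a small HDF5 file with the usual hierarchical layout:
# groups act like directories, datasets hold the numerical arrays.
with h5py.File("particles.h5", "w") as f:           # file name is illustrative
    step = f.create_group("timestep_0")              # POSIX-style path: /timestep_0
    step.create_dataset("position", data=np.random.rand(1000, 3))
    step.create_dataset("velocity", data=np.random.rand(1000, 3))

# Reading back: with the default storage path every access goes through the
# parallel file system, so data travel from the storage nodes to the compute
# node executing this code.
with h5py.File("particles.h5", "r") as f:
    pos = f["/timestep_0/position"][:]
    print(pos.shape)
```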

Better I/O scaling.

Create a novel object-oriented model mirroring the internal structure of an HDF5 file, as the basis for a generic connector that plugs the HDF library into different persistent object stores [2] through the Virtual Object Layer [3]. The interface, which maps complex HDF5 data types onto object-oriented classes, allows any persistent object store to manage HDF5 datasets as objects whose sub-components, themselves objects, are independently accessible and manipulable.
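To give an idea of the kind of mapping the connector performs, the sketch below mirrors a fragment of an HDF5 file's internal structure (file, group, dataset) as plain Python classes. The class and attribute names are hypothetical and do not reproduce the actual connector code; they only illustrate how sub-components become independently addressable objects.

```python
from dataclasses import dataclass, field

# Hypothetical object-oriented mirror of an HDF5 file's internal structure.
# In the real connector, classes like these would be registered with the
# persistent object store so that each instance is independently addressable.

@dataclass
class Dataset:
    name: str
    dtype: str
    shape: tuple
    chunks: list = field(default_factory=list)   # each chunk could itself be a persistent object

@dataclass
class Group:
    name: str
    members: dict = field(default_factory=dict)  # name -> Group or Dataset

    def create_dataset(self, name, dtype, shape):
        ds = Dataset(name, dtype, shape)
        self.members[name] = ds
        return ds

@dataclass
class File:
    root: Group = field(default_factory=lambda: Group("/"))

# Usage: the path /timestep_0/position maps to three distinct objects
# (File, Group, Dataset) that a store can place and access independently.
f = File()
step = Group("timestep_0")
f.root.members[step.name] = step
step.create_dataset("position", dtype="float64", shape=(1000, 3))
```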

Distributed persistent object store enhancements.

Formalize a new middleware approach to improve I/O in HDF5 applications by plugging in the distributed persistent object store dataClay [4] through the above-mentioned connector, taking advantage of the following aspects (see the sketch after the list):

1. Data distribution is performed by the middleware, without requiring the application to implement the partitioning logic.
2. Computation is moved to the nodes where the data reside, instead of moving data between nodes.
3. Data replication and synchronization are fine-grained, allowing users to control the number of copies and the consistency policies of specific data subsets.
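The following is a minimal sketch of points 1 and 2 under a simplified in-process model: the ParticleChunk class and the backends dictionary are hypothetical stand-ins for store-managed objects and storage nodes, meant only to illustrate shipping method calls to the data instead of shipping the data; it is not dataClay's actual API.

```python
import numpy as np

class ParticleChunk:
    """One partition of a dataset, owned by a single backend (hypothetical)."""
    def __init__(self, positions):
        self.positions = positions            # the data stay where the object lives

    def mean_position(self):
        # In the real system this method would run on the backend that owns
        # the chunk, so only the small result crosses the network.
        return self.positions.mean(axis=0)

# Point 1: the middleware, not the application, decides the partitioning.
backends = {0: [], 1: []}                     # two fake storage nodes, in-process
data = np.random.rand(1_000_000, 3)
for i, part in enumerate(np.array_split(data, 4)):   # equal-sized partitions
    backends[i % len(backends)].append(ParticleChunk(part))

# Point 2: the application ships method invocations and aggregates the
# per-chunk results locally (exact here because the chunks are equal-sized).
partial_means = [chunk.mean_position()
                 for node in backends.values() for chunk in node]
print(np.mean(partial_means, axis=0))
```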

The methodology has been validated by executing, on multiple Grid'5000 [5] machines, a set of benchmarks based on the following real-world applications that natively rely on HDF5:

VPIC [6], a general-purpose particle-in-cell simulation code for modeling kinetic plasmas.

BD-CATS [7], a clustering code that, like VPIC, operates at trillion-particle scale and is used to analyze large-scale cosmological computations.

AMfe-FETI [8], a library implementing efficient parallel FETI solvers.

VPIC and BD-CATS have been chosen because they exhibit representative I/O patterns, and the improvements have been evaluated on (1) small random writes, (2) large sequential writes, and (3) large sequential reads. The tests on the AMfe-FETI solvers, instead, were aimed at evaluating the performance gains obtainable by tuning the fine-grained synchronization mechanisms.
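For reference, the sketch below reproduces the three evaluated access patterns with h5py against a plain HDF5 file; the dataset name and sizes are illustrative, and this is not the benchmark harness actually used in the project.

```python
import h5py
import numpy as np

N, DIM = 1_000_000, 3                          # illustrative sizes only
rng = np.random.default_rng(0)

with h5py.File("bench.h5", "w") as f:
    dset = f.create_dataset("particles", shape=(N, DIM), dtype="f8")

    # (1) small random writes: scattered single-row updates
    for idx in rng.integers(0, N, size=1000):
        dset[idx, :] = rng.random(DIM)

    # (2) large sequential write: one contiguous block
    dset[:] = rng.random((N, DIM))

with h5py.File("bench.h5", "r") as f:
    # (3) large sequential read: stream the whole dataset back
    data = f["particles"][:]
    print(data.nbytes, "bytes read")
```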

The results have shown that, by bypassing the parallel file system infrastructure and relying on the capabilities of a distributed persistent object store, it is possible to achieve significant I/O improvements, speeding up applications by up to 10x in certain cases.

References

[1] The HDF Group, “Hierarchical Data Format, version 5,” 1997–2018.
[2] A. L. Brown and R. Morrison, “A generic persistent object store,” Software Engineering Journal, vol. 7, 1992.
[3] M. Chaarawi and Q. Koziol, “HDF5 Virtual Object Layer,” technical report, 2011.
[4] J. Martí, A. Queralt, T. Cortes et al., “Dataclay: A distributed data store for effective inter-player data sharing,” Journal of Systems and Software, vol. 131, 2017.
[5] F. Cappello et al., “Grid'5000: A large scale and highly reconfigurable grid experimental testbed,” in Proc. 6th IEEE/ACM International Workshop on Grid Computing, 2005, doi: 10.1109/GRID.2005.1542730.
[6] K. Bowers et al., “Ultrahigh performance three-dimensional electromagnetic relativistic kinetic plasma simulation,” Physics of Plasmas, vol. 15, 2008.
[7] M. Patwary et al., “BD-CATS: Big Data Clustering at Trillion Particle Scale,” in Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.
[8] G. Jenovencio, “Amfe-FETI,” github.com/jenovencio/Amfe-FETI, 2017–2019.
