JULEA: A Flexible Storage Framework for HPC
Michael Kuhn
Universität Hamburg, 20146 Hamburg, Germany
Abstract. JULEA is a flexible storage framework that allows offering
arbitrary client interfaces to applications. To be able to rapidly prototype
new approaches, it offers data and metadata backends that can either
be client-side or server-side; backends for popular storage technologies
such as POSIX, LevelDB and MongoDB have already been implemented.
Additionally, JULEA allows dynamically adapting the I/O operations’
semantics and can thus be adjusted to different use-cases. It runs completely in user space, which eases development and debugging. Its goal
is to provide a solid foundation for storage research and teaching.
Keywords: Flexible storage framework · High performance computing ·
Parallel file system · Object store · Key-value store
Introduction

File systems are typically monolithic in design: they support a single storage backend, a single interface and a single set of semantics. While this design has benefits with regard to portability, it is too inflexible for research and teaching. It makes it hard to try new algorithms and approaches because they often require changes to many different components of the file system.
There are two major problems caused by this: On the one hand, many specialized solutions are created that try to solve a particular problem [11,19,21]. While these are often based on existing file systems, the code is seldom contributed back because it does not meet the original design goals; this makes it hard to maintain these approaches in the long term. On the other hand, a more or less complete understanding of the file system is necessary due to its complex design. This is especially problematic in the context of shorter projects and presents an unnecessary hurdle for young researchers and students who want to gain experience with file systems.
A possible solution for these problems is a flexible storage framework that is
extensible using plugins for its application-facing interface, its storage backend
and its internal behavior, that is, its semantics. This provides the flexibility
required to support the many different use-cases found in HPC.
Many applications do not access the file system directly but instead rely on
high-level libraries to perform I/O efficiently. This is especially common in scientific applications, where exchangeability of data is a primary concern; libraries for self-describing data formats such as NetCDF and HDF5 are widely used there. An exemplary software stack is shown in Fig. 1a. Applications only interface directly with NetCDF, which in turn depends on HDF5, and so on. Due to the complex interplay of the different components and optimizations in this stack, performance issues are a common occurrence. One of the reasons is the strict POSIX semantics that are typically provided by the underlying parallel file system and forced upon the upper layers [18].

© Springer International Publishing AG 2017. J.M. Kunkel et al. (Eds.): ISC High Performance Workshops 2017, LNCS 10524, pp. 712–723, 2017.
Fig. 1. Current HPC I/O stack and proposed JULEA I/O stack
Multiple projects are currently investigating possibilities of eliminating this
problem by integrating the I/O libraries and file systems more closely [5,12].
Providing such an interface natively could have many benefits but is hard to
achieve with current file systems. There are also approaches to combine technologies from the HPC and big data fields and use object stores instead of
full-fledged file systems [3,13]. Additionally, there has been research regarding alternative file system interfaces [2,17] and into allowing the file system's semantics to be adapted according to the applications' requirements [1,9,19].
While some of these approaches can use existing storage systems and extend
them according to their goals, many need to implement basic functionality from
scratch because they do not fit within the architecture of existing solutions.
The main contribution of this paper is JULEA, a flexible storage framework that can be used to rapidly prototype new approaches in research and
teaching. To this end, it provides basic storage building blocks that are powerful
yet generic enough to support parallel and distributed storage use-cases. This
kind of flexible functionality requires well-defined and well-documented plugin interfaces designed with common requirements in mind. To make JULEA more accessible to developers, readability is favored over performance and functionality is clearly separated. It is possible to run JULEA without system-level access to enable
easy large-scale experiments on supercomputers, where such access is typically
not available.
The resulting software stack is shown in Fig. 1b. All functionality is provided
in user space to ease development and debugging. Existing I/O libraries like
MPI-IO or HDF5 can be adapted to make use of JULEA’s functionality. Alternatively, applications can interface directly with JULEA, which manages data
and metadata storage.
This paper is structured as follows: Sect. 2 presents JULEA’s goals and design
in detail; this includes the general architecture and different components. The
implementation’s current status is shown in Sect. 3. Some preliminary evaluation
results are displayed in Sect. 4. Related work is presented in Sect. 5. Finally, the
paper is concluded and future work is presented in Sect. 6.
Design

JULEA follows a traditional client-server design where clients communicate with
servers via the network. In contrast to existing file systems that only provide a
single client interface, JULEA makes it possible to offer arbitrary interfaces to
applications. This can be used to offer traditional file system interfaces as well
as completely new types of interfaces. The servers are able to use a multitude
of existing storage technologies to foster experimentation; this is achieved by
supporting multiple backends. To facilitate rapid prototyping, both clients and
backends are easy to implement and exchange. Additionally, JULEA supports
dynamically adaptable semantics for all I/O operations. This allows clients to
support a wide range of use-cases, such as the very strict POSIX semantics as
well as the more relaxed MPI-IO semantics.
Fig. 2. JULEA’s main components
Figure 2 shows the main components of JULEA’s design. An application can
use one or more JULEA clients to talk to the storage servers. While applications
can directly use JULEA’s clients, it is also possible to adapt I/O libraries to make
use of them; for instance, this could be used to provide an appropriate MPI-IO
module or HDF5 plugin. JULEA’s servers are split into data and metadata
servers, which allows tuning the servers for their respective access patterns.
Clients

Clients are completely unrestricted regarding the interface they provide. Traditional file systems typically offer a single interface that cannot be changed easily because it is interwoven with the rest of the file system architecture.
Therefore, it is often only possible to add extensions to these existing interfaces,
which limits the amount and degree of experimentation.
Because JULEA is implemented in user space, arbitrary interfaces can
be provided. This is typically problematic for kernel space file systems, whose
client interfaces are restricted by the VFS. Clients can either be directly used
by applications or offer interfaces to be used by high-level I/O libraries.
Backends

To allow backends to be optimized for different use-cases and access patterns,
they are separated into data and metadata backends and are used by data and
metadata servers, respectively. While data backends are meant to serve large
streaming I/O, metadata backends should excel at small random accesses. Data
backends manage objects and their interface is therefore very close to popular
object stores and file systems. Metadata backends manage key-value pairs with
an appropriate interface for this use-case.
To define an appropriate interface for the data backends, interfaces of existing file systems (such as Lustre and OrangeFS), object stores (such as Ceph’s
RADOS) and I/O interfaces (such as MPI-IO) have been taken into consideration. The resulting functions supported by data backends are as follows:
– create: Creates an object given by its name.
– open: Opens an object given by its name.
– delete: Deletes an object.
– close: Closes an object.
– status: Returns an object's modification time and size.
– sync: Syncs an object to the underlying storage device.
– read: Reads data of a given length from an object at the specified offset.
– write: Writes data of a given length to an object at the specified offset.
The create and open functions return an object handle on success that can
then be used with all other functions. The delete and close functions destroy the
object handle. In contrast to POSIX’s stat, the status function only returns
very basic information to be able to support a wide range of data backends.
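As a sketch, the backend contract above can be pictured as a struct of function pointers that each data backend fills in; everything in the following toy in-memory example (type names, signatures, fields) is an assumption made for illustration and is not JULEA's actual backend API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative object handle; a real backend would wrap a file
 * descriptor, a RADOS object handle, etc. */
typedef struct {
    char name[64];
    uint8_t data[4096];
    uint64_t size;
    int64_t mtime;
} toy_object;

/* Hypothetical vtable mirroring the eight data-backend functions.
 * Trailing underscores keep the sketch C++-friendly. */
typedef struct {
    bool (*create)(const char *namespace_, const char *name, void **handle);
    bool (*open)(const char *namespace_, const char *name, void **handle);
    bool (*delete_)(void *handle);
    bool (*close)(void *handle);
    bool (*status)(void *handle, int64_t *mtime, uint64_t *size);
    bool (*sync)(void *handle);
    bool (*read)(void *handle, void *buf, uint64_t len, uint64_t off, uint64_t *bytes_read);
    bool (*write)(void *handle, const void *buf, uint64_t len, uint64_t off, uint64_t *bytes_written);
} data_backend;

/* Toy implementations of three of the functions. */
static bool toy_create(const char *namespace_, const char *name, void **handle) {
    (void)namespace_;
    toy_object *o = calloc(1, sizeof *o);
    if (o == NULL)
        return false;
    strncpy(o->name, name, sizeof o->name - 1);
    *handle = o;
    return true;
}

static bool toy_write(void *handle, const void *buf, uint64_t len, uint64_t off, uint64_t *bytes_written) {
    toy_object *o = handle;
    if (off + len > sizeof o->data)
        return false; /* toy objects have a fixed capacity */
    memcpy(o->data + off, buf, len);
    if (off + len > o->size)
        o->size = off + len;
    *bytes_written = len;
    return true;
}

static bool toy_status(void *handle, int64_t *mtime, uint64_t *size) {
    toy_object *o = handle;
    *mtime = o->mtime;
    *size = o->size; /* only very basic information, as in the text */
    return true;
}

static const data_backend toy = {
    .create = toy_create,
    .write = toy_write,
    .status = toy_status,
};
```

A plugin architecture of this shape lets servers load any backend as a shared library and call it through the same table of functions.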
As for the data backends, existing database (such as SQLite and MongoDB)
and key-value (such as LevelDB and LMDB) solutions have been investigated to
define a common interface for all metadata backends. This resulted in a common
set of functions that are offered by metadata backends:
– batch start: Starts a batch that can include put and delete operations.
– batch execute: Executes a batch.
– put: Stores a value for a given key.
– delete: Deletes a key-value pair for a given key.
– get: Returns the value for a given key.
– get all: Returns all values.
– get by prefix: Returns all values for keys starting with a given prefix.
– iterate: Iterates over a multi-value result.
Batches allow aggregating multiple put and delete operations to improve
performance; this functionality is also commonly found in current database systems. While the get function returns the value for a single key, the get all and
get by prefix functions can return multiple values. For this reason, the iterate
function allows iterating over the respective results.
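To make the batching behavior concrete, the following self-contained sketch aggregates put and delete operations and applies them in one pass over a toy in-memory store. All types and function names here are invented for illustration; a real metadata backend would delegate to LevelDB, MongoDB, etc.:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy key-value store: a fixed-size array of pairs. */
#define KV_CAP 64
typedef struct { char key[32]; char value[32]; bool used; } kv_pair;
typedef struct { kv_pair pairs[KV_CAP]; } kv_store;

/* A batch records put and delete operations in order. */
typedef enum { OP_PUT, OP_DELETE } op_type;
typedef struct { op_type type; char key[32]; char value[32]; } kv_op;
typedef struct { kv_op ops[KV_CAP]; size_t count; } kv_batch;

void batch_start(kv_batch *b) { b->count = 0; }

void batch_put(kv_batch *b, const char *key, const char *value) {
    kv_op *op = &b->ops[b->count++];
    op->type = OP_PUT;
    strncpy(op->key, key, sizeof op->key - 1);
    op->key[sizeof op->key - 1] = '\0';
    strncpy(op->value, value, sizeof op->value - 1);
    op->value[sizeof op->value - 1] = '\0';
}

void batch_delete(kv_batch *b, const char *key) {
    kv_op *op = &b->ops[b->count++];
    op->type = OP_DELETE;
    strncpy(op->key, key, sizeof op->key - 1);
    op->key[sizeof op->key - 1] = '\0';
}

static kv_pair *kv_find(kv_store *s, const char *key) {
    for (size_t i = 0; i < KV_CAP; i++)
        if (s->pairs[i].used && strcmp(s->pairs[i].key, key) == 0)
            return &s->pairs[i];
    return NULL;
}

/* Apply all aggregated operations at once; over a network this would
 * be one round trip instead of one per operation. */
void batch_execute(kv_batch *b, kv_store *s) {
    for (size_t i = 0; i < b->count; i++) {
        kv_op *op = &b->ops[i];
        kv_pair *p = kv_find(s, op->key);
        if (op->type == OP_DELETE) {
            if (p)
                p->used = false;
        } else {
            if (!p) /* find a free slot for a new key */
                for (size_t j = 0; j < KV_CAP && !p; j++)
                    if (!s->pairs[j].used)
                        p = &s->pairs[j];
            if (p) {
                p->used = true;
                strcpy(p->key, op->key);
                strcpy(p->value, op->value);
            }
        }
    }
}

const char *kv_get(kv_store *s, const char *key) {
    kv_pair *p = kv_find(s, key);
    return p ? p->value : NULL;
}
```

Because operations are applied in submission order, a delete that follows a put for the same key wins, just as it would in a serialized stream of requests.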
Data and metadata backends support namespaces to allow multiple clients to
co-exist and not interfere with each other. Additionally, they support initialization and finalization functions to set up and destroy necessary data structures.
Semantics

JULEA allows many aspects of its operations' semantics to be changed at runtime. Several key areas of the semantics have been identified as important to
provide opportunities for optimizations and are briefly described below. Even
though it is possible to mix the settings for each of these semantics, not all
combinations might produce reasonable results. Semantics templates make it
possible to easily emulate existing semantics such as POSIX.
– The atomicity semantics can be used to specify whether or not it is possible
for clients to see intermediate states of operations. These are possible because
large operations usually involve several servers. If atomicity is required, some
kind of locking has to be performed to prevent other clients from accessing
data that is currently being modified.
– The concurrency semantics can be used to specify whether concurrent
accesses will take place and, if so, what the access pattern will look like.
This allows handling different patterns appropriately without the need for
heuristics to recognize them. Depending on the level of concurrency, different
algorithms might be appropriate for operations such as locking or metadata access.
– The consistency semantics can be used to specify if and when clients will
see modifications performed by other clients and applies to both metadata
and data. This information can be used to enable client-side read caching
whenever possible.
– The ordering semantics can be used to specify whether operations are allowed
to be reordered. Because there can be a large number of operations, the
additional information can be exploited to optimize their execution.
– The persistency semantics can be used to specify if and when data and metadata must be written to persistent storage. This can be used to enable client-side write caching whenever possible.
– The safety semantics can be used to specify how safely data and metadata
should be handled. It provides guarantees about the state of the data and
metadata after the execution of operations has finished.
For more in-depth information about JULEA’s semantics, please see [8–10].
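One way to picture the adaptable semantics described above is as a plain struct of per-aspect settings plus templates for common cases. The enum values and the concrete assignments in each template below are assumptions made for this sketch, not JULEA's definitions:

```c
#include <stdbool.h>

/* One setting per semantics aspect discussed in the text. */
typedef enum { ATOMICITY_NONE, ATOMICITY_OPERATION } atomicity_t;
typedef enum { CONCURRENCY_NONE, CONCURRENCY_OVERLAPPING } concurrency_t;
typedef enum { CONSISTENCY_EVENTUAL, CONSISTENCY_IMMEDIATE } consistency_t;
typedef enum { ORDERING_RELAXED, ORDERING_STRICT } ordering_t;
typedef enum { PERSISTENCY_EVENTUAL, PERSISTENCY_IMMEDIATE } persistency_t;
typedef enum { SAFETY_NONE, SAFETY_NETWORK, SAFETY_STORAGE } safety_t;

typedef struct {
    atomicity_t atomicity;
    concurrency_t concurrency;
    consistency_t consistency;
    ordering_t ordering;
    persistency_t persistency;
    safety_t safety;
} semantics_t;

/* A template approximating strict POSIX-like behavior; the exact
 * mapping is an assumption for this sketch. */
semantics_t semantics_posix(void) {
    return (semantics_t){
        .atomicity = ATOMICITY_OPERATION,
        .concurrency = CONCURRENCY_OVERLAPPING,
        .consistency = CONSISTENCY_IMMEDIATE,
        .ordering = ORDERING_STRICT,
        .persistency = PERSISTENCY_IMMEDIATE,
        .safety = SAFETY_STORAGE,
    };
}

/* A relaxed template in the spirit of MPI-IO's weaker guarantees. */
semantics_t semantics_relaxed(void) {
    return (semantics_t){
        .atomicity = ATOMICITY_NONE,
        .concurrency = CONCURRENCY_NONE,
        .consistency = CONSISTENCY_EVENTUAL,
        .ordering = ORDERING_RELAXED,
        .persistency = PERSISTENCY_EVENTUAL,
        .safety = SAFETY_NETWORK,
    };
}

/* Example of an optimization gated on the semantics: client-side read
 * caching is only safe when other clients need not see modifications
 * immediately. */
bool can_cache_reads(semantics_t s) {
    return s.consistency == CONSISTENCY_EVENTUAL;
}
```

Passing such a struct alongside a batch of operations is one plausible way to let the storage framework pick locking, caching and ordering strategies per use-case.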
Implementation

The design discussed in the previous section has been implemented within the
JULEA project, which is freely available.1 It is written in modern C11 code and
features only two mandatory dependencies (GLib [6], libbson [14]) to make it
easily portable. The code is licensed under the GNU Lesser General Public License (LGPL
3.0 or later) to allow proprietary clients and backends in addition to the available
open source ones.
While the clients are provided in the form of shared libraries that can be
linked into the application, the server is a specialized program that can function
as both a data and metadata server. The shared libraries are written in such a
way as to allow applications to use multiple clients at the same time. Backends
are also built as shared libraries and can be loaded by the clients and servers.
Fig. 3. JULEA’s architecture with two applications using different configurations
Figure 3 shows two exemplary uses of JULEA. Both applications use JULEA's item client, which provides an easy-to-use cloud-like I/O interface. For the application on the right side, JULEA has been configured to use its client-side MongoDB metadata backend and its server-side POSIX data backend. JULEA's core library automatically loads all required client-side backends at runtime. The client forwards all requests to the core library, which in turn forwards them to the appropriate servers. For the application on the left side, JULEA has been configured to use its server-side LevelDB metadata backend and its server-side POSIX data backend. In this case, no additional client-side backend has to be loaded. Both configurations are completely transparent for the application and provide the same functionality.
JULEA’s flexibility results in a high number of possible configurations: In
addition to the data and metadata backends being configurable, the semantics
can be set for batches of operations. To facilitate easy verification and performance evaluation, JULEA contains extensive test and benchmark suites.
Additionally, JULEA includes miscellaneous utilities (a command line interface for creating, listing and deleting objects and key-value pairs; a tool for
manipulating JULEA’s configuration; a tool to gather server statistics) and
proof-of-concept codes (a FUSE file system using JULEA). This makes it easy
to achieve fast results with JULEA and provides insight into its internals.
Clients

Clients provide interfaces that can be used by applications or other I/O libraries.
They are typically required to use their own, separate namespaces to not interfere
with each other. This makes it possible to use multiple clients on top of the same
JULEA installation. For instance, the item client would manage all its data and
metadata within the item namespace while the POSIX client would use the
posix namespace. This provides flexibility that is currently not available with
many existing file systems. Currently, JULEA contains the following clients from
which applications and libraries can choose depending on their requirements:
– The object client provides direct access to JULEA’s data store and is able
to access arbitrary namespaces. It provides abstractions for single-server and
distributed objects that can be used by other clients; this allows other clients
to focus on their respective functionalities.
– The kv client provides direct access to JULEA's metadata store and is able to
access arbitrary namespaces. It provides an abstraction for key-value pairs.
As with the object client, this allows other clients to make easy use of this functionality.
– The item client provides a cloud-like interface that supports collections and
items. Collections are the top-level entity and can contain only items, which
results in a relatively flat hierarchy. Both collections and items can be listed
using iterators. Items can be distributed over the available data servers using
JULEA’s distributions; the client makes use of the object client’s distributed
object abstraction and the kv client’s key-value abstraction to achieve this.
– The posix client implements a POSIX file system using the FUSE framework
on top of JULEA. It currently uses the item client but will be migrated to
the object and kv clients.
Backends

Backends determine how data and metadata operations are handled. They are
completely transparent from the client point of view and can be exchanged using
the configuration. Backends can be either client-side or server-side, which causes
them to be loaded and used by JULEA’s clients and servers, respectively. Due
to the standardized backend interface, additional backends can be implemented
easily. JULEA already contains the following backends:
– The posix server-side data backend provides compatibility with existing
POSIX file systems. Due to using a full-featured file system as the data backend, certain functionalities – such as path lookup and permission checking –
can be duplicated within the I/O stack depending on the client used.
– The gio server-side data backend uses the GIO library that provides a modern, easy-to-use VFS API supporting multiple backends of its own, including
POSIX, FTP and SSH. It is mainly intended as a proof of concept and allows
experimenting with GIO’s more exotic backends.
– The lexos server-side data backend uses LEXOS to provide a lightweight data
store. LEXOS has been designed and implemented in [16] and only provides
basic I/O operations suited for an object store.
– The null server-side data backend is intended for performance measurements
of the overall I/O stack. It excludes the influence of underlying storage hardware by returning dummy information and discarding all incoming data.
Operations are still sent to the appropriate servers to allow measurements
of JULEA’s network components.
– The leveldb server-side metadata backend uses LevelDB for metadata storage. Due to JULEA’s metadata interface and LevelDB’s interface being very
similar, the backend is a relatively thin wrapper.
– The mongodb client-side metadata backend uses MongoDB and maps key-value pairs to documents using appropriate indexes to speed up operations.
In contrast to server-side backends, the connections to the MongoDB servers
are handled by the MongoDB C driver [15].
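To illustrate how little such a backend needs to do, here is a sketch in the spirit of the null data backend described above: it accepts all operations, discards written data and returns zeroed dummy data on reads, so measurements exclude the storage hardware. The function names and signatures are illustrative; the real backend implements JULEA's full backend interface:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t dummy; } null_handle;

static null_handle the_handle; /* one shared dummy handle suffices */

bool null_create(const char *name, void **handle) {
    (void)name;
    *handle = &the_handle; /* nothing to allocate or persist */
    return true;
}

bool null_write(void *handle, const void *buf, uint64_t len, uint64_t off, uint64_t *bytes_written) {
    (void)handle; (void)buf; (void)off;
    *bytes_written = len; /* claim success, discard the data */
    return true;
}

bool null_read(void *handle, void *buf, uint64_t len, uint64_t off, uint64_t *bytes_read) {
    (void)handle; (void)off;
    memset(buf, 0, len); /* dummy data instead of real content */
    *bytes_read = len;
    return true;
}

bool null_status(void *handle, int64_t *mtime, uint64_t *size) {
    (void)handle;
    *mtime = 0; /* dummy information */
    *size = 0;
    return true;
}
```

Since requests still travel to the servers before hitting this backend, the measured throughput isolates the cost of JULEA's network and request-handling path.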
Evaluation

JULEA's performance heavily depends on the data and metadata backends used. For this reason, this section focuses on some general performance aspects. The
local results have been generated on a desktop machine (Intel Xeon E3-1225v3)
with a consumer SSD (Samsung SSD 840 EVO). The remote results have been
measured with two dual-socket nodes (Intel Xeon X5650) with an HDD (Seagate
Barracuda 7200.12) that are connected via Gbit Ethernet; one node has been
used as a client and one node has been used as a server.
Table 1 shows performance results for the posix and null data backends as well
as the leveldb and mongodb metadata backends. The local and remote results
Table 1. Performance of different data and metadata backends

  Metadata backend  Operation  Perf. (local)
  LevelDB           Put        41,500 ops/s
  LevelDB           Get        43,000 ops/s
  MongoDB           Put         7,500 ops/s
  MongoDB           Get         8,000 ops/s
differ significantly because the remote results are limited by the Ethernet network's high latency.2 While the posix data backend already shows satisfactory throughput, the null data backend shows the maximum throughput of JULEA's current implementation. The performance difference is even more pronounced when looking at the metadata backends: While the leveldb metadata backend almost achieves the maximum possible throughput, the mongodb one is significantly slower. It is important to note that these numbers were generated using JULEA's built-in benchmark suite by simply using a different configuration file via the JULEA_CONFIG environment variable. Additionally, the benchmark suite currently uses only a single thread, that is, performance is likely better in real-world applications using multiple threads.
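The latency limit can be quantified directly: a client issuing strictly synchronous requests completes at most one operation per network round trip, so a round-trip time of 0.110 ms caps a single client at roughly 9,090 operations per second.

```c
/* Upper bound on synchronous request throughput for a given
 * network round-trip time (in seconds). */
double max_sync_ops_per_second(double rtt_seconds) {
    return 1.0 / rtt_seconds;
}
```

Settings that do not wait for a per-operation acknowledgment (such as the none safety level below) are not bound by this ceiling.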
Table 2 shows performance results of different safety semantics when using
the leveldb metadata backend. As mentioned previously, the safety semantics
can be used to specify how safely data and metadata should be handled. The
none setting provides no guarantees and does not even check whether the data
has reached the servers; the network setting (which is the default) guarantees
that data has reached the servers; the storage setting guarantees that data has
been written to persistent storage. The safety semantics is handled implicitly by
the storage framework but clients are free to override it. This can be used to
considerably decrease overhead depending on applications’ safety requirements.
Even though the presented results only highlight a few key aspects of
JULEA’s design, it can be seen that the framework is able to handle a multitude of different use-cases due to its flexibility. Adding new backends requires
only a small amount of code3 and is easy due to JULEA’s clearly defined plugin architecture. Moreover, being able to adapt the semantics allows satisfying
different requirements and tuning performance according to them.
2 The network's round-trip time is 0.110 ms, which results in a maximum of 9,090 synchronous operations per second.
3 The existing backends are between 200 and 400 lines of code each.
Table 2. Performance of the LevelDB backend with different safety levels

  Safety   Operation  Perf. (local)   Perf. (remote)
  None     Put        225,000 ops/s   62,000 ops/s
  None     Get        197,000 ops/s   59,500 ops/s
  Network  Put         41,500 ops/s    4,300 ops/s
  Network  Get         43,000 ops/s    4,300 ops/s
  Storage  Put         29,500 ops/s    3,800 ops/s
  Storage  Get         31,000 ops/s    3,900 ops/s
Related Work
OrangeFS (formerly known as PVFS) is a user-level parallel file system [4,7]. Its
Trove layer abstracts the underlying storage technologies and currently supports
arbitrary POSIX file systems for data and BDB for metadata. There are currently projects to allow using LMDB and Cassandra for metadata. The metadata
backend to use has to be specified at configure time; JULEA allows configuring
the metadata backend using its configuration file. Additionally, JULEA's backends do not have to be integrated into the storage framework but can be built separately as shared libraries.
Ceph has gone through different underlying storage technologies [20]. In the
past, it used EBOFS, a custom low-level object store. Current versions support arbitrary POSIX file systems but due to requirements regarding extended
attributes only XFS, btrfs and ext4 are properly supported. Future versions will
also support BlueStore, a custom file system built specifically for Ceph.
Lustre is a kernel file system and only provides a POSIX interface. There is
work underway to establish DAOS, which is based on Lustre and offers interfaces
for containers, key-value pairs, multi-dimensional arrays and blobs [12]. This
will be used to provide an HDF5 interface directly on top of these interfaces.
DAOS's approach is very similar to JULEA's, but their goals differ. While DAOS
is meant to be a production exascale storage system, JULEA’s goal is to provide
a convenient framework for research and teaching that can be used to evaluate
the functionality and performance of new concepts. These concepts can then be
integrated into production systems if deemed successful.
Conclusion and Future Work
JULEA provides a flexible storage framework and contains all the necessary
building blocks to facilitate rapid prototyping and evaluation of different storage technologies. It has few dependencies and can be used without system-level
access, making it a good candidate for research and teaching.
While the basic storage framework and some initial backends have been finished, more work remains to be done. First, to investigate the potential benefits
of separating metadata and data of high-level data formats, we will implement an
HDF5 VOL plugin that makes use of JULEA. While the actual data (datasets)
will be stored using the data backend, everything else will be handled by the
metadata backend, enabling efficient access to structural namespace information
and attributes. We expect this approach to provide interesting insights because
the current I/O stack causes HDF5 metadata access to be handled by the file
systems' data servers, which are usually not tuned for these specific access patterns. Second, we will further extend JULEA's backend support. Specifically, we
will add a data backend for Ceph's RADOS. This will both allow easy integration of JULEA into existing Ceph environments and facilitate a comparison of the different approaches found in RADOS and in JULEA's distribution functionality.
These additional data and metadata backends will lead to further improvements
to JULEA’s backend interface, which will allow it to remain stable in the foreseeable future and provide a reliable base for third-party backends.
References

1. Al-Kiswany, S., Gharaibeh, A., Ripeanu, M.: The case for a versatile storage system. Operating Syst. Rev. 44(1), 10–14 (2010).
2. Albadri, N., Watson, R., Dekeyser, S.: TreeTags: bringing tags to the hierarchical
file system. In: Proceedings of the Australasian Computer Science Week Multiconference, Canberra, Australia, 2–5 February 2016, p. 21 (2016).
3. BigStorage: Storage-Based Convergence Between HPC and Cloud to Handle Big
Data (2017). Accessed Mar 2017
4. Carns, P.H., Ligon III, W.B., Ross, R.B., Thakur, R.: PVFS: A parallel file system for linux clusters. In: 4th Annual Linux Showcase and Conference, Atlanta,
Georgia, USA, 10–14 October 2000.
5. ESiWACE: Centre of Excellence in Simulation of Weather and Climate in Europe
(2017). Accessed Mar 2017
6. GLib: GLib Reference Manual (2017).
Accessed Mar 2017
7. Gu, P., Wang, J., Ross, R.: Bridging the gap between parallel file systems and
local file systems: A case study with PVFS. In: 2008 International Conference on
Parallel Processing, ICPP 2008, 8–12 September 2008, Portland, Oregon, USA, pp.
554–561 (2008).
8. Kuhn, M.: A semantics-aware I/O interface for high performance computing. In:
Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds.) ISC 2013. LNCS, vol. 7905, pp.
408–421. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38750-0 31
9. Kuhn, M.: Dynamically adaptable I/O semantics for high performance computing.
In: Kunkel, J.M., Ludwig, T. (eds.) ISC High Performance 2015. LNCS, vol. 9137,
pp. 240–256. Springer, Cham (2015). doi:10.1007/978-3-319-20119-1 18
10. Kuhn, M.: Dynamically adaptable I/O semantics for high performance computing. Ph.D. Thesis, Universität Hamburg (2015).
11. Kuhn, M., Kunkel, J.M., Ludwig, T.: Dynamic file system semantics to enable
metadata optimizations in PVFS. Concurrency Comput. Pract. Experience 21(14),
1775–1788 (2009).
12. Lofstead, J.F., Jimenez, I., Maltzahn, C., Koziol, Q., Bent, J., Barton, E.: DAOS
and friends: a proposal for an exascale storage system. In: Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, SC 2016, Salt Lake City, UT, USA, 13–18 November 2016, pp. 585–
596 (2016).
13. Matri, P., Costan, A., Antoniu, G., Montes, J., Pérez, M.S.: Týr: blob storage
meets built-in transactions. In: Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt
Lake City, UT, USA, 13–18 November 2016, pp. 49:1–49:12 (2016). http://dl.acm.
14. MongoDB, I.: Libbson: A BSON utility library (2017).
mongodb/libbson. Accessed Mar 2017
15. MongoDB, I.: Libmongoc: A high-performance MongoDB driver for C (2017). Accessed Mar 2017
16. Schröder, S.: Design, Implementation, and Evaluation of a Low-Level Extent-Based
Object Store. Master’s Thesis, Universität Hamburg (2013)
17. Seltzer, M.I., Murphy, N.: Hierarchical file systems are dead. In: Proceedings of
HotOS 2009: 12th Workshop on Hot Topics in Operating Systems, 18–20 May
2009, Monte Verità, Switzerland (2009).
tech/full papers/seltzer/seltzer.pdf
18. Stender, J., Kolbeck, B., Hupfeld, F., Cesario, E., Focht, E., Hess, M., Malo, J.,
Martı́, J.: Striping without sacrifices: Maintaining POSIX semantics in a parallel
file system. In: Proceedings of the First USENIX Workshop on Large-Scale Computing, LASCO 2008, 23 June 2008, Boston, MA, USA (2008). http://www.usenix.
org/events/wiov08/tech/full papers/stender/stender.pdf
19. Vilayannur, M., Nath, P., Sivasubramaniam, A.: Providing tunable consistency
for a parallel file store. In: Proceedings of the FAST 2005 Conference on File
and Storage Technologies, 13–16 December 2005, San Francisco, California, USA
20. Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: A scalable, high-performance distributed file system. In: 7th Symposium on Operating
Systems Design and Implementation (OSDI 2006), 6–8 November 2006, Seattle,
WA, USA, pp. 307–320 (2006).
21. Wright, C.P., Spillane, R.P., Sivathanu, G., Zadok, E.: Extending ACID semantics
to the file system. TOS 3(2), 4:1–4:42 (2007).