Use Cases, Software Architecture and Deployment Patterns

There are multiple use cases for Arkisto, which we document here in the abstract in addition to the specific case studies we are already working on. Because it is standards-based and extensible, Arkisto can realise the goal of making data FAIR (Findable, Accessible, Interoperable, Re-usable).

The (mythical) minimal Arkisto platform

The simplest possible Arkisto platform deployment would be a repository with some objects in it: no portal, no data ingest and no preservation services, e.g.:

An OCFL repository containing two RO-Crate data objects
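
For concreteness, the sketch below shows what such a repository might look like on disk, assuming OCFL 1.0 conventions; the object paths and payload file names are hypothetical, and real object paths depend on the storage layout the repository uses.

```
ocfl-root/
├── 0=ocfl_1.0                          # namaste file marking the OCFL storage root
├── object-1/                           # hypothetical object path
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json                  # version history and file digests
│   ├── inventory.json.sha512
│   └── v1/
│       └── content/
│           ├── ro-crate-metadata.json  # RO-Crate description of this object
│           └── observations.csv        # hypothetical payload file
└── object-2/
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    └── v1/
        └── content/
            ├── ro-crate-metadata.json
            └── interview-01.wav        # hypothetical payload file
```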

That might not seem very useful in itself, but a standards-based repository is a great candidate for data preservation - it puts the I (Interoperable) in FAIR data! It can also provide a basis for re-activation via services that Reuse the data by making it Findable and Accessible. Because Arkisto uses standards, the repository is the core to which services can be added.

Adding a web portal

To make the data Findable - the F in FAIR data - just add one of the Arkisto Portal tools. This requires some configuration, but significantly less effort than building a service from scratch.

Adding a portal with an indexer, configuration and authorization service

Data ingest pathways

But how does data get into an OCFL repository? There are several patterns - some in production, some in development and some planned.

A snapshot repository

Exporting a repository to RO-Crate can be a matter of writing a script to interrogate a database, convert static files, or otherwise traverse an existing dataset.

This pattern was used for the 2020 snapshot of ExpertNation, where we were given an XML file exported from Heurist and used a script to convert that data to the RO-Crate format. The resulting RO-Crate can in turn be deposited in a repository - in this case the UTS data repository - and served via a portal, preserving the data while at the same time making it Accessible.

A conversion script converts existing objects in a repository to RO-Crate format and deposits them in an OCFL repository
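
As an illustration only, the following sketch shows the general shape of such a conversion. It assumes a hypothetical XML export made up of <record> elements with id, title and description fields (a real Heurist export is structured differently) and writes a minimal ro-crate-metadata.json for them; the resulting crate directory could then be deposited as an OCFL object.

```python
# Hypothetical sketch: convert an XML export into a minimal RO-Crate.
# The <record id="..."><title>...</title><description>...</description></record>
# structure is assumed for illustration; a real Heurist export looks different.
import json
import xml.etree.ElementTree as ET
from pathlib import Path

def xml_to_crate(xml_path: str, crate_dir: str) -> None:
    records = ET.parse(xml_path).getroot().findall(".//record")

    graph = [
        {   # descriptor entity required by the RO-Crate spec
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        },
        {   # root dataset describing the whole export
            "@id": "./",
            "@type": "Dataset",
            "name": "ExpertNation snapshot (hypothetical example)",
            "hasPart": [{"@id": f"#record-{r.get('id')}"} for r in records],
        },
    ]
    # one contextual entity per exported record
    for r in records:
        graph.append({
            "@id": f"#record-{r.get('id')}",
            "@type": "CreativeWork",
            "name": r.findtext("title", default=""),
            "description": r.findtext("description", default=""),
        })

    out = Path(crate_dir)
    out.mkdir(parents=True, exist_ok=True)
    metadata = {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}
    (out / "ro-crate-metadata.json").write_text(json.dumps(metadata, indent=2))

xml_to_crate("expertnation-export.xml", "expertnation-crate")   # hypothetical file names
```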

A growing cultural collection

In PARADISEC, research teams add collections and items using bespoke tools that were written for the archive. This is transitioning to a model where datasets will be described using Arkisto tools, particularly Describo.

A researcher describes data using Describo and deposits it in the PARADISEC repository

Field data capture

Data from sensors in the field is often streamed directly to some kind of database, with or without portal services and interfaces. There are multiple possible Arkisto deployment patterns in this situation. Where practical, the UTS eResearch team aims to take an approach that first keeps and preserves copies of any raw data files; the team then builds databases and discovery portals from the raw data, although this is not always possible.

This diagram shows an approximation of one current scenario we're working with at UTS:

Sensor data capture via a vendor-supplied database, with a custom script to convert monthly readings
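
A minimal sketch of what such a conversion script might look like follows, assuming the monthly readings can be pulled out with SQL (a SQLite file stands in for the vendor database); the table and column names - readings, sensor_id, ts and value - are hypothetical.

```python
# Hypothetical sketch: snapshot one month of sensor readings from a database
# into a CSV plus a minimal RO-Crate, ready to deposit into the repository.
# The table and column names (readings, sensor_id, ts, value) are illustrative;
# a SQLite file stands in for the vendor-supplied database.
import csv
import json
import sqlite3
from pathlib import Path

def snapshot_month(db_path: str, month: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT sensor_id, ts, value FROM readings WHERE ts LIKE ? ORDER BY ts",
            (f"{month}%",),   # e.g. '2021-07' matches '2021-07-01T...'
        ).fetchall()

    csv_name = f"readings-{month}.csv"
    with open(out / csv_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sensor_id", "ts", "value"])
        writer.writerows(rows)

    # minimal RO-Crate metadata describing the snapshot and its one data file
    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
             "about": {"@id": "./"},
             "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}},
            {"@id": "./", "@type": "Dataset",
             "name": f"Sensor readings for {month} (hypothetical example)",
             "hasPart": [{"@id": csv_name}]},
            {"@id": csv_name, "@type": "File", "encodingFormat": "text/csv"},
        ],
    }
    (out / "ro-crate-metadata.json").write_text(json.dumps(metadata, indent=2))

snapshot_month("sensors.db", "2021-07", "sensor-crate-2021-07")   # hypothetical names
```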

We are working to supply the research team in question with an OCFL repository, add a portal, and create additional data-capture pathways for genomics data and potentially more sensors. We could then move on to adding analytical tools, such as a dashboard that shows plots of sensor readings.

Analytical and housekeeping tools

So far on this page we have covered simplified views of Arkisto deployment patterns with the repository at the centre, adding discovery portals and giving examples of data-acquisition pipelines (just scratching the surface of the possibilities). These things have value in themselves: making sure data is well described and as future-proof as possible is vitally important. But what can we DO with data?

OCFL + RO-Crate tools

Having data in a standards-based stack, with OCFL used to lay out research data objects and RO-Crate to describe them, means that it is possible to write simple programs that can interrogate a repository. That is, you don't have to spend time understanding the organisation of each dataset. The same idea underpins Arkisto's web-portal tools: standardisation reduces the cost of building.

  • Validators and preservation tools: there are not many of these around yet, but members of the Arkisto community, as well as the broader OCFL and RO-Crate communities, are working on them; expect to see preservation tools that work over OCFL repositories in particular.

  • Brute-force scripts: for moderate-sized repositories, it is possible to write scripts that examine every object in a repository and run analytics. For instance, it would be possible to visualise a number of years' worth of sensor readings from a small number of sensors, or to look at the geographical distribution of events in historical datasets. We're working on several such use cases at UTS at the moment; a sketch of this approach follows below.

A simple brute-force script to draw a graph can visit every object in an OCFL repository
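
As an example of this approach, the sketch below walks an OCFL storage root, reads each object's most recent ro-crate-metadata.json (a simplification: it looks in version directories directly rather than resolving the OCFL inventory), and tallies objects by the year of an assumed datePublished field on the root dataset.

```python
# Hypothetical sketch: brute-force walk of an OCFL storage root, tallying
# RO-Crate root datasets by year of an assumed datePublished field.
import json
import os
from collections import Counter
from pathlib import Path

def latest_crate_metadata(object_dir: Path):
    """Return the parsed ro-crate-metadata.json from the newest version that has one."""
    versions = sorted(
        (d for d in object_dir.iterdir() if d.is_dir() and d.name.startswith("v")),
        key=lambda d: int(d.name[1:]) if d.name[1:].isdigit() else 0,
    )
    for version in reversed(versions):
        candidate = version / "content" / "ro-crate-metadata.json"
        if candidate.exists():
            return json.loads(candidate.read_text())
    return None

def tally_by_year(storage_root: str) -> Counter:
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if not any(f.startswith("0=ocfl_object") for f in filenames):
            continue                                   # not an OCFL object root
        dirnames.clear()                               # don't descend into the object
        crate = latest_crate_metadata(Path(dirpath))
        if not crate:
            continue
        root = next((e for e in crate.get("@graph", []) if e.get("@id") == "./"), {})
        date = root.get("datePublished", "")
        if date:
            counts[date[:4]] += 1                      # tally by year
    return counts

print(tally_by_year("/data/ocfl-root"))                # hypothetical storage root path
```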

Adding databases and other indexes

For larger-scale use, visiting every object in a repository can be inefficient. In these cases, using an index means that an analysis script can request all the data in a particular time-window or of a certain type - or any other query that the index supports. The Arkisto Portal tools are built around general purpose indexes that can do full-text and facetted searching, with the potential to support either human or machine interfaces.

Adding an index can improve script performance

While the index engines used in our current portals are based on full-text search and metadata, we expect others to be built as needed by disciplines using, for example, SQL databases or triple stores.
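
To illustrate the idea, the sketch below builds a small SQLite index over crate metadata and then answers a time-window query from the index rather than from the repository itself; the walk_crates helper and the fields indexed are assumptions, not part of the Arkisto Portal tools.

```python
# Hypothetical sketch: build a small SQLite index over crate metadata, then
# answer time-window queries from the index instead of re-reading every object.
# walk_crates() is assumed to yield (object_path, root_dataset_dict) pairs,
# e.g. adapted from the brute-force walker sketched earlier.
import sqlite3

def build_index(db_path: str, crates) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS objects "
            "(path TEXT PRIMARY KEY, name TEXT, type TEXT, date_published TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
            (
                (path, root.get("name", ""), str(root.get("@type", "")),
                 root.get("datePublished", ""))
                for path, root in crates
            ),
        )

def objects_in_window(db_path: str, start: str, end: str):
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT path, name FROM objects "
            "WHERE date_published BETWEEN ? AND ? ORDER BY date_published",
            (start, end),
        ).fetchall()

# build_index("repo-index.db", walk_crates("/data/ocfl-root"))        # hypothetical
# print(objects_in_window("repo-index.db", "2015-01-01", "2019-12-31"))
```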

Analysis tool integration

Above, we have looked at how people or machines can access an Arkisto platform deployment by querying the repository, either directly or via an index. However, there is a much larger opportunity in being able to integrate Arkisto deployments with other tools in the research landscape. To take one example, text analysis is in great demand across a very wide range of disciplines. This hypothetical scenario shows the potential for a researcher to use a web portal to locate datasets which contain text and then send the whole results set to an analysis platform, in this case an interactive Jupyter notebook.

Researcher uses a web portal to locate datasets "Show me all records with attached text from 1950-196" and send the subsequent results set to a Jupyter notebook in CloudStor Swan
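
Purely as an illustration of this kind of hand-over, the sketch below queries a hypothetical portal search API for records with attached text in a date range and runs a simple word-frequency analysis of the sort a researcher might perform in a notebook; the endpoint, query parameters and response shape are all assumptions, not the actual portal API.

```python
# Purely illustrative: fetch text records from a hypothetical portal search API
# and run a simple word-frequency analysis, as a researcher might in a notebook.
# The endpoint, query parameters and response shape are all assumptions.
import re
from collections import Counter

import requests

PORTAL = "https://data.example.edu/api/search"      # hypothetical endpoint

def fetch_texts(query: str, start_year: int, end_year: int):
    response = requests.get(
        PORTAL,
        params={"q": query, "from": start_year, "to": end_year, "hasText": "true"},
        timeout=30,
    )
    response.raise_for_status()
    return [hit["text"] for hit in response.json().get("hits", []) if "text" in hit]

def word_frequencies(texts):
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    return Counter(words).most_common(20)

print(word_frequencies(fetch_texts("correspondence", 1950, 1960)))   # hypothetical query
```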

Arkisto already allows re-use of visualisation tools and viewers that can be embedded directly in a portal. We are planning a "widget store" that will enable researchers to choose and deploy a number of add-ons to the basic portal.

Institutional and discipline repositories

One of the major deployment patterns for Arkisto is to underpin an institutional data repository / archive function; see the UTS data repository for an established example.

In this use case, data is ingested into the repository via a research data management system which talks to the OCFL repository, not the portal. There is a clear separation of concerns: the portal's job is to provide controlled access and search services via an index, the OCFL repository keeps version-controlled data on disk, and the Research Data Management System handles deposit of data.

Manual archiving

Researcher deposits a dataset into an OCFL repository

At UTS the Research Data Management system in use is ReDBox - an open source platform for managing data across the research process, from research data management planning (via Research Data Management Plans (RDMPs)) to archiving and re-use of data. ReDBox has services for provisioning and/or tracking Research Workspaces, which are sites where research data collection and management take place. All of the data acquisition scenarios described above would qualify as Research Workspaces, as do file-shares on institutional storage, share-sync services such as CloudStor, survey platforms, and project management and version control systems such as GitLab and GitHub.

Fetching data from Research Workspaces

Soon, researchers at UTS will be able to use the Research Data Management System to select a workspace for archiving and add metadata as appropriate; the system will then deposit the data for them.

Coming soon: Researcher deposits a dataset by taking a snapshot of a dataset (or part thereof) in a Research Workspace
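
The sketch below shows the general shape of such a deposit, assuming a Research Workspace that is simply a directory of files; the final deposit step is a stand-in for whatever OCFL client or service the repository actually uses.

```python
# Hypothetical sketch: snapshot a Research Workspace directory into an RO-Crate
# staging area ready for deposit. The final deposit step is a stand-in for
# whatever OCFL client the repository uses (e.g. an OCFL library or service).
import json
import shutil
from datetime import date
from pathlib import Path

def snapshot_workspace(workspace: str, staging: str, name: str) -> Path:
    src, dest = Path(workspace), Path(staging)
    shutil.copytree(src, dest / "data")                    # copy the payload

    files = [p.relative_to(dest).as_posix()
             for p in (dest / "data").rglob("*") if p.is_file()]
    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
             "about": {"@id": "./"},
             "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}},
            {"@id": "./", "@type": "Dataset", "name": name,
             "datePublished": date.today().isoformat(),
             "hasPart": [{"@id": f} for f in files]},
        ] + [{"@id": f, "@type": "File"} for f in files],
    }
    (dest / "ro-crate-metadata.json").write_text(json.dumps(metadata, indent=2))
    return dest

staged = snapshot_workspace("/workspaces/project-x",        # hypothetical paths
                            "/staging/project-x", "Project X snapshot")
# deposit_object(staged)   # hypothetical: hand over to the OCFL repository service
```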

Publishing data

The UTS institutional data repository actually has two parts: an internal behind-the-firewall archive, with access control - to ensure that only authorized people can access data - and an external data portal for publicly accessible data. This architecture reduces the risk of data breaches by not allowing access through the firewall to sensitive or confidential data until secure tools are available to allow extra-institutional access.

Researchers can select a subset of an archived dataset to be published, or publish an entire dataset.

A "bot" "notices" that a new public dataset is available and copies it to the public repository, where it will be indexed and made available through the data portal.

A researcher can select a subset of data and have it published on the Public Data portal
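
To make the pattern concrete, the sketch below is one way such a bot might work. It assumes, hypothetically, that datasets cleared for publication carry accessRights of "public" on the crate's root dataset and that copying an object into the public repository's staging area is enough to trigger indexing; neither assumption describes the actual UTS implementation.

```python
# Hypothetical sketch of a "publication bot": scan the internal OCFL storage
# root for crates whose root dataset is marked public and copy them into the
# public repository's staging area. The accessRights convention and staging
# hand-off are assumptions. A real bot would resolve the latest version of each
# object via its OCFL inventory rather than copying the first crate it finds.
import json
import shutil
from pathlib import Path

def is_public(crate_metadata: dict) -> bool:
    root = next((e for e in crate_metadata.get("@graph", [])
                 if e.get("@id") == "./"), {})
    return root.get("accessRights") == "public"           # assumed convention

def publish_new_objects(internal_root: str, public_staging: str) -> None:
    staging = Path(public_staging)
    staging.mkdir(parents=True, exist_ok=True)
    for metadata_file in Path(internal_root).rglob("ro-crate-metadata.json"):
        crate = json.loads(metadata_file.read_text())
        if not is_public(crate):
            continue
        content_dir = metadata_file.parent                 # .../object/vN/content
        target = staging / content_dir.parent.parent.name  # crude object naming
        if not target.exists():                            # only copy new datasets
            shutil.copytree(content_dir, target)

publish_new_objects("/data/internal-ocfl", "/data/public-staging")   # hypothetical paths
```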

NOTE: There is no monolithic "Repository Application" that mediates all interactions with the file-based OCFL store, but rather a set of services which operate independently. This does mean that processes must be in place to ensure that there is no file contention, with two bits of software trying to update an object at the same time.
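
One simple way to coordinate independent services is an advisory lock per object, created atomically before an update and removed afterwards; the sketch below illustrates the idea and is not a recommendation of a particular mechanism.

```python
# Illustrative only: a per-object advisory lock to stop two services updating
# the same OCFL object at once. Uses atomic create (O_EXCL) of a lock file;
# a production setup might use a queue, database lock or OCFL-aware service instead.
import os
import time
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def object_lock(object_dir: str, timeout: float = 60.0):
    lock_path = Path(object_dir) / ".update-lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break                                    # we hold the lock
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {object_dir}")
            time.sleep(1)                            # another service is updating
    try:
        yield
    finally:
        os.close(fd)
        lock_path.unlink()                           # release the lock

# with object_lock("/data/ocfl-root/object-1"):     # hypothetical object path
#     ...                                           # safely write a new version
```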
