Use Cases, Software Architecture and Deployment Patterns

There are multiple use cases for Arkisto, which we will document in the abstract, in addition to the specific case studies we're already working on. Due to its standards-based and extensible nature, Arkisto can realise the goal of making data FAIR (Findable, Accessible, Interoperable, Reusable).

The (mythical) minimal Arkisto platform #

The simplest possible Arkisto platform deployment would be a repository with some objects in it, with no portal and no data ingest or preservation service. For example:

An OCFL repository containing two RO-Crate data objects

That might not seem very useful in itself, but a standards-based repository is a great candidate for data preservation - it puts the I (Interoperable) in FAIR data! It can also provide a basis for re-activation via services that Reuse the data by making it Findable and Accessible. Because Arkisto is built on standards, the repository is the core to which services can be added.

Adding a web portal #

To make the data Findable - the F in FAIR Data - just add one of the Arkisto Portal tools - this requires some configuration but significantly less than building a service from scratch.

Adding a portal with an indexer, configuration and authorization service

Data ingest pathways #

But how does data get into an OCFL repository? There are several patterns in production, in development and planned.

A snapshot repository #

Exporting a repository to RO-Crate can be a matter of writing a script to interrogate a database, convert static files, or otherwise traverse an existing dataset.

This pattern is used by the 2020 snapshot of ExpertNation - where we were given an XML file exported from Heurist and used a script to convert that data to the RO-Crate format. This RO-Crate can in turn be deposited in a repository - in this case the UTS data repository - and served via a portal, preserving the data while at the same time making it Accessible.

A conversion script converts existing objects in a repository to RO-Crate format and deposits them in an OCFL repository
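As an illustration of the conversion step, here is a minimal sketch in Python of turning a list of exported records into an RO-Crate metadata document. It is not the script used for ExpertNation: the function name `build_crate`, the record keys (`id`, `name`, `type`) and the choice of the schema.org `mentions` property to link records to the root dataset are all assumptions made for this example. Only the overall shape (a JSON-LD `@graph` with a metadata descriptor and a root `Dataset`) follows the RO-Crate 1.1 specification.

```python
import json

RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"


def build_crate(name, description, date_published, records):
    """Return an RO-Crate metadata document (as a dict) describing `records`.

    `records` is a list of dicts with the hypothetical keys "id", "name"
    and (optionally) "type"; a real export would need its own mapping.
    """
    # One contextual entity per exported record.
    entities = [
        {"@id": f"#{r['id']}", "@type": r.get("type", "Thing"), "name": r["name"]}
        for r in records
    ]
    # The root dataset describes the crate as a whole.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        # Assumption: link records to the dataset with schema.org "mentions".
        "mentions": [{"@id": e["@id"]} for e in entities],
    }
    # The metadata descriptor points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    return {"@context": RO_CRATE_CONTEXT, "@graph": [descriptor, root] + entities}


def write_crate(crate, path="ro-crate-metadata.json"):
    """Serialise the crate to the conventional metadata file name."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(crate, handle, indent=2)
```

A script like this would typically be run once per source dataset, with the resulting directory then deposited in the repository as a single object.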

A growing cultural collection #

In PARADISEC, research teams add collections and items using bespoke tools that were written for the archive. This is transitioning to a model where datasets will be described using Arkisto tools, particularly Describo.

A researcher describes data using Describo and deposits it in the PARADISEC repository

Field data capture #

Data from sensors in the field is often streamed directly to some kind of database with or without portal services and interfaces. There are multiple possible Arkisto deployment patterns in this situation. Where practical, the UTS eResearch team aims to take an approach that first keeps copies of any raw data files and preserves those. The team then builds databases and discovery portals from the raw data, although this is not always possible.

This diagram shows an approximation of one current scenario we're working with at UTS:

Sensor data capture via a vendor-supplied database, with a custom script to convert monthly reading

We are working to supply the research team in question with an OCFL repository, add a portal, and create additional data-capture pathways for genomics data and potentially more sensors. We could then move on to adding analytical tools, such as a dashboard that shows plots of sensor readings.

Analytical and housekeeping tools #

So far on this page we have covered simplified views of Arkisto deployment patterns with the repository at the centre, adding discovery portals and giving examples of data-acquisition pipelines (just scratching the surface of the possibilities). These things have value in themselves: making sure data is well described and as future-proof as possible is vitally important. But what can we DO with data?

OCFL + RO-Crate tools #

Having data in a standard stack, with OCFL used to lay out research data objects and RO-Crate to describe them, means that it is possible to write simple programs that can interrogate a repository. That is, you don't have to spend time understanding the organisation of each dataset. The same idea underpins Arkisto's web-portal tools: standardization reduces the cost of building.

  • Validators and preservation tools: there are not many of these around yet, but members of the Arkisto community as well as the broader OCFL and RO-Crate communities are working on these; expect to see preservation tools that work over OCFL repositories in particular.

  • Brute-force scripts: for moderate-sized repositories, it is possible to write scripts that examine every object in a repository and run analytics. For instance, it would be possible to visualise a number of years' worth of sensor readings from a small number of sensors or to look for the geographical distribution of events in historical datasets. We're working on several such use-cases at UTS at the moment.

Simple brute-force script to draw a graph can visit every object in OCF
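To make the brute-force idea concrete, here is a hedged Python sketch that visits every object in an OCFL storage root, reads the RO-Crate metadata from each object's latest version, and tallies the entity types it finds. It produces a count rather than a graph. The function names are invented for this example and it is not one of the Arkisto tools. It relies only on parts of the OCFL layout defined in the specification: the `0=ocfl_object_*` marker file, and the `head`, `manifest` and `versions` keys of `inventory.json`.

```python
import json
from collections import Counter
from pathlib import Path


def iter_ocfl_objects(storage_root):
    """Yield the root directory of every OCFL object under `storage_root`."""
    # Each OCFL object root holds a marker file named "0=ocfl_object_<version>".
    for marker in Path(storage_root).rglob("0=ocfl_object_*"):
        if marker.is_file():
            yield marker.parent


def read_head_file(object_root, logical_path):
    """Return the bytes of `logical_path` in the object's head version, or None."""
    inventory = json.loads((object_root / "inventory.json").read_text(encoding="utf-8"))
    # Compare digests case-insensitively by lowercasing both sides.
    manifest = {d.lower(): paths for d, paths in inventory["manifest"].items()}
    state = inventory["versions"][inventory["head"]]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # Manifest content paths are relative to the object root.
            return (object_root / manifest[digest.lower()][0]).read_bytes()
    return None


def tally_entity_types(storage_root):
    """Count @type values across every RO-Crate in the repository."""
    counts = Counter()
    for object_root in iter_ocfl_objects(storage_root):
        raw = read_head_file(object_root, "ro-crate-metadata.json")
        if raw is None:
            continue  # Not an RO-Crate object; skip it.
        for entity in json.loads(raw).get("@graph", []):
            types = entity.get("@type", [])
            if isinstance(types, str):
                types = [types]
            counts.update(types)
    return counts
```

Calling `tally_entity_types("/path/to/repo")` (the path is a placeholder) returns a `Counter` that could be passed to any plotting library. Because the script opens every object, its running time grows with the size of the repository.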

Adding databases and other indexes #

For larger-scale use, visiting every object in a repository can be inefficient. In these cases, using an index means that an analysis script can request all the data in a particular time-window or of a certain type - or any other query that the index supports. The Arkisto Portal tools are built around general-purpose indexes that can do full-text and faceted searching, with the potential to support either human or machine interfaces.

Adding an index can improve script performance

While the index engines used in our current portals are based on full-text search and metadata, we expect others to be built as needed by disciplines using, for example, SQL databases or triple stores.
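As a rough illustration of the SQL option, the following Python sketch builds a small SQLite index from a few fields of each object's metadata and answers a time-window query without touching the repository again. The table layout, column names and function names are assumptions for this example, not part of any Arkisto tool. Dates are assumed to be ISO 8601 strings in a single format (for example `YYYY-MM-DD`), so that string comparison matches chronological order.

```python
import sqlite3


def build_index(rows, database=":memory:"):
    """Create a SQLite index from (object_id, name, date_published) tuples."""
    conn = sqlite3.connect(database)
    conn.execute(
        "CREATE TABLE crates ("
        "object_id TEXT PRIMARY KEY, name TEXT, date_published TEXT)"
    )
    conn.executemany("INSERT INTO crates VALUES (?, ?, ?)", rows)
    # Index the date column so that window queries avoid a full table scan.
    conn.execute("CREATE INDEX idx_crates_date ON crates (date_published)")
    conn.commit()
    return conn


def objects_in_window(conn, start, end):
    """Return ids of objects with start <= date_published < end, oldest first."""
    cursor = conn.execute(
        "SELECT object_id FROM crates "
        "WHERE date_published >= ? AND date_published < ? "
        "ORDER BY date_published",
        (start, end),
    )
    return [row[0] for row in cursor]
```

The rows would normally be gathered by a single pass over the repository. After that, an analysis script only needs the index to decide which objects to open.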

Analysis tool integration #

Above, we have looked at how people or machines can access an Arkisto platform deployment by querying the repository, either directly or via an index. However, there is a much larger opportunity in being able to integrate Arkisto deployments with other tools in the research landscape. To take one example, text analysis is in great demand across a very wide range of disciplines. This hypothetical scenario shows the potential for a researcher to use a web portal to locate datasets which contain text and then send the whole results set to an analysis platform, in this case an interactive Jupyter notebook.

Researcher uses a web portal to locate datasets "Show me all records with attached text from 1950-196" and send the subsequent results set to a Jupyter notebook in CloudStor Swan

Arkisto already allows re-use of visualisation tools and viewers that can be embedded directly in a portal. We are planning a "widget store" that will enable researchers to choose and deploy a number of add-ons to the basic portal.