Present and future of data cubes: a European EO perspective

OpenGeoHub
Jan 26, 2023 · 14 min read


Prepared by: Carson Ross (OpenGeoHub), Tom Hengl (OpenGeoHub), Leandro Parente (OpenGeoHub), Vasile Crăciunescu (TerraSigna)

Data Cubes are highly organised data infrastructures enabling users to run new analyses and generate insights into processes, patterns and trends. In the context of Earth Observation (EO) projects, data cubes are commonly time-series of spatiotemporal images, or long lists of measurements or predictions of biophysical variables coming from station data (points). We invited some of the key European developers of modern Data Cubes to the open discussion forum entitled “Present and Future of Data Cubes” at the EuroGEO workshop 2022 in Athens. Questions addressed included:

  • Which Open Source solutions provide cutting-edge functionality to process EO data, and how can we make EO data processing more transparent and more reproducible?
  • Which commercial Data Cube services (covering the whole of the EU, the whole world, or the majority of global EO data) currently exist, and what is the best way to use them for projects?
  • Which cloud-native data formats are applicable to store and share gridded, vector and tabular data?
  • Which web-mapping technologies and web clients do you usually use to expose and visualise your data cubes?
  • How can we all make environmental data more usable, accessible and more relevant?

The videos of the talks are available for watching via our TIB channel. The presentations include demos of functionality and proposals for future development work.

Data Cubes discussion forum at EuroGEO

Critical environmental information is heavily under-utilized because it requires a high level of expertise and computing capacity. EO data is not yet a commodity and neither is environmental information, which has led to a fragmented data space defined by a seemingly endless production of new tools and services that can’t interoperate and aren’t accessible by people outside of the deep tech community (read more). One way to make EO and environmental data more usable is to organize them through Cloud data services as Data Cubes. A “Data Cube” is typically a multiarray, multi-thematic, analysis-ready collection of data; in the case of environmental sciences, a Data Cube is often a time-series of images (rasters) or long lists of measurements or predictions of biophysical variables coming from station data (points), with explicit geographical coordinates and time references (beginning and end of measurements). In the “Present and Future of Data Cubes” discussion forum organised by Horizon Europe’s Open-Earth-Monitor project and OpenGeoHub, international experts and project partners discussed how we can use existing technology to make environmental data cubes derived from Earth Observations and field monitoring more usable, accessible and relevant to decision-makers and society, and what to expect in the near and longer-term future.

Jumping off from a discussion of existing top-level solutions for the effective use of large EO data in real-world applications, we specifically focused on open-source solutions that can process massive volumes of EO data in a more transparent and reproducible way. On top of public services, commercial solutions can play a massive role, so we also analysed and discussed existing commercial Data Cube services covering the whole of Europe and the world to understand the best way to use them. Finally, the forum gave the stage to more in-depth discussions around cloud-native data formats and their applicability to store and share gridded, vector and tabular data, as well as the importance of licences governing the way data can be shared and used.

The presentations included in this discussion forum are summarised in the sections below.

European projects supporting building and serving of Data Cubes

Data cubes are the foundation of an increasing number of European projects. The tools presented at the EuroGEO discussion forum illustrated the variety of uses to which data cubes can be applied, such as monitoring agriculture, biodiversity, urban climate change, and water resources. They also included representation from projects at every stage of development, from the concept phase, such as B³ (“B-Cubed”), a set of data cubes intended to aggregate different sources of biodiversity data to be interoperable with remote sensing data, to publicly available products, like EcoDataCube.eu, which maps land cover changes at high spatial resolution with gap filling through Machine Learning algorithms.

Video presentation of the B³ project’s Data Cube. View the remaining presentations and full discussion forum here.

Presenters and participants had the opportunity to hear about and evaluate the pros and cons of different back-end technologies and data formats for different uses such as web-mapping, data visualization, and the sharing of metadata. Using Slido.com, participants were asked to respond anonymously to polls and word clouds on these topics, as well as to ask questions of the presenters. Out of 70 participants, 57 engaged with the polls and online Q&A, with an average of 24 responses to each question. Poll results can be seen below.

Poll results: which of the following data services do you access at least once per year?
Poll results: which scalable solution do you usually use to distribute multiarray data?

Priorities for Data Cubes evolution

Users and developers discussed some of the main trends in the evolution of data cubes and best practices moving forward, such as how to overcome bottlenecks, and key technologies to improve efficiency and accessibility. Some key poll results are provided below.

Poll results: what are the main bottlenecks of using EO data for data science projects in your organisation? (1/2)
Poll results: what are the main bottlenecks of using EO data for data science projects in your organisation? (2/2)
Poll results: what should be the priority for the data cube evolution?
Poll results: what do you think will be the key technology for the future of data cubes?

ARCO and 4C’s

For state-of-the-art data cubes it is important to emphasize the term “ARCO”, which stands for Analysis-Ready, Cloud-Optimized (Stern et al., 2022). The meaning of this term is explained below. Most important to note about ARCO is that, unlike data systems from a decade ago, modern data cubes should ideally be Cloud-native (meaning: ready for fast and efficient web services, scalable applications and APIs) and pre-processed so that they can be used directly for modelling and eventually for decision-making.

Not all data is the same. In the most generic terms, every project starts with raw data, which comes from observations and measurements, i.e. it is usually downloaded directly from instruments. It can be gradually “enriched”, so the typical hierarchy of data is:

  • Raw data ↓
  • Cleaned data ↓
  • Analysis-ready data ↓
  • Decision-ready data ↓
  • Decisions.

For example, vector maps of roads of an area coming from different sources are the raw data. These can be cleaned to remove artifacts and/or outdated elements. Once the cleaned data are further converted to layers that serve specific GIS operations, they can be considered analysis-ready. The next version of these data is the decision-ready data which, for example, tells drivers where to turn and which road is one-way.

In order to maximize the usability of data, it is important to have data at least at the level of “analysis-readiness”. Only the most specialized users require the original raw version of the data; the vast majority of users typically cannot afford the pre-processing of raw data, especially raw EO images that can be Petabytes in size. Users interested in developing applications on top of the data especially require that the data is analysis-ready.

What does “analysis-ready” mean exactly? As a rule of thumb, there are four common criteria that define whether data is analysis-ready (the so-called “4 Cs”):

  1. Completeness: data is available for all, or at least 99%, of the pixels of interest;
  2. Consistency: file names, variable names and relationships are consistent, and everything is documented via metadata;
  3. Currency: the user is using the most up-to-date version of the data;
  4. Correctness: the served data is the most correct / highest possible quality version.

In other words, the ARCO version of the data should be the highest-quality version, fully documented and optimized for web services and advanced analysis.
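As a small, hedged illustration of the “completeness” criterion, the sketch below uses the rasterio and numpy Python packages to compute the share of valid (non-nodata) pixels in a single raster layer; the file name is a hypothetical placeholder, not a real product:

```python
# Minimal sketch of a "completeness" check for one raster layer.
# The file name is a hypothetical placeholder; a nodata value is assumed to be set in the file.
import numpy as np
import rasterio

with rasterio.open("landcover_2020.tif") as src:
    band = src.read(1, masked=True)                  # nodata pixels come back masked
    valid_fraction = 1.0 - np.ma.getmaskarray(band).mean()

print(f"Valid pixels: {valid_fraction:.1%}")
if valid_fraction < 0.99:
    print("Layer does not meet the 99% completeness rule of thumb")
```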

How to build a modern data cube?

Now that you understand the ARCO and 4 Cs concepts, here are a few important tips to help you organise your own data into data cubes. First, select a data cube solution, i.e. a storage option / data format for your data. The following modern options were mentioned at the discussion forum (a minimal sketch for the first two options follows the list):

  1. Cloud Optimised GeoTIFFs (COGs) stored on S3 (Simple Storage Service) or similar;
  2. ZARR format files stored on S3 or similar;
  3. Network Common Data Form (NetCDF);
  4. Array DBMS (e.g. as used in Rasdaman).
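To make the first two options more concrete, here is a minimal sketch (assuming hypothetical file names and the rio-cogeo, xarray, zarr and dask Python packages) of converting a GeoTIFF into a COG and writing a small time-series cube to ZARR; the results would then typically be uploaded to S3 or similar object storage:

```python
# Hedged sketch: produce COG and ZARR versions of a layer / small cube.
# File names are hypothetical placeholders; requires rio-cogeo, xarray, zarr and dask.
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles
import xarray as xr

# 1. GeoTIFF -> Cloud Optimised GeoTIFF (internal tiling, overviews, deflate compression)
cog_translate("ndvi_2022.tif", "ndvi_2022_cog.tif", cog_profiles.get("deflate"))

# 2. A (small) time-series cube, e.g. read from NetCDF -> ZARR, chunked along time
ds = xr.open_dataset("ndvi_timeseries.nc")
ds.chunk({"time": 12}).to_zarr("ndvi_timeseries.zarr", mode="w")
```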

For tabular and vector data (points, lines, polygons), there is currently no cloud-native format able to provide the full level of functionality available through COGs and ZARR; however, the following options are the most promising (a short reading sketch follows the list):

  • FlatGeobuf (spatial filter support for remote files);
  • GeoParquet (currently without spatial optimizations — #13);
  • PMTiles (organised by tile and optimized for visualization).
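As a small, hedged example of what “cloud-native” can mean for vector data: with GeoPandas (and GDAL’s FlatGeobuf driver underneath), a bounding-box read of a remote FlatGeobuf file only fetches the byte ranges it needs. The URL and file names below are hypothetical:

```python
# Sketch: spatially filtered read of a remote FlatGeobuf file, plus a GeoParquet read.
# URL and file names are hypothetical placeholders; requires geopandas (and pyarrow for Parquet).
import geopandas as gpd

# Only features intersecting the bounding box are fetched, via HTTP range requests
roads = gpd.read_file(
    "https://example.org/roads_eu.fgb",          # hypothetical remote FlatGeobuf
    bbox=(23.6, 37.9, 23.8, 38.1),               # minx, miny, maxx, maxy (WGS84)
)

# GeoParquet: fast columnar reads, but currently no remote spatial filtering
stations = gpd.read_parquet("stations.parquet")  # hypothetical local file
print(len(roads), len(stations))
```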

The second important step is to select a solution to “publish your data”, i.e. provide the data through a web service and via visualization tools (“view and explore data”). Typical steps include:

  1. Prepare your data in some Cloud-native format, analysis-ready and fully documented, with a consistent file naming convention, spatial resolution, bounding box, etc.
  2. Upload your data to a server with a storage service able to provide HTTP range requests (e.g. S3 and/or Zenodo.org).
  3. Register metadata in a standardized catalog (e.g. GeoNetwork, STAC; see the sketch after this list).
  4. Register your data on some public repository (e.g. STAC Index, data.europa.eu).
  5. Provide an interactive web-GIS interface to the data (your own hosting).
  6. Provide a tutorial / computation notebook on how the data was produced and how to use it; include a disclaimer, licence, technical documentation, etc.
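For step 3, a heavily simplified sketch of what registering metadata could look like with the pystac Python library is given below; all identifiers, footprints, dates and URLs are placeholders, not values from any real catalog:

```python
# Sketch: describe a single COG asset as a STAC Item with pystac.
# All ids, hrefs, dates and geometries are hypothetical placeholders.
from datetime import datetime, timezone
import pystac

item = pystac.Item(
    id="ndvi_2022_eu",
    geometry={  # GeoJSON footprint of the layer
        "type": "Polygon",
        "coordinates": [[[-11, 34], [32, 34], [32, 72], [-11, 72], [-11, 34]]],
    },
    bbox=[-11, 34, 32, 72],
    datetime=datetime(2022, 7, 1, tzinfo=timezone.utc),
    properties={},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="https://example.org/ndvi_2022_cog.tif",  # hypothetical COG location
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
item.validate()  # check the item against the STAC JSON schema (needs jsonschema installed)
item.save_object(dest_href="ndvi_2022_eu.json", include_self_link=False)
```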

Here are a few easy-to-use open source solutions fit for this purpose:

  1. GeoNetwork, GeoNode, pyCSW and STAC (SpatioTemporal Asset Catalogs) to register metadata for geospatial assets (raster and vector) and serve them;
  2. Leaflet + rio-tiler to visualise geospatial data;
  3. XCUBE + Viewer App out-of-box solution to create a web-GIS, especially suitable for time-series of images;
  4. leafmap, a Python package for geospatial analysis and interactive mapping in a Jupyter environment (see the short sketch after this list);
  5. Rasdaman (raster data manager).
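As a quick, hedged illustration of option 4, the snippet below (intended for a Jupyter notebook) displays a remote COG with leafmap; the URL is a placeholder, and leafmap delegates tile rendering of COGs to a TiTiler service:

```python
# Sketch: interactive inspection of a remote COG in a Jupyter notebook with leafmap.
# The COG URL is a hypothetical placeholder.
import leafmap

m = leafmap.Map(center=(48.0, 10.0), zoom=5)
m.add_cog_layer(
    "https://example.org/ndvi_2022_cog.tif",  # any HTTP range-request-enabled COG
    name="NDVI 2022",
)
m  # display the interactive map in the notebook
```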

If you are looking for publicly available (open data) data cubes, you might check some of the popular portals covering the world / Europe (unsorted):

Zenodo.org recently added support for HTTP range requests, hence if you put a copy of your COGs on Zenodo.org (publicly funded hosting), these can be accessed directly in QGIS or similar (read more about how to open and use COGs in QGIS).
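The same range-request mechanism works from code: the sketch below (with a placeholder Zenodo URL) uses rasterio to read only a small window of a remote COG instead of downloading the whole file:

```python
# Sketch: read a small window of a remote COG via HTTP range requests.
# The URL is a hypothetical placeholder; rasterio uses GDAL's /vsicurl/ under the hood.
import rasterio
from rasterio.windows import from_bounds

url = "https://zenodo.org/record/0000000/files/ndvi_2022_cog.tif"  # placeholder record

with rasterio.open(url) as src:
    # Window covering a small area of interest (coordinates must be in the file's CRS,
    # here assumed to be WGS84 longitude/latitude)
    window = from_bounds(23.6, 37.9, 23.8, 38.1, transform=src.transform)
    data = src.read(1, window=window)  # only the required internal tiles are fetched

print(data.shape, data.dtype)
```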

Some key publications of interest on the topic of Data Cubes include the MDPI Special Issue “Earth Observation Data Cubes” and the book Big Data Analytics in Earth, Atmospheric and Ocean Sciences. Some introductory papers you might want to consider reading include:

The importance of data licences and usage rights: a discussion

This forum featured presentations from a series of open-source, European Commission funded projects, as well as EuroDataCube.com, a commercial offering encompassing apps, APIs and libraries for EO data processing, developed by Sinergise and similar commercial businesses. This dichotomy sparked some debate, both during and after the event, on the importance of open licences, standards and APIs. A summary of this discussion is provided below.

In our opinion, it is crucial that before any serious collaboration or co-development begins, the target licences are set, for example, the use of compatible licences, preferably open source (e.g. https://opensource.org/licenses) and open data licences (e.g. https://opendatacommons.org/licenses/). This creates “a level playing field” and reduces potential bureaucracy.

However, Grega Milcinski (Sinergise, developer of EuroDataCube) noted that open-source software is not a mandatory prerequisite; what is more important is that:

  1. Software (service) is reusable — i.e. it is possible to run processes with some other setting (different AOI, time period);
  2. APIs are well defined and usable (they don’t necessarily need to be “standard” — OpenEO is an example of a community-accepted and now commonly used API, whereas OGC WCS 2.0 standard is very rarely used);
  3. APIs are operational, in a “self-service” mode (bespoke solutions introduce too much friction, making it impossible to try something at a small scale, and therefore almost never reaching a scale large enough to justify the overheads); and
  4. It is straightforward and feasible to inter-change one operational API for another (to prevent lock-in) — OpenEO shines here as this is built-in by design.

Grega sees EO processing capabilities as similar to cloud infrastructure. AWS, GCP, Azure, CreoDIAS, for example, are not open-source, nor are they “standard”. Yet nobody feels locked-in by technology. It can be a nuisance to move from one cloud to another (technologically speaking), but it is possible. Docker containers probably play an important role in this.

However, he notes the “open-source” part of the software has the following (potential) benefits:

  1. Transparency, as it can be thoroughly validated (no black box) — but there needs to be someone capable/willing to do that;
  2. Off-the-shelf capability (GDAL is a popular, commonly used example; on the other hand, sen2cor is not fully open-source, yet it is still commonly used);
  3. Possibility to build on top of it (SNAP is an example).

And yet, he caveats, to achieve the above it is not sufficient to just mark software as open source. The software must also be developed and designed with open-source in mind. Sen4CAP is an example: it was built from the start as open-source, mandated by ESA, and hence anyone can extend and improve it. Yet it is such a complex piece of software that whoever wants to introduce any modifications (“building on top of it”) essentially needs to involve the software’s authors.

Additionally, Grega emphasizes that with big data APIs, there is often as much value in the “dev-ops” as in the software itself. Taking the software alone would require a huge investment to make it work properly, something that most small players cannot afford. Big ones can: AWS is benefiting a lot from these concepts.

To summarize, Grega believes that the data does not always have to be free: it costs to create, it costs to maintain, it costs to distribute, and it costs to use it, and many users prefer professional, maintained, quality-controlled services. Some parts of those costs can be covered by the community (e.g. Horizon Europe, Copernicus), and the cost of data should be orders of magnitude smaller than the cost of the added value one can create with this data. It is all relative. What is clear, however, is that licences, both for the data and the software (use), should be clear and agreed up front.

Erwin Goor of the European Commission (Project Advisor, Environmental Information) says that the crucial element is to design with a clear long-term exploitation path in mind. If you are successful, the tools/solutions will be brought into operations later on (by you, by your partners, or by a third party). You need to ensure now (including through clear agreements on licences, if any, in your consortium agreement) that you do not run into surprises. Common examples of such surprises are excessive or unexpected licence costs when you need to scale up later, or obstacles to further development.

For Quentin Groom of the B³ project, the issue of data licences is particularly important due to the unique nature of biodiversity data. Biodiversity data are shared in GBIF using a licence waiver (CC0) or one of two Creative Commons licences (CC-BY or CC-BY-NC). In the EU, database copyright only covers the structure of the database and not the contents, but the contents may have Sui Generis rights, though that is not clear and possibly varies with the method used to collect the data. There is certainly a lack of clarity about what the copyright position is and what these licences mean legally; nevertheless, you also do not want to aggravate data providers by using data in a manner that they might not have intended.

B³ will create derivatives/models of the raw data and can do so under any of these licences because the project is doing so non-commercially. Nevertheless, the project strives to create these derivatives/models so that anyone can use them, including commercial organisations. Yet information on the licensing of derivatives and models is incredibly difficult to find, and it is unclear whether they would be covered by database copyright or Sui Generis rights.

The safest option seems to be to create data cubes both with and without CC-BY-NC data included. The issue is not just a legal one; it is about being fair to data providers and data users and trying to comply with their intentions, even though it is impossible to know those intentions for sure. It is likely that most data providers and users have no idea about the legality of data reuse.

Quentin goes on to recommend the below paper for useful information on copyright, Sui Generis and licences:

In it, they say: “How counterintuitive this may sound, data that are created are not protected by Sui Generis Database Right, only data that are obtained.” Therefore, data from models (which are created) are not copyrightable and do not have Sui Generis rights. Of course, this does not solve any ethical issue there might be in reusing data, but it does clarify the legal position.

Other questions that merit further discussion also popped up during the forum, for example: which platforms should we use to register data cubes, and how can we set up an independent impact/traffic system? How do we track impact and usage? We have tried to provide some practical tips on how to start with Data Cubes and achieve ARCO / the 4 Cs. Let us know in the comments below if you want to know more!

Acknowledgements

The Open-Earth-Monitor Cyberinfrastructure project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement № 101059548.
