22 January 2024
Traditional geospatial data management often involves on-premise infrastructure, making it challenging to scale resources as data volumes grow. Additionally, data formats and storage solutions can be rigid, causing difficulties in sharing and collaboration. In response to these limitations, the storage and processing of geospatial data has shifted significantly towards a "cloud-native" approach. In a cloud-native context, geospatial data is managed and processed using cloud services, which often abstract away the complexities of infrastructure management. This shift to the cloud allows for more dynamic resource allocation, so that computing resources match the demands of geospatial applications.
Since the emergence of cloud computing, users can no longer be expected to download, store, and work with large files on their machines. Instead, they'll want to access large volumes of data over a network, in chunks, which means the data must be made available via subsetting methods. Geospatial data is no exception: cloud-optimized data formats must cater for this. The optimal packaging depends on the data type and the specific use case, so no universally applicable format exists. Several cloud-optimized formats have emerged, each with its pros and cons for storing and serving geospatial data on the cloud.
Image source: guide.cloudnativegeo.org
Notwithstanding their diversity, all cloud-optimized formats have one characteristic in common: they contain metadata that includes the addresses of data blocks. A cloud-native dataset is a dataset divided into small addressable chunks, via separate files, internal tiles, or both. As a result, a client can read the dataset partially and in parallel using HTTP range requests, which also makes it highly compatible with object storage (a file storage alternative to local disk).
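To make the idea of a range request concrete, here is a minimal Python sketch using only the standard library. It builds the Range header a client would send to ask for a single chunk of a remote file. The URL and the function names are hypothetical and chosen purely for illustration; the request is prepared but not sent.

```python
from urllib.request import Request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return f"bytes={offset}-{offset + length - 1}"


def build_chunk_request(url: str, offset: int, length: int) -> Request:
    """Prepare (but do not send) a GET request for one chunk of a remote file."""
    return Request(url, headers={"Range": range_header(offset, length)})


# Hypothetical URL, used only for illustration.
req = build_chunk_request("https://example.com/data/scene.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
```

Passing such a request to urllib.request.urlopen would, on a server that supports range requests, typically yield a 206 Partial Content response containing only the requested bytes.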
To get a grip on how this could work, let's take a look at a Cloud-Optimized GeoTIFF (COG) and see how it enables more efficient workflows for raster data on the cloud. Here's what cogeo.org says on the topic:
Cloud Optimized GeoTIFF relies on two complementary pieces of technology.
The first is the ability of GeoTIFF’s to store not just the raw pixels of the image, but to organize those pixels in particular ways. The second is HTTP GET range requests, that let clients ask for just the portions of a file that they need. Using the first organizes the GeoTIFF so the latter’s requests can easily select the parts of the file that are useful for processing.
COGs are powerful because of how the data is structured internally. There are two crucial aspects of a COG that enable it to be cloud-optimized: Tiling and Overviews.
Concept of pyramidal TIFF visualized by www.kitware.com/deciphering-cloud-optimized-geotiffs
Tiling arranges the bytes of the image data in so-called tiles so that geographically close data are adjacent within the file. The metadata associated with the COG holds information (TileOffsets and TileByteCounts) about each of these tiles. Quick access to a certain area is thus made possible for HTTP range requests, so that only the portion of the file that needs to be read is accessed.
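As a rough illustration of how a reader might use this metadata, the sketch below maps a pixel position to the byte range of the tile that contains it. It assumes a single image plane with tiles stored left-to-right, top-to-bottom, as the TIFF specification describes; the function name and the sample offsets and byte counts are made up for the example.

```python
def tile_byte_range(col, row, tile_width, tile_height, image_width,
                    tile_offsets, tile_byte_counts):
    """Return the inclusive (start, end) byte positions of the tile
    that covers pixel (col, row)."""
    # Number of tiles per row of tiles, rounding up for a partial last tile.
    tiles_across = (image_width + tile_width - 1) // tile_width
    # Tiles are indexed left-to-right, then top-to-bottom.
    index = (row // tile_height) * tiles_across + (col // tile_width)
    start = tile_offsets[index]
    end = start + tile_byte_counts[index] - 1
    return start, end


# Made-up metadata for a 1024 x 1024 image with 512 x 512 tiles (4 tiles).
offsets = [1000, 5000, 9000, 12000]   # stands in for TileOffsets
counts = [4000, 4000, 3000, 3500]     # stands in for TileByteCounts

start, end = tile_byte_range(600, 700, 512, 512, 1024, offsets, counts)
print(f"bytes={start}-{end}")  # bytes=12000-15499
```

The resulting pair is exactly what goes into the Range header of an HTTP request, so the client downloads one tile rather than the whole file.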
Overviews refer to reduced-resolution versions of the main raster image. The overviews are organized in a hierarchical structure, forming a pyramid of progressively lower resolutions. This pyramid of overviews allows for faster retrieval and display of the data at different zoom levels, optimizing performance in cloud-based geospatial workflows.
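The choice of pyramid level can be sketched in a few lines of Python. The example assumes overviews are described by decimation factors relative to the full-resolution image (powers of two are common, but that is an assumption here), and picks the coarsest level that is still at least as detailed as the resolution the client asks for. The function name and the numbers are illustrative; real readers such as GDAL apply their own selection logic.

```python
def pick_overview(base_resolution, decimation_factors, target_resolution):
    """Return the decimation factor of the coarsest level whose pixel size
    does not exceed target_resolution (1 means the full-resolution image)."""
    best = 1
    for factor in sorted(decimation_factors):
        # Pixel size of this overview level, in the same units as the base.
        if base_resolution * factor <= target_resolution:
            best = factor
    return best


# Illustrative values: a 10 m base image with four overview levels.
factors = [2, 4, 8, 16]
print(pick_overview(10.0, factors, 50.0))  # 4 (40 m pixels are enough for a 50 m view)
print(pick_overview(10.0, factors, 5.0))   # 1 (nothing coarser is detailed enough)
```

A map client zoomed far out would therefore read a small overview, and only request full-resolution tiles once the user zooms in.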
Strategic tiling and overviews put the right structure on the GeoTIFF so that HTTP range requests can fetch just the part of the file that is relevant. Overviews prove valuable when rendering a quick image of the entire file. Instead of downloading every pixel, clients can efficiently request smaller, pre-existing overviews. Tiles play a role when processing or visualizing a specific area of the overall file. This could be within an overview or at full resolution. Either way, tiles streamline the retrieval of relevant bytes from a section of the file, enabling the HTTP range request to acquire precisely what is needed.
The Cloud-Native Approach
In the cloud-native approach, cloud-optimized files (e.g. COG, FlatGeobuf, GeoParquet, Zarr, Kerchunk, ...) reside in a scalable cloud object storage system (e.g. Amazon S3, Google Cloud Storage, Azure Blob Storage, ...). A serverless function (e.g. AWS Lambda, Google Cloud Functions, Azure Functions, ...) then dynamically uses cloud resources to process these files on demand. This translates to a model of automatic scaling, ensuring efficient processing regardless of the number of requests or the file size. In this context, tools, libraries, and tile formats (TiTiler, GeoPandas, Rasterio, Cogeo, MVT, ...) can be employed in a serverless manner to efficiently generate, serve and visualize cloud-optimized tiles from large geospatial datasets.
This approach comes with distinct advantages: scalability, as cloud-optimized formats on object storage support parallel read requests, simplifying the management of large datasets; reduced latency, as fetching and processing only the required subset of the data is faster than downloading entire files first; and flexibility, allowing users to customize data access and perform complex operations without the need to download entire datasets. These benefits collectively enhance the speed and adaptability of geospatial data processing, while the processed results remain accessible through APIs, facilitating seamless integration into a variety of applications.
In contrast, the traditional on-premise approach involves storing files on a local server within an organization's data center. Processing, in this case, demands manual intervention on a traditional server, necessitating resource allocation. Scaling resources for multiple processing requests can be cumbersome and entails a greater upfront investment in hardware. The processed results are then accessed directly from the server, with integration into applications requiring manual steps.
The benefits of the cloud-native approach thus become evident when considering scalability, latency and flexibility. Moreover, the pay-as-you-go model inherent in cloud-native platforms can result in notable cost savings compared to the upfront hardware investments required in traditional approaches.
Challenges and Considerations
Effectively managing geospatial data in the cloud presents some challenges. The distributed nature of cloud environments introduces complexities in maintaining uniform data quality and consistency. At the same time, interoperability becomes a challenge because of the diverse formats in which geospatial data may exist. Standardizing data formats and protocols promotes seamless data exchange and integration.
High dependence on the network is another inherent challenge that can impact the efficient transfer and processing of geospatial data. Optimizing data transfer methods, minimizing unnecessary transfers, and leveraging edge computing where feasible can enhance overall network efficiency and reliability.
Cost management poses another challenge, and organizations should employ cost-monitoring tools, optimize data storage, and leverage tiered storage options to mitigate expenses.
Lack of knowledge and expertise in cloud-native geospatial technologies is a common challenge. For organizations navigating these challenges, Nazka Mapps offers support.
Looking into the future of cloud-native geospatial technologies, several emerging trends promise to reshape the landscape, presenting opportunities for transformative advancements in the industry. One notable trend is the increasing integration of artificial intelligence (AI) and machine learning (ML) into geospatial data analytics. These technologies offer the potential to unlock valuable insights from vast datasets, enabling more sophisticated analysis, pattern recognition, and predictive modeling.
Another key trend is the rise of edge computing in geospatial applications. By processing data closer to the source, edge computing minimizes latency and enhances real-time decision-making in applications such as autonomous vehicles, smart cities, and IoT devices. This shift towards decentralized processing aligns seamlessly with the distributed nature of cloud-native approaches, contributing to more responsive and scalable geospatial solutions.
Advancements in 3D geospatial visualization and augmented reality (AR) also mark a compelling trend. Integrating these technologies into cloud-native platforms offers immersive and interactive experiences, revolutionizing how users interact with geospatial data. This has implications across various industries, from urban planning and architecture to gaming and tourism.
Additionally, the evolution of standardization efforts in geospatial data formats and protocols contributes to a more interoperable and collaborative ecosystem. As more organizations embrace cloud-native approaches, interoperability becomes increasingly vital for seamless data exchange and collaboration across different platforms and systems.
The ongoing expansion of cloud-native geospatial services, such as serverless computing, containerization, and microservices architectures, continues to empower organizations with scalable, cost-effective, and flexible solutions. These technologies enable efficient resource utilization, rapid deployment, and improved scalability, shaping the future of geospatial data management.
In conclusion, the future of cloud-native geospatial technologies holds exciting prospects, and it is a future we’re looking forward to being part of.