Scalable Zonal Statistics Pipeline for Large Raster and Vector Datasets

Teacher: Giovanna Venuti

Supervisor: Ing. Gianluca Murdaca – MindEarth

Description: The goal of this project is to explore and prototype a scalable pipeline for computing zonal statistics (per-zone summaries of raster values, such as count, mean, or sum, within vector polygons) on large geospatial datasets in a reproducible and efficient way. The student will review and test existing approaches to zonal statistics, compare alternative implementation strategies, and assess their suitability for large-scale processing in a modern data engineering environment. The student will also test RaQuet, an emerging raster-in-Parquet approach, and evaluate whether it can support or simplify raster-vector analytics workflows compared with more traditional formats and tools. The final deliverable will include a benchmark on selected datasets, a comparison of methods, and a proof-of-concept pipeline with recommendations for production adoption.
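As an illustration of the operation being benchmarked, the sketch below computes zonal statistics on a toy grid in pure Python. It is only a minimal reference under stated assumptions, not the project's method: the function name `zonal_stats`, the convention that zone id 0 means "outside every zone", and the nodata value -9999.0 are all hypothetical, and the zones are assumed to be already rasterized onto the same grid as the values. A slow but obviously correct baseline like this could serve as ground truth when checking the output of faster implementations.

```python
from collections import defaultdict
from statistics import mean


def zonal_stats(values, zones, nodata=None):
    """Return {zone_id: {"count", "min", "max", "mean"}} for two aligned grids.

    `values` and `zones` are lists of rows with identical shape.
    Assumption: zone id 0 marks cells that belong to no zone.
    """
    per_zone = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone_id in zip(value_row, zone_row):
            # Skip cells outside every zone and cells flagged as nodata.
            if zone_id == 0 or value == nodata:
                continue
            per_zone[zone_id].append(value)
    return {
        zone_id: {
            "count": len(cells),
            "min": min(cells),
            "max": max(cells),
            "mean": mean(cells),
        }
        for zone_id, cells in per_zone.items()
    }


# Toy 3x3 raster and a matching grid of rasterized zone ids.
values = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, -9999.0],
]
zones = [
    [1, 1, 2],
    [1, 1, 2],
    [0, 2, 2],
]

stats = zonal_stats(values, zones, nodata=-9999.0)
for zone_id in sorted(stats):
    print(zone_id, stats[zone_id])
```

A real pipeline would replace the nested loops with vectorized or SQL-based aggregation and would need to handle polygon rasterization, partial pixel coverage, and tiling, which this sketch deliberately leaves out.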

Difficulty level: Medium

Requirements: Good knowledge of Python and geospatial data processing (pandas, GeoPandas). Familiarity with raster and vector formats (GeoTIFF, Parquet), performance evaluation, and data pipeline development is required. Experience with SQL engines, Parquet-based workflows, and distributed processing tools is a plus.
