Research Proposal Example: GeoLifeCLEF 2024

CS8903: Special Problems Project Proposal

Name: Anthony Miyaguchi <acmiyaguchi@gatech.edu>
Student ID: amiyaguchi3
Date: 2023-11-08

Spatio-temporal species distribution estimation for GeoLifeCLEF 2024 with unsupervised representation learning of remote sensing data

Objective

The objective of this special problems project is to compete in the GeoLifeCLEF 2024 challenge and publish a working notes paper at the CLEF 2024 conference detailing the implemented system, in collaboration with student peers at Georgia Tech. The scope of work is estimated at 3 credit hours, or 150 hours of work, by the primary author.

Background and Motivation

CLEF (the Conference and Labs of the Evaluation Forum, formerly the Cross-Language Evaluation Forum) is an information retrieval conference with a heavy emphasis on experimentation on shared tasks. GeoLifeCLEF is a challenge hosted by the LifeCLEF lab within CLEF.

The GeoLifeCLEF dataset combines five million heterogeneous presence-only records and six thousand exhaustive presence-absence surveys collected from 2017 to 2021. Models are trained with environmental data such as 10-meter resolution RGB and near-infrared satellite images and climatic variables.

Data Preprocessing

We transform domain-specific geospatial rasters (GeoTIFF) into a format optimized for distributed, parallel data access patterns (Parquet). We convert an area of interest (AOI) into a regular lattice of square tiles and, for each tile, store the relevant features cropped to the tile's bounding box. We store all data in a Parquet dataset so that it can be loaded in bulk into Spark or Torch.
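As an illustration of the tiling step, the following is a minimal sketch of how an AOI bounding box could be divided into a regular lattice of square tiles. The `Tile` class, the `tile_lattice` function, and the choice of a projected coordinate system with units matching `tile_size` are assumptions made for this example; cropping the rasters and writing Parquet are out of scope here.

```python
import math
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Tile:
    """One square cell of the lattice (hypothetical structure for this sketch)."""
    row: int
    col: int
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def tile_lattice(min_x, min_y, max_x, max_y, tile_size) -> Iterator[Tile]:
    """Yield square tiles covering the AOI bounding box, row by row.

    Tiles on the upper and right edges may extend past the AOI so that
    every tile has the same size.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    n_cols = math.ceil((max_x - min_x) / tile_size)
    n_rows = math.ceil((max_y - min_y) / tile_size)
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * tile_size
            y0 = min_y + row * tile_size
            yield Tile(row, col, x0, y0, x0 + tile_size, y0 + tile_size)


# Usage: a 1000 x 500 AOI split into 250-unit tiles gives 2 rows of 4 tiles.
tiles = list(tile_lattice(0, 0, 1000, 500, 250))
```

Each tile's bounding box could then be used to crop the source rasters, with the `(row, col)` pair serving as a natural key for the stored features.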

We create two development datasets with a maximum partition size of 1GB. The first is a subset of the data that covers a small geographic area encompassing a city, forest, and mountain. The second is a label dataset that contains the minimum features for density estimation, e.g., latitude, longitude, date, and a positive indicator of species presence.
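A rough sketch of what a row of the label dataset and the two subsetting rules might look like is given below. The field names (`observed_on`, `species_id`, `is_present`), the helper functions, and the interpretation of 1GB as 10^9 bytes are all assumptions for illustration, not the final schema.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabelRecord:
    """Minimal features for density estimation (hypothetical field names)."""
    latitude: float
    longitude: float
    observed_on: date
    species_id: int   # assumed identifier column
    is_present: bool  # positive indicator of the species


def in_bbox(record: LabelRecord, south: float, west: float,
            north: float, east: float) -> bool:
    """Return True if the record falls inside the development-area bounding box."""
    return (south <= record.latitude <= north
            and west <= record.longitude <= east)


def partition_count(total_bytes: int, max_partition_bytes: int = 10**9) -> int:
    """Number of partitions needed so that none exceeds the size limit."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))
```

For example, under these assumptions a 2.5GB dataset would be written as three partitions.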

Modeling


Our system is composed of four models. We use Tile2Vec to embed geo-rasters and a linear operator estimator to embed high-dimensional time series. Both aim to learn a low-dimensional representation of the data that preserves certain geometric properties, such as the triangle inequality. We fit an ordinal regression to learn the relative frequency of biodiversity across a regular lattice of features: we convert positive examples into ranked lists generated from nearest-neighbor labels in feature space and fit a learning-to-rank model. Finally, we learn a generative model of the data to generate biodiversity rasters and images using priors from the ordinal regression.
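To make the ranked-list construction concrete, here is a minimal sketch of one way it could work: for a query point in feature space, take the labels of its k nearest positive observations and rank species by how often they occur. The function name `ranked_species`, the Euclidean distance, the value of `k`, and the tie-breaking rule are assumptions for illustration; the resulting lists would serve only as training targets for a separate learning-to-rank model, which is not shown.

```python
import math
from collections import Counter


def ranked_species(query, features, labels, k=3):
    """Rank species by frequency among the k nearest neighbors of `query`.

    features: feature vectors, one per positive observation
    labels:   species label of each observation (same order as features)
    Returns species from most to least frequent; ties are broken by label
    so the output is deterministic.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    # Sort observation indices by Euclidean distance to the query.
    by_distance = sorted(
        range(len(features)),
        key=lambda i: (math.dist(query, features[i]), i),
    )
    counts = Counter(labels[i] for i in by_distance[:k])
    return sorted(counts, key=lambda species: (-counts[species], species))


# Usage with toy 2-D features and made-up species names.
features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["oak", "oak", "fern", "pine", "pine"]
ranking = ranked_species((0, 0), features, labels, k=3)
```

A brute-force search like this is only practical for small development sets; at full scale an approximate nearest-neighbor index would likely be needed.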

Our baseline is a species model derived from geolocation and date only. Through an ablation study, we measure the improvement over this baseline from adding the learned geo-raster and time-series embeddings.
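The ablation study could be organized by enumerating every combination of optional embeddings on top of the baseline features, as in the sketch below. The feature-group names and the `ablation_configs` helper are hypothetical; training and scoring each configuration is left out.

```python
from itertools import combinations

# Hypothetical feature-group names for this sketch.
BASELINE_FEATURES = ("geolocation", "date")
OPTIONAL_EMBEDDINGS = ("geo_embedding", "time_series_embedding")


def ablation_configs(baseline=BASELINE_FEATURES, optional=OPTIONAL_EMBEDDINGS):
    """List feature sets from the bare baseline up to all embeddings added."""
    configs = []
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            configs.append(baseline + extra)
    return configs


# Usage: four configurations, starting with the baseline alone.
configs = ablation_configs()
```

Comparing the score of each configuration against the first entry isolates the contribution of each embedding.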

End-to-end Task

We submit the results of our system to the competition leaderboard, aiming for first place. We expect significant improvements from the baseline models to the more complex models. In addition to the leaderboard submission, we generate detailed rasters and images of various species for visualization.

Timeline

Official schedule, from the LifeCLEF 2024 page (ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF):

  • Jan 2024: registration opens for all LifeCLEF challenges
  • Jan-March 2024: training and test data release
  • 6 May 2024: deadline for submission of runs by participants
  • 13 May 2024: release of processed results by the task organizers
  • 31 May 2024: deadline for submission of working note papers by participants [CEUR-WS proceedings]
  • 24 June 2024: notification of acceptance of participant's working note papers [CEUR-WS proceedings]
  • 8 July 2024: camera ready copy of participant's working note papers and extended lab overviews by organizers
  • 9-12 Sept 2024: CLEF 2024 Grenoble - France

| Date | Week | Task/Topic | Deliverable/Events |
|---|---|---|---|
| 2024-01-08 | 1 | Engineering - Download training and testing dataset from 2023/2024 | Competition start |
| 2024-01-15 | 2 | Exploratory Data Analysis | |
| 2024-01-22 | 3 | Engineering - Schema and Parquet | |
| 2024-01-29 | 4 | Engineering - Schema and Parquet | Parquet datasets in GCS, dev set of data (<1GB single partition) available for exploratory modeling |
| 2024-02-05 | 5 | Modeling - Learning to Rank | |
| 2024-02-12 | 6 | Modeling - Gaussian Mixture Models and Stochastic Variational Inference | |
| 2024-02-19 | 7 | Modeling - Tile2Vec | |
| 2024-02-26 | 8 | Modeling - Tile2Vec | |
| 2024-03-04 | 9 | Modeling - Tile2Vec | |
| 2024-03-11 | 10 | Modeling - Koopman Operator, SVD, Dynamic Mode Decomposition | Working notes of dataset and model description |
| 2024-03-18 | 11 | Spring Break | |
| 2024-03-25 | 12 | Engineering - Embedding cache, indexing and search; Modeling - Ordinal regression | |
| 2024-04-01 | 13 | Engineering - Model pipeline | First submission to the competition, screenshot of leaderboard |
| 2024-04-08 | 14 | Engineering - Model pipeline | |
| 2024-04-15 | 15 | Ablation Study, Hyperparameter Tuning | |
| 2024-04-22 | 16 | Ablation Study, Hyperparameter Tuning | |
| 2024-04-29 | 17 | Finals, Working notes | Submission deadline for competition, first draft of working notes, screenshot of leaderboard, Parquet dataset in GCS |
| 2024-05-06 | 18 | Summer, Working notes revision | |

Infrastructure

Code is hosted on GitHub at https://github.com/dsgt-kaggle-clef/geolifeclef-2024. Cloud compute and storage are on Google Cloud Platform with a personal billing account.

Collaboration and Supervision

This project stems from collaboration within the Data Science at Georgia Tech (DS@GT) student group. Prior submissions from the DS@GT team to the CLEF conference have won $5,000 worth of prizes across two best working note competitions.

As the DS@GT GeoLifeCLEF 2024 team lead, I will collaborate with two fellow OMSCS students. The time-commitment estimate (3 credit hours) covers the independent work that I carry out within the team's shared responsibilities.

The supervising faculty member for the project is responsible for administration, such as registration and grading, with no expectation to advise the research process (although pointers are greatly appreciated). Grading will be based on an article for publication, in a state ready for early review.

References

Botella, C., Deneu, B., Marcos, D., Servajean, M., Estopinan, J., Larcher, T., ... & Joly, A. (2023). The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe. arXiv preprint arXiv:2308.05121. https://arxiv.org/abs/2308.05121

Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., & Ermon, S. (2018). Tile2Vec: Unsupervised representation learning for spatially distributed data. arXiv preprint arXiv:1805.02855. https://arxiv.org/abs/1805.02855

Brunton, S. L., & Kutz, J. N. (2019). Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press. https://www.cambridge.org/core/books/datadriven-science-and-engineering/77D52B171B60A496EAFE4DB662ADC36E

Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (pp. 129-136). https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research. https://jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf