Research Proposal Example: GeoLifeCLEF 2024
CS8903: Special Problems Project Proposal
Name: Anthony Miyaguchi <acmiyaguchi@gatech.edu>
Student ID: amiyaguchi3 Date: 2023-11-08
- Main Proposal Idea: Lead a DS@GT team on the GeoLifeCLEF 2024 challenge and submit a working note paper at the CLEF 2024 conference.
- The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe
- SPECIAL PROBLEMS (8903) PERMIT
Spatio-temporal species distribution estimation for GeoLifeCLEF 2024 with unsupervised representation learning of remote sensing data
Objective
The objective of this special problems project is to compete in the GeoLifeCLEF 2024 challenge and publish a working notes paper at the CLEF 2024 conference detailing the implemented system, in collaboration with student peers at Georgia Tech. The scope of work is estimated at 3 credit hours, or 150 hours of work, by the primary author.
Background and Motivation
CLEF (Conference and Labs of the Evaluation Forum, formerly the Cross-Language Evaluation Forum) is an information retrieval conference with a heavy emphasis on experimentation through shared tasks. GeoLifeCLEF is a challenge hosted by the LifeCLEF lab within CLEF.
GeoLifeCLEF combines five million heterogeneous presence-only records and six thousand exhaustive presence-absence surveys collected from 2017 to 2021. Models are trained with environmental data such as 10-meter resolution RGB and near-infrared satellite images and climatic variables.
Data Preprocessing
We transform domain-specific geospatial rasters (GeoTIFF) into a format optimized for distributed, parallel data access patterns (Parquet). We convert an area of interest (AOI) into a regular lattice of square tiles and, for each tile, store the relevant features cropped to that tile's bounding box. We store all data in a Parquet dataset so that it can be loaded in bulk into Spark or Torch.
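The tiling step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Tile` class, the function names, and the assumption of a north-up raster stored as a list of rows with a known top-left origin and square pixels are all hypothetical. Reading GeoTIFF and writing Parquet are deliberately left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """One square cell of the lattice (hypothetical structure)."""
    row: int
    col: int
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def tile_lattice(min_x, min_y, max_x, max_y, tile_size):
    """Cover the AOI bounding box with a regular lattice of square tiles."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    n_cols = math.ceil((max_x - min_x) / tile_size)
    n_rows = math.ceil((max_y - min_y) / tile_size)
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * tile_size
            y0 = min_y + row * tile_size
            tiles.append(Tile(row, col, x0, y0, x0 + tile_size, y0 + tile_size))
    return tiles


def _clamp(value, low, high):
    return max(low, min(high, value))


def crop_to_tile(raster, origin_x, origin_y, pixel_size, tile):
    """Crop a 2D raster to a tile's bounding box.

    Assumes raster[0] is the northern-most row and (origin_x, origin_y)
    is the coordinate of the raster's top-left corner.
    """
    n_rows = len(raster)
    n_cols = len(raster[0]) if raster else 0
    col_start = _clamp(round((tile.min_x - origin_x) / pixel_size), 0, n_cols)
    col_stop = _clamp(round((tile.max_x - origin_x) / pixel_size), 0, n_cols)
    row_start = _clamp(round((origin_y - tile.max_y) / pixel_size), 0, n_rows)
    row_stop = _clamp(round((origin_y - tile.min_y) / pixel_size), 0, n_rows)
    return [r[col_start:col_stop] for r in raster[row_start:row_stop]]
```

Each cropped tile, together with its row, column, and bounds, could then become one record in the Parquet dataset, which keeps records independent and easy to partition.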
We create two development datasets with a maximum partition size of 1 GB. The first is a subset of the data that covers a small geographic area encompassing a city, a forest, and a mountain. The second is a label dataset that contains the minimum features for density estimation, e.g., latitude, longitude, date, and a positive indicator of species presence.
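As a rough sketch of the two development datasets, the snippet below reduces wide occurrence rows to the minimal label columns, filters them to a small bounding box, and estimates how many partitions are needed to stay under a size cap. The input column names (`lat`, `lon`, `date`, `species_id`), the `LabelRecord` class, and the choice of 10^9 bytes for 1 GB are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabelRecord:
    """Minimal label columns for density estimation (hypothetical schema)."""
    latitude: float
    longitude: float
    observed_on: date
    species_id: int
    present: bool


def to_label_record(row):
    """Keep only the minimal columns from a wider occurrence row (a dict)."""
    return LabelRecord(
        latitude=float(row["lat"]),
        longitude=float(row["lon"]),
        observed_on=date.fromisoformat(row["date"]),
        species_id=int(row["species_id"]),
        present=True,  # presence-only rows are positive examples
    )


def in_bounding_box(record, south, west, north, east):
    """True when the record falls inside the development area."""
    return south <= record.latitude <= north and west <= record.longitude <= east


def partition_count(total_bytes, max_partition_bytes=1_000_000_000):
    """Smallest number of partitions that keeps each one under the cap."""
    return max(1, -(-total_bytes // max_partition_bytes))
```

A development subset is then a simple filter, for example `[r for r in records if in_bounding_box(r, south, west, north, east)]` with bounds chosen around the area of interest.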
Modeling
Our system is composed of four models. We use Tile2Vec to embed geo-rasters and a linear operator estimator to embed high-dimensional time series. These models aim to learn a low-dimensional representation of the data that preserves certain geometrical properties like the triangle inequality. We fit an ordinal regression to learn the relative frequency of biodiversity across a regular lattice of features. We do this by converting positive examples into ranked lists generated by nearest neighbor labels in feature space and fitting a learning-to-rank model. Finally, we learn a generative model of the data to generate biodiversity rasters and images using priors from the ordinal regression.
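One possible reading of the ranked-list construction is sketched below: for a query point in feature space, take the k nearest labeled positive examples and rank species by how often they appear among those neighbors. The function name, the Euclidean metric, the frequency-based ordering, and the alphabetical tie-break are assumptions for illustration; the proposal does not fix these details.

```python
import math
from collections import Counter


def ranked_species(query, embeddings, labels, k):
    """Rank species by frequency among the k nearest labeled neighbors.

    `embeddings` is a list of feature vectors for positive examples and
    `labels` holds the species observed at each one (same order).
    """
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")
    # Indices of labeled points, nearest first (Euclidean distance).
    order = sorted(
        range(len(embeddings)),
        key=lambda i: math.dist(query, embeddings[i]),
    )
    counts = Counter(labels[i] for i in order[:k])
    # Most frequent species first; ties broken by label for determinism.
    return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

Lists produced this way could serve as training targets for a listwise learning-to-rank model, with the position in the list acting as the ordinal label.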
Our baseline model is a species model derived from geolocation and date. We measure improvement over the baseline by adding learned geo and time series embeddings through an ablation study.
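The ablation grid implied above can be enumerated mechanically. The sketch below lists every combination of the optional embedding groups on top of the baseline features and computes score differences against the baseline. The feature-group names and both function names are hypothetical, and the scores are whatever metric the evaluation produces; none are supplied here.

```python
from itertools import combinations

# Hypothetical feature-group names, chosen for illustration.
BASELINE = ("geolocation", "date")
OPTIONAL = ("geo_embedding", "time_series_embedding")


def ablation_configs(baseline=BASELINE, optional=OPTIONAL):
    """Baseline alone, then the baseline plus every subset of optional groups."""
    configs = []
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            configs.append(tuple(baseline) + extra)
    return configs


def improvement_over_baseline(scores, baseline=BASELINE):
    """Difference between each configuration's score and the baseline score.

    `scores` maps a configuration tuple to its evaluation metric.
    """
    base = scores[tuple(baseline)]
    return {config: score - base for config, score in scores.items()}
```

With two optional groups this yields four configurations, so each embedding's contribution can be read off both alone and in combination with the other.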
End-to-end Task
We submit the results of our system, intending to reach first place on the leaderboard. We expect to see significant improvements from the baseline models to the more complex models. In addition to submitting to the leaderboard, we generate detailed rasters/images of various species for visualization.
Timeline
Key dates from the LifeCLEF 2024 page (ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF):
- Jan 2024: registration opens for all LifeCLEF challenges
- Jan-March 2024: training and test data release
- 6 May 2024: deadline for submission of runs by participants
- 13 May 2024: release of processed results by the task organizers
- 31 May 2024: deadline for submission of working note papers by participants [CEUR-WS proceedings]
- 24 June 2024: notification of acceptance of participants' working note papers [CEUR-WS proceedings]
- 8 July 2024: camera ready copy of participants' working note papers and extended lab overviews by organizers
- 9-12 Sept 2024: CLEF 2024 Grenoble - France
Date | Week | Task/Topic | Deliverable/Events |
---|---|---|---|
2024-01-08 | 1 | Engineering - Download training and testing dataset from 2023/2024 | Competition start |
2024-01-15 | 2 | Exploratory Data Analysis | |
2024-01-22 | 3 | Engineering - Schema and Parquet | |
2024-01-29 | 4 | Engineering - Schema and Parquet | Parquet datasets in GCS, dev set of data (\<1GB single partition) available for exploratory modeling |
2024-02-05 | 5 | Modeling - Learning to Rank | |
2024-02-12 | 6 | Modeling - Gaussian Mixture Models and Stochastic Variational Inference | |
2024-02-19 | 7 | Modeling - Tile2Vec | |
2024-02-26 | 8 | Modeling - Tile2Vec | |
2024-03-04 | 9 | Modeling - Tile2Vec | |
2024-03-11 | 10 | Modeling - Koopman Operator, SVD, Dynamic Mode Decomposition | Working notes of dataset and model description |
2024-03-18 | 11 | Spring Break | |
2024-03-25 | 12 | Engineering - Embedding cache, indexing and search; Modeling - Ordinal regression | |
2024-04-01 | 13 | Engineering - Model pipeline | First submission to the competition, screenshot of leaderboard |
2024-04-08 | 14 | Engineering - Model pipeline | |
2024-04-15 | 15 | Ablation Study, Hyperparameter Tuning | |
2024-04-22 | 16 | Ablation Study, Hyperparameter Tuning | |
2024-04-29 | 17 | Finals, Working notes | Submission deadline for competition, first draft of working notes, screenshot of leaderboard, parquet dataset in GCS |
2024-05-06 | 18 | Summer, Working notes revision | |
Infrastructure
Code is hosted on GitHub at https://github.com/dsgt-kaggle-clef/geolifeclef-2024. Cloud compute and storage are on Google Cloud Platform under a personal billing account.
Collaboration and Supervision
This project stems from collaboration within the Data Science at Georgia Tech (DS@GT) student group. Prior submissions from the DS@GT team to the CLEF conference have won $5,000 worth of prizes across two best working note competitions.
As the DS@GT GeoLifeCLEF 2024 team lead, I will collaborate with two fellow OMSCS students. The time-commitment estimate (3 credit hours) covers the independent work that I carry out within the team's shared responsibilities.
The supervising faculty member for the project is responsible for administration, such as registration and grading, with no expectation to advise the research process (although pointers are greatly appreciated). The supervisor will grade based on an article for publication that is in a state ready for early review.
References
Botella, C., Deneu, B., Marcos, D., Servajean, M., Estopinan, J., Larcher, T., ... & Joly, A. (2023). The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe. arXiv preprint arXiv:2308.05121. https://arxiv.org/abs/2308.05121
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., & Ermon, S. (2018). Tile2Vec: Unsupervised representation learning for spatially distributed data. arXiv preprint arXiv:1805.02855. https://arxiv.org/abs/1805.02855
Brunton, S. L., & Kutz, J. N. (2019). Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press. https://www.cambridge.org/core/books/datadriven-science-and-engineering/77D52B171B60A496EAFE4DB662ADC36E
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (pp. 129-136). https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research. https://jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf