View Source Code
Browse the complete example on GitHub
What's inside?
In this example, you will learn how to:
- Fine-tune a vision-language model using leap-finetune, Liquid AI's fine-tuning framework for LFM models
- Run data preparation and training in the cloud using Modal, without managing GPU infrastructure locally
- Prepare satellite imagery data from VRSBench, a NeurIPS 2024 dataset with three task types:
- VQA: answer questions about satellite images (123K QA pairs)
- Visual Grounding: detect and localize objects with bounding boxes (52K references)
- Captioning: generate detailed descriptions of satellite scenes (29K captions)
This workflow runs the heavy computation in the cloud. The local machine only launches jobs and streams logs, so no local GPU is required.
Environment setup
You will need:
- uv to manage Python dependencies
- Modal for serverless CPU and GPU jobs
- A Hugging Face account to download VRSBench
- Access to the leap-finetune repository
How to run it?
Start by cloning the cookbook repository and opening the example directory:
1. Install and authenticate
Modal provides serverless GPUs billed per second. New accounts include free credits, enough to run this example end to end.
2. Prepare the data
Download and convert VRSBench inside a Modal container. The converted data is pushed to a Modal volume where the fine-tuning job will read it.
3. Clone leap-finetune and start training
Run the training job on an H100. Checkpoints are saved to the satellite-vlm Modal volume.
At this point, Modal streams the training logs back to your terminal while the job runs remotely.
How it works
All heavy computation runs in the cloud. The only things that run locally are the prepare_vrsbench.py launcher and the leap-finetune CLI.
- Data prep (Modal CPU container): prepare_vrsbench.py --modal downloads VRSBench (~12 GB) from Hugging Face, converts it to JSONL, and writes images and annotations to a Modal volume named satellite-vlm.
- Training (Modal H100): leap-finetune submits a training job that reads data from the same volume, fine-tunes the model, and saves checkpoints back to the volume.
- Retrieval (local): you pull checkpoints from the volume to your machine with modal volume get.
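The retrieval step can be sketched as a small helper that assembles the Modal CLI calls. This is only an illustration: `modal volume ls` and `modal volume get` are real Modal CLI subcommands, but the function names, the in-volume path `outputs`, and the local destination `./checkpoints` are assumptions, not taken from the example repository.

```python
import shlex

VOLUME = "satellite-vlm"  # volume name used throughout this example


def ls_command(volume: str, path: str = "/") -> list[str]:
    """Build the argv that lists files in a Modal volume."""
    return ["modal", "volume", "ls", volume, path]


def get_command(volume: str, remote_path: str, local_dest: str) -> list[str]:
    """Build the argv that downloads a file or directory from a Modal volume."""
    return ["modal", "volume", "get", volume, remote_path, local_dest]


if __name__ == "__main__":
    # "outputs" and "./checkpoints" are assumed paths; adjust them to your run.
    commands = (
        ls_command(VOLUME, "outputs"),
        get_command(VOLUME, "outputs", "./checkpoints"),
    )
    for argv in commands:
        # Print the shell-quoted command instead of running it.
        print(shlex.join(argv))
```

To execute a command rather than print it, pass the list to `subprocess.run(argv, check=True)` on a machine where the Modal CLI is installed and authenticated.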
Data preparation
prepare_vrsbench.py downloads VRSBench from Hugging Face and converts it to the JSONL format required by leap-finetune.
- Modal (recommended): run entirely in the cloud and write directly to the satellite-vlm Modal volume.
- Local development: run on your machine and write to ./data/.
The script writes two files per task, to ./data/ locally or to the Modal volume with --modal:
- vrsbench_{task}_train.jsonl: training data
- vrsbench_{task}_eval.jsonl: evaluation data
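A minimal sketch of what the conversion step might look like for the grounding task: pixel-space boxes are scaled to 0-1 coordinates and one JSON object is written per line. The function names and the record keys (`image`, `question`, `bbox`) are hypothetical; the actual schema expected by leap-finetune is defined by prepare_vrsbench.py, not by this sketch.

```python
import json
from pathlib import Path


def normalize_box(box, width, height):
    """Scale a pixel-space (x1, y1, x2, y2) box to 0-1 coordinates, clamped to [0, 1]."""
    x1, y1, x2, y2 = box
    scaled = (x1 / width, y1 / height, x2 / width, y2 / height)
    return [round(min(max(v, 0.0), 1.0), 3) for v in scaled]


def output_path(data_dir, task, split):
    """Follow the vrsbench_{task}_{split}.jsonl naming used by this example."""
    return Path(data_dir) / f"vrsbench_{task}_{split}.jsonl"


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical grounding record; the key names are illustrative only.
example_record = {
    "image": "images/example.png",
    "question": "Locate the large ship near the dock.",
    "bbox": normalize_box((50, 100, 150, 300), width=200, height=400),
}
```

Calling `write_jsonl([example_record], output_path("data", "grounding", "train"))` would produce `data/vrsbench_grounding_train.jsonl` with a single line.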
Training
Run from the leap-finetune root, cloned during the quickstart:
Checkpoints are saved to the satellite-vlm Modal volume under /satellite-vlm/outputs/.
To enable experiment tracking, uncomment tracker: "wandb" in the config.
Retrieving checkpoints
List and download checkpoints from the Modal volume:
Data format
The grounding task uses JSON bounding box format with 0-1 normalized coordinates, matching the LFM VLM pretraining format:
Evaluation
Benchmarks run automatically during training at every eval_steps:
- VQA: short_answer metric (case-insensitive substring match)
- Grounding: grounding_iou metric (IoU@0.5 threshold)
- Captioning: CIDEr or BLEU metrics
Set the limit field in the YAML config to evaluate on a subset for faster iteration.
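To make the VQA and grounding metrics above concrete, here is a rough sketch of how they could be computed. It is not the leap-finetune implementation: the function names are hypothetical, the substring direction (reference inside prediction) is an assumption, and so is counting an IoU of exactly 0.5 as a hit.

```python
def short_answer_match(prediction: str, reference: str) -> bool:
    """Case-insensitive substring match (assumed direction: reference inside prediction)."""
    return reference.strip().lower() in prediction.strip().lower()


def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes in 0-1 coordinates."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # The overlap width and height are zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, references, threshold: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the reference meets the threshold."""
    if not references:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, a predicted box `[0, 0, 0.5, 1]` against a reference `[0.25, 0, 0.75, 1]` overlaps on a quarter of the image and covers three quarters in union, so its IoU is 1/3 and it would not count as a hit at the 0.5 threshold.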
Run a full standalone evaluation
To evaluate on the complete dataset without retraining, use configs/vrsbench_full_eval.yaml. Set eval_on_start: true, remove the limit fields, and point the config at your checkpoint path. The model runs the full evaluation at step 0, logs results to WandB, and terminates.
AI in Space Hackathon
This example is the official starting point for the AI in Space Hackathon, a fully online event organized in partnership between DPhi Space and Liquid AI, open to builders from around the globe. Use this fine-tuning pipeline as your baseline and push it further with real satellite data.
Register for the Hackathon
Join the AI in Space Hackathon and build with satellite AI.
Join our Discord
Connect with the community and ask questions about this example.