Abstract
3D computer vision offers powerful tools for agricultural analysis by enabling automated extraction of quantitative traits from point clouds of field scenes. For wheat, accurate analysis of head morphology remains challenging because high-resolution point clouds are difficult to acquire and annotating them for instance segmentation requires substantial manual effort. While 3D instance segmentation has shown promise for such tasks by explicitly modeling geometric structure, existing approaches often rely on simulated data or data obtained in highly controlled indoor setups. As a result, they struggle to achieve reliable instance coverage in real field conditions. In this work, we study 3D instance segmentation of wheat heads in real in-field point clouds and introduce WheatFormer3D, a transformer-based framework designed to improve query coverage of individual wheat heads in crowded scenes. We further propose domain-specific geometric augmentations that increase data efficiency and robustness in data-scarce agricultural settings. Extensive experiments demonstrate that the proposed approach consistently outperforms recent transformer-based baselines, including OneFormer3D and Mask3D, on wheat head instance segmentation, achieving 87.96 AP@50 and 77.99 AP overall. In addition, we investigate the use of segmentation outputs for downstream phenotyping tasks and construct a reference organ-level dataset with paired indoor and in-field wheat head scans and reference volume measurements. Using this dataset, we explore the feasibility and current limitations of learning-based volume estimation from real-world point clouds, highlighting challenges associated with noisy in-field reconstructions.