Generated on 2024-11-18 16:44:13 by PubSummarizer
The paper introduces the 360+x dataset, a pioneering multi-modal resource designed for comprehensive scene understanding from multiple perspectives, including panoramic, third-person, and egocentric views. It incorporates various data modalities such as video, multi-channel audio, directional binaural delay, location information, and textual descriptions, making it the first dataset to mimic human-like perception of environments. The authors conducted extensive benchmark analyses across five scene understanding tasks, revealing that performance improves significantly when the different viewpoints and modalities are integrated rather than used in isolation. The findings suggest that even self-supervised models trained on 360+x can outperform those trained with human annotations, underscoring the dataset's potential to advance research in scene understanding.
The paper introduces the Subspace-Constrained Tyler's Estimator (STE), a novel algorithm designed for robust subspace recovery in datasets plagued by outliers. Combining aspects of Tyler's M-estimator and fast median subspace techniques, STE effectively recovers low-dimensional subspaces even when the proportion of inliers is less than previously established theoretical thresholds. The authors validate STE through its application to Structure from Motion (SfM), focusing on robust fundamental matrix estimation and the removal of outlying cameras. Numerical experiments demonstrate STE's superior performance compared to existing methods, showcasing its potential to enhance robustness in computer vision tasks, particularly in 3D reconstruction scenarios.
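As context for how such an estimator operates, the sketch below implements the classic Tyler's M-estimator fixed-point iteration that STE builds on and reads the recovered subspace off the top eigenvectors of the resulting scatter matrix. It is a generic illustration, not the subspace-constrained iteration from the paper.

```python
import numpy as np

def tyler_m_estimator(X, n_iter=100, tol=1e-8):
    """X: (n_samples, d) centered data. Returns a (d, d) robust scatter estimate."""
    n, d = X.shape
    sigma = np.eye(d)
    for _ in range(n_iter):
        inv_sigma = np.linalg.inv(sigma)
        # Per-sample weights 1 / (x_i^T Sigma^{-1} x_i) down-weight outliers.
        w = 1.0 / np.einsum("ij,jk,ik->i", X, inv_sigma, X)
        sigma_new = (d / n) * (X * w[:, None]).T @ X
        sigma_new /= np.trace(sigma_new)          # remove the scale ambiguity
        if np.linalg.norm(sigma_new - sigma) < tol:
            return sigma_new
        sigma = sigma_new
    return sigma

def recover_subspace(X, dim):
    """Subspace spanned by the top-`dim` eigenvectors of the robust scatter."""
    eigvals, eigvecs = np.linalg.eigh(tyler_m_estimator(X))
    return eigvecs[:, np.argsort(eigvals)[::-1][:dim]]
```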
The paper presents "Alchemist," a novel method for manipulating material attributes such as roughness, metallicity, albedo, and transparency in real images using a modified text-to-image diffusion model. By addressing the scarcity of datasets with controlled material properties, the authors created a synthetic dataset featuring physically-based materials and fine-tuned a diffusion model on this data. The model allows for precise editing of material properties while preserving other image characteristics, offering alternatives to traditional rendering techniques that typically require extensive auxiliary information. Results demonstrate the model's effectiveness in editing real-world images and extend its application to Neural Radiance Fields (NeRF), showcasing its potential for various commercial applications in image editing and beyond.
This paper presents a novel N-point linear solver for line-based motion estimation using event cameras, which excel in high-speed and low-light conditions compared to traditional frame-based cameras. The authors introduce a new line parametrization that reduces the degrees of freedom from four to three, enabling a more efficient and numerically stable linear solver that can handle both minimal and overdetermined systems with more than five events. The proposed method showcases significant improvements in runtime—over 600 times faster than previous polynomial solvers—while maintaining high numerical stability and the ability to characterize degenerate cases. Additionally, a new velocity averaging scheme is introduced for efficiently fusing observations from multiple lines, enhancing the overall performance in both synthetic and real-world experiments, thereby demonstrating its suitability for modern mobile vision applications.
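The summary does not reproduce the paper's constraint matrix, but the linear-solver core of such methods is standard: stack one linear constraint per event and take the right singular vector associated with the smallest singular value. The sketch below shows only that generic step; `build_constraint_row` is a hypothetical placeholder for the paper's 3-DoF line parametrization.

```python
import numpy as np

def solve_homogeneous(A):
    """Unit vector x minimizing ||A x||: the right singular vector of the
    smallest singular value (handles both minimal and overdetermined A)."""
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

def estimate_from_events(events, build_constraint_row):
    # One row per event; with more events than unknowns this becomes a
    # least-squares fit, which is why the solver scales past the minimal case.
    A = np.stack([build_constraint_row(e) for e in events])
    return solve_homogeneous(A)
```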
This paper presents an analysis and improvement of the training dynamics of diffusion models, specifically focusing on the ADM architecture. The authors identify several issues leading to uneven training, such as uncontrolled drift in the magnitudes of network activations and weights. They propose modifications that standardize these magnitudes without altering the architecture's overall structure, resulting in enhanced performance, including a record FID score of 1.81 for ImageNet-512 synthesis. Additionally, they introduce a post-hoc method for adjusting the EMA profile after training, enabling precise tuning and revealing significant interactions between EMA settings and network configurations. The findings suggest that the improved architecture and EMA techniques can facilitate more effective training and quality control in generative image synthesis.
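To make the "standardized magnitudes" idea concrete, here is a minimal PyTorch sketch of a linear layer that re-normalizes each output unit's weight vector on every forward pass, so weight and activation magnitudes cannot drift during training. This is an illustrative simplification in the spirit of the paper's changes, not the exact EDM2 layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudePreservingLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # Each output unit's weight vector is forced to unit norm, so for
        # roughly unit-variance, uncorrelated inputs the output variance
        # stays near one regardless of how the raw parameters evolve.
        w = F.normalize(self.weight, dim=1)
        return F.linear(x, w)
```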
The paper introduces DisenDiff, a novel personalized text-to-image (T2I) model that enhances the generation of customized images by effectively capturing and disentangling multiple concepts from a single reference image. It addresses limitations of existing methods, which often compromise visual consistency and fail to separate concepts adequately. The authors propose an attention calibration mechanism that includes learnable modifiers for different concepts, along with constraints to improve attention mapping and reduce concept interference. Through extensive qualitative and quantitative evaluations, the proposed method outperforms current state-of-the-art techniques, demonstrating superior visual fidelity and editing flexibility while also being compatible with existing image enhancement frameworks like LoRA.
The paper presents a novel approach to Event Stream Super-Resolution (ESR) through the development of a Bilateral Event Mining and Complementary Network (BMCNet). This method distinguishes between positive and negative events in event streams, utilizing a two-stream architecture to process each event type individually while facilitating their interaction via a Bilateral Information Exchange (BIE) module. The BMCNet effectively captures and exchanges complementary spatial and temporal information, significantly improving performance in ESR by over 11% compared to previous state-of-the-art methods. Additionally, the proposed framework enhances downstream tasks such as object recognition and video reconstruction, demonstrating its versatility and effectiveness in processing event camera data.
This paper presents BIOCLIP, a vision foundation model that leverages a newly curated dataset, TREEOFLIFE-10M, which contains over 10 million images across 454,000 taxa, aimed at enhancing the application of computer vision in biological research and conservation. The authors argue that existing models are often tailored for specific tasks and lack the adaptability needed for general organismal biology questions. BIOCLIP employs a contrastive learning approach to learn hierarchical representations aligned with the biological taxonomy, demonstrating substantial improvements over existing models in both zero-shot and few-shot classification tasks. The results indicate that BIOCLIP not only excels in identifying known species but also generalizes effectively to unseen taxa, significantly lowering the barriers for biologists to utilize AI in their work. The paper highlights the importance of dataset diversity and the hierarchical structure of taxonomic labels in achieving strong performance in biological image classification.
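The hierarchical-label idea can be illustrated with a standard CLIP-style objective in which the text side is the flattened taxonomy string. The sketch below shows that symmetric contrastive loss over hypothetical encoder outputs; it is not the released BIOCLIP code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs,
    where each text is a flattened taxonomy string such as
    'Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corax'."""
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image->text and text->image matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```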
This paper investigates the decision-making mechanisms of visual recognition networks, specifically comparing Transformers and CNNs, using two novel methodologies: sub-explanation counting and cross-testing. The authors find that Transformers and ConvNeXt models exhibit greater compositionality, meaning they integrate multiple image parts for decisions, while traditional CNNs and distilled Transformers demonstrate disjunctive behaviors, relying on fewer parts for predictions. Key factors influencing these behaviors include the type of normalization used, with batch normalization leading to less compositionality compared to layer and group normalization. Additionally, cross-testing reveals that different network architectures utilize distinct visual features for classification, providing insights into their decision-making processes and suggesting directions for future model design.
This paper presents the correlation-aware multi-layer perceptron (CorrMLP), a novel approach for deformable medical image registration that aims to address the limitations of traditional transformers and convolutional neural networks (CNNs). While transformers have been effective in capturing long-range dependencies, their high computational demands restrict their application at full image resolutions, hampering fine-grained registration. In contrast, the CorrMLP utilizes a correlation-aware multi-window MLP block within a coarse-to-fine architecture, enabling efficient processing at full resolution and capturing local correlations vital for accurate registration. Extensive experiments on various medical datasets demonstrate that CorrMLP surpasses state-of-the-art methods in registration accuracy and transformation smoothness, highlighting the potential of MLPs in medical image registration tasks.
The paper presents CroSel, a novel approach to partial-label learning (PLL) that tackles the challenge of label ambiguity by selecting confident pseudo labels from a candidate set. CroSel utilizes a cross-selection strategy where two deep models exchange and refine their label predictions based on historical outputs, aiming to accurately identify true labels amidst noise. Additionally, it introduces a consistency regularization term called co-mix to mitigate sample waste and improve label selection accuracy. Empirical results demonstrate CroSel's effectiveness, achieving state-of-the-art performance on benchmark datasets, highlighting its ability to maintain high precision in label selection even under varying noise conditions.
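A simplified sketch of the cross-selection step is shown below: one model's recent predictions are restricted to the candidate label set, and a pseudo label is kept only when those predictions are confident and stable, to be used as supervision for the other model. Thresholds, the exact selection criteria, and the co-mix regularizer are simplified, and the names are illustrative rather than taken from the paper's code.

```python
import torch

def select_pseudo_labels(history_probs, candidate_mask, conf_thresh=0.9):
    """
    history_probs: (T, N, C) softmax outputs of one model over the last T epochs.
    candidate_mask: (N, C) binary mask of each sample's candidate label set.
    Returns (selected_idx, pseudo_labels) used to supervise the other model.
    """
    mean_probs = history_probs.mean(dim=0) * candidate_mask     # restrict to candidates
    conf, pred = mean_probs.max(dim=1)
    stable = (history_probs.argmax(dim=2) == pred).all(dim=0)   # same argmax every epoch
    keep = stable & (conf > conf_thresh)
    return keep.nonzero(as_tuple=True)[0], pred[keep]
```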
The paper introduces DART (Doppler-Aided Radar Tomography), a novel approach for synthesizing radar range-Doppler images using a data-driven, Neural Radiance Field-inspired method. By incorporating radar-specific physics into an implicit rendering pipeline, DART enables the synthesis of accurate radar images from various viewpoints without explicit scene modeling. The authors constructed a custom data collection platform and a novel radar dataset to validate DART's efficacy against existing methods, demonstrating that it consistently outperforms state-of-the-art techniques in generating high-quality tomographic images. The method leverages the Doppler effect to enhance the resolution of radar measurements and presents a framework for realistic radar simulations that could significantly benefit applications in localization, mapping, and recognition.
This paper introduces a novel evaluation metric for image downscaling algorithms, called Image Downscaling Assessment by Rate-Distortion (IDA-RD), which quantifies the distortion incurred during the downscaling process by leveraging rate-distortion theory. Unlike traditional image-based quality measures, IDA-RD employs a process-based approach that views downscaling and super-resolution as encoding and decoding operations, respectively. The authors demonstrate that effective downscaling algorithms preserve more detail, leading to less distortion when images are upscaled. They address the challenges of measuring distortion through the use of recent advancements in deep generative models, specifically Generative Adversarial Networks (GANs) and Normalizing Flows, enabling the evaluation of downscaled images without requiring ground truth low-resolution images. Extensive experiments validate the effectiveness of IDA-RD across various synthetic and real-world downscaling methods, highlighting its potential to fill a significant gap in image downscaling research.
This paper introduces Set Difference Captioning, a novel task aimed at automatically describing the differences between two sets of images, and proposes VisDiff, a method for this task. VisDiff is a two-stage approach involving a proposer that generates candidate descriptions from the image sets and a ranker that evaluates these descriptions for salience. The authors present VisDiffBench, a benchmark dataset with 187 paired image sets to evaluate the method's performance. The results demonstrate VisDiff's effectiveness in identifying nuanced differences across various domains, such as model comparisons and dataset analysis, underscoring its potential as a tool for generating human-interpretable insights in computer vision and machine learning applications.
This paper presents a novel method called Diffusion-FOF for reconstructing 3D models of clothed humans from single-view images, addressing challenges such as varying body shapes, poses, and detailed textures. The method involves predicting a back-view image using a style consistency constraint, extracting multi-scale features, and employing a diffusion-based Fourier occupancy field (FOF) model in the wavelet domain to enhance geometric accuracy. The approach effectively integrates information from both the reference and estimated back-view images, culminating in the generation of a textured human model. Experimental results demonstrate that this method surpasses existing state-of-the-art techniques in both geometric and texture reconstruction performance.
The paper introduces DiffusionLight, a novel technique for estimating lighting from a single input image by inpainting a chrome ball using a pre-trained diffusion model (Stable Diffusion XL). Traditional methods often rely on HDR panorama datasets, which limit their effectiveness in real-world scenarios due to dataset diversity constraints. In contrast, this approach leverages the extensive training of diffusion models on billions of images, enhancing light estimation in uncontrolled environments. Key innovations include an iterative inpainting algorithm to ensure high-quality chrome ball generation and a LoRA fine-tuning technique for exposure bracketing, allowing the production of HDR chrome balls. The method demonstrates superior performance against existing techniques across various benchmarks and generalizes well to in-the-wild images, revealing significant advantages in lighting estimation tasks.
This paper presents a novel approach to the inverse rendering problem, which aims to recover an object's material properties and the surrounding illumination using unintended shadows cast by unobserved occluders, such as the camera operator. The authors utilize differentiable Monte Carlo ray tracing to jointly estimate spatially-varying materials, environment illumination, and the shapes of occluders that inadvertently cast shadows. By leveraging these shadows as additional signals, the method improves the conditioning of the inverse rendering problem, enabling more accurate recovery of high-frequency illumination and material details, even in challenging scenarios with diffuse materials. The effectiveness of the approach is demonstrated through experiments on both synthetic and real-world captured data, indicating its potential for enhancing the quality of material and lighting estimations in realistic imaging conditions.
The paper introduces Ego-Exo4D, a large-scale, multimodal, and multiview video dataset designed to enhance the understanding of skilled human activities from both egocentric (first-person) and exocentric (third-person) perspectives. Captured from 740 participants across 13 cities, the dataset includes 1,286 hours of video featuring various activities like sports, music, and cooking, complemented by extensive annotations such as audio, eye gaze, and 3D point clouds. It aims to facilitate research in areas like skill learning, proficiency estimation, and cross-view translation through a set of benchmark tasks. The open-sourced resources are intended to foster advancements in AI's comprehension of human skills and promote novel applications in domains such as augmented reality and robotics.
The paper presents EgoGen, a novel synthetic data generation system designed for egocentric perception tasks, particularly in augmented reality applications. EgoGen addresses the challenge of simulating natural human movements from the perspective of head-mounted devices by utilizing a generative human motion synthesis model that incorporates egocentric visual inputs. This model employs collision-avoiding motion primitives and a two-stage reinforcement learning approach to create realistic and diverse human motions in dynamic environments. The system generates high-quality synthetic data with accurate ground truth annotations, enhancing performance in key tasks such as mapping, localization, camera tracking, and human mesh recovery from egocentric views. By providing a scalable and effective solution for creating egocentric training data, EgoGen aims to advance research in egocentric computer vision.
This paper presents EGTR (Extracting Graph from Transformer), a lightweight one-stage model for Scene Graph Generation (SGG) that efficiently extracts relationships between objects from the self-attention layers of the DETR (DEtection TRansformer) decoder. Unlike traditional two-stage models, EGTR leverages the inherent relationships learned during object detection, utilizing a novel adaptive smoothing technique to enhance multi-task learning for both object detection and relation extraction. Additionally, it introduces a connectivity prediction task to aid relation prediction. Experimental results on the Visual Genome and Open Images V6 datasets demonstrate that EGTR achieves superior object detection performance and comparable triplet detection accuracy while maintaining reduced model complexity and faster inference speeds.
The paper presents EscherNet, a novel multi-view conditioned diffusion model that facilitates scalable view synthesis by generating consistent target views from arbitrary camera poses based on a flexible number of reference views. EscherNet employs a unique camera positional encoding (CaPE) to enhance camera control and ensure consistency across generated views. Demonstrating remarkable scalability, it can produce over 100 target views simultaneously on a consumer-grade GPU while achieving state-of-the-art performance compared to existing models. By decoupling from scene-specific optimizations and enabling zero-shot novel view synthesis, EscherNet unifies single and multi-image 3D reconstruction tasks, paving the way for advancements in 3D vision architectures.
The paper presents EvDiG, a novel method for separating direct and global illumination components in images using a hybrid system of RGB and event cameras. By leveraging the high temporal resolution of event cameras, the proposed approach efficiently captures rapid illumination changes caused by moving shadows, significantly reducing data acquisition time. The method employs a two-stage neural network, EvSepNet, to refine coarse separation results and restore color information, addressing challenges such as noise and color loss inherent in event data. Experimental results demonstrate that EvDiG outperforms state-of-the-art methods in both indoor and outdoor scenes, achieving high-quality separation comparable to multi-frame techniques while maintaining a capture time equivalent to single-frame methods.
The paper introduces EventPS, a novel approach to real-time photometric stereo using event cameras, which significantly enhances data efficiency and speed compared to traditional frame-based methods. By leveraging the high temporal resolution and low bandwidth characteristics of event cameras, EventPS estimates surface normals through radiance changes induced by a continuously rotating light source. This method offers a robust solution for both Lambertian and non-Lambertian surfaces by integrating optimization-based and deep-learning techniques. Experimental results demonstrate that EventPS operates at over 30 frames per second while requiring only about 31% of the bandwidth of frame-based counterparts, making it suitable for high-speed and time-sensitive applications.
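For background, classical Lambertian photometric stereo recovers normals by least squares from intensities observed under known light directions; EventPS replaces those intensity measurements with event-triggered radiance changes under a rotating light, which is not reproduced in this generic sketch.

```python
import numpy as np

def lambertian_normals(intensities, light_dirs):
    """
    intensities: (M, P) pixel intensities for M known lights and P pixels.
    light_dirs:  (M, 3) unit light directions.
    Returns (P, 3) unit surface normals (Lambertian model I = L n).
    """
    # Least-squares solve L @ N = I for the albedo-scaled normals N of shape (3, P).
    N, _, _, _ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    normals = N / (np.linalg.norm(N, axis=0, keepdims=True) + 1e-12)
    return normals.T
```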
This paper investigates the visual shortcomings of multimodal large language models (LLMs), specifically focusing on their reliance on the CLIP model for visual understanding. The authors identify systematic failures in visual question answering capabilities across various state-of-the-art LLMs, including GPT-4V, using a newly constructed benchmark called the Multimodal Visual Patterns (MMVP). The study reveals that these models struggle with basic visual details, often performing worse than random guessing. The authors propose a Mixture of Features (MoF) approach that integrates vision-centric representations to enhance visual grounding abilities, demonstrating that improved visual understanding can be achieved without sacrificing instruction-following capabilities. The findings underscore the importance of developing more robust visual representation learning methods and suggest that current scaling efforts are insufficient to address fundamental limitations in visual perception among LLMs.
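A minimal sketch of one way to realize a mixture of features is shown below: tokens from a CLIP-style encoder and a vision-only encoder (e.g. a DINOv2-style model) are projected to the language model's width and interleaved along the sequence dimension. Dimensions, encoders, and the module name are placeholders rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    def __init__(self, clip_dim, dino_dim, llm_dim):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_dino = nn.Linear(dino_dim, llm_dim)

    def forward(self, clip_tokens, dino_tokens):
        """clip_tokens: (B, N, clip_dim), dino_tokens: (B, N, dino_dim)."""
        a = self.proj_clip(clip_tokens)
        b = self.proj_dino(dino_tokens)
        # Interleave token-wise: [a1, b1, a2, b2, ...] -> (B, 2N, llm_dim)
        return torch.stack([a, b], dim=2).flatten(1, 2)
```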
The paper introduces FineParser, an advanced framework for fine-grained spatio-temporal action parsing aimed at enhancing action quality assessment (AQA) in human-centric contexts, particularly in sports like diving. Traditional AQA methods struggle with credibility and interpretability due to their coarse understanding of actions, which FineParser addresses by integrating four key components: a spatial action parser, a temporal action parser, a static visual encoder, and fine-grained contrastive regression. This model focuses on human-centric foreground action representations, allowing for precise assessments by minimizing the relevance of distracting backgrounds. The authors also present the FineDiving-HM dataset, featuring detailed human-centric action masks to foster improved evaluation of action quality. Extensive experiments demonstrate that FineParser significantly outperforms existing methods, showcasing its potential as a baseline for future AQA tasks requiring fine-grained action understanding.
The paper presents Florence-2, an advanced vision foundation model designed for diverse computer vision and vision-language tasks through a unified, prompt-based representation. Unlike existing models, Florence-2 excels in performing various tasks with simple textual instructions by leveraging a large-scale dataset, FLD-5B, which comprises 5.4 billion annotations across 126 million images. This dataset was generated through an innovative iterative strategy combining automated image annotation and model refinement. The model utilizes a sequence-to-sequence architecture, allowing it to address complex tasks involving spatial hierarchy and semantic granularity without requiring task-specific modifications. Extensive evaluations demonstrate Florence-2's strong performance, achieving state-of-the-art results in zero-shot settings and after fine-tuning, showcasing its effectiveness as a versatile foundation model in the realm of artificial intelligence.
This paper presents FlowIE, a novel image enhancement framework that leverages conditioned rectified flow to efficiently enhance images affected by various degradations. Traditional methods often struggle with robustness and speed, especially under complex real-world conditions, whereas FlowIE constructs a many-to-one transport mapping that significantly reduces inference time—up to tenfold compared to existing diffusion-based approaches. By accurately estimating straight-line paths from low-quality to high-quality images, FlowIE effectively utilizes the rich generative knowledge from pre-trained diffusion models. Additionally, the introduction of mean value sampling enhances path estimation accuracy, leading to high-quality enhancement across tasks such as blind face restoration and super-resolution. Extensive experiments demonstrate FlowIE's competitive performance and versatility, establishing it as a promising solution for image restoration challenges.
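The few-step behavior of a rectified-flow formulation can be sketched as follows: training pairs lie on the straight interpolation x_t = (1 - t)·x0 + t·x1 with regression target x1 - x0, and inference integrates the predicted velocity with a handful of Euler steps. `velocity_net` is a hypothetical stand-in for FlowIE's conditioned network.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, x0, num_steps=4):
    """x0: degraded input image tensor; returns the enhanced estimate."""
    x = x0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = velocity_net(x, t)     # predicted velocity d x / d t at time t
        x = x + v * dt             # Euler step along the (assumed) straight path
    return x
```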
The paper introduces FMA-Net, a novel framework for joint video super-resolution and deblurring (VSRDB), which employs flow-guided dynamic filtering (FGDF) and iterative feature refinement with multi-attention (FRMA). The FGDF allows for precise estimation of motion-aware degradation and restoration kernels, enhancing the model's ability to handle large motions effectively. The FRMA iteratively refines features through a coarse-to-fine approach, utilizing a new temporal anchor loss to stabilize training. Extensive experiments demonstrate that FMA-Net outperforms state-of-the-art methods in both quantitative and qualitative assessments across various datasets, providing significant improvements in video restoration quality.
The paper presents FreeU, a novel method designed to enhance the performance of diffusion U-Nets, which are widely used in generative models for image and video synthesis. By analyzing the denoising process, the authors identify that the backbone of the U-Net is primarily responsible for denoising, while skip connections introduce high-frequency features. FreeU strategically re-weights these contributions through two scaling factors, significantly improving generation quality without requiring additional training or increasing computational costs. Extensive experiments demonstrate that FreeU can be easily integrated into existing diffusion models like Stable Diffusion and ModelScope, leading to superior image and video quality, thus showcasing its practical applicability in enhancing generative tasks.
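The core re-weighting is simple enough to sketch directly: amplify the backbone feature map and damp the skip features before they are fused in each decoder stage. The published method applies the backbone factor more selectively and modulates skip features in the Fourier domain, so the snippet below is a deliberately reduced illustration.

```python
import torch

def freeu_rescale(backbone_feat, skip_feat, b=1.2, s=0.9):
    """backbone_feat, skip_feat: (B, C, H, W) inputs to a U-Net decoder stage."""
    backbone_feat = backbone_feat * b   # strengthen the denoising backbone
    skip_feat = skip_feat * s           # damp high-frequency skip features
    return torch.cat([backbone_feat, skip_feat], dim=1)
```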
This paper presents a novel framework, From-SAM-to-CAMs (S2C), for Weakly Supervised Semantic Segmentation (WSSS), which enhances the quality of Class Activation Maps (CAMs) by leveraging the Segment Anything Model (SAM) during training rather than just inference. The S2C framework consists of two main components: SAM-Segment Contrasting (SSC) and a CAM-based Prompting Module (CPM). SSC uses SAM's segmentation capabilities to create prototypes that guide feature learning in the classifier, while CPM refines CAMs into class-specific segmentation masks, aggregating these into a unified self-supervision mechanism. The proposed method significantly outperforms existing WSSS approaches across multiple benchmarks, demonstrating a robust ability to produce high-quality semantic segmentation maps.
This paper presents a novel approach for generating realistic animations from a single RGB image by modeling scene motion through a learned generative prior using spectral volumes. The method captures dense, long-term pixel trajectories in the Fourier domain, allowing for the transformation of still images into seamlessly looping videos and interactive simulations that respond to user inputs. The authors employ a frequency-coordinated latent diffusion model to predict spectral volumes, which are then converted into a motion texture for rendering future frames. The results demonstrate significant improvements in animation quality and coherence compared to previous methods, facilitating applications such as slow-motion effects and dynamic image interactions.
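To illustrate the spectral-volume representation, the sketch below converts per-pixel Fourier coefficients into displacement trajectories with an inverse real FFT over time. Array shapes and normalization are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def spectral_volume_to_trajectories(coeffs, num_frames):
    """
    coeffs: (H, W, K, 2) complex array holding K low-frequency Fourier terms
            for the x- and y-displacement of every pixel.
    Returns (T, H, W, 2) real displacement fields for T = num_frames.
    """
    # Inverse real FFT over the time axis; missing high frequencies are
    # implicitly zero-padded, giving a smooth, seamlessly looping trajectory.
    traj = np.fft.irfft(coeffs, n=num_frames, axis=2)   # (H, W, T, 2)
    return np.transpose(traj, (2, 0, 1, 3))
```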
The paper presents GPLD3D, a novel latent diffusion model aimed at enhancing the geometric feasibility and physical stability of generated 3D shapes. Traditional generative models often fail to accurately represent critical shape properties, leading to disconnected and unstable synthetic shapes. GPLD3D addresses these issues by incorporating a quality checker that evaluates the geometric feasibility and physical stability of shapes during the diffusion process. This quality checker employs learned scoring functions to assess shapes, allowing for a principled adjustment of trade-off parameters in the model. Comprehensive experiments on the ShapeNet-v2 dataset demonstrate that GPLD3D significantly outperforms existing state-of-the-art shape generators in both qualitative and quantitative metrics, showcasing its effectiveness in producing high-quality synthetic 3D shapes.
This paper addresses the lack of a systematic analysis of grid-based models used for neural field representation by introducing a theoretical framework centered around grid tangent kernels (GTK). The authors demonstrate that GTKs are intrinsic properties that dictate the approximation and generalization behaviors of these models. Building on this theory, they present a novel grid-based model called the Multiplicative Fourier Adaptive Grid (MulFAGrid), which achieves superior generalization performance compared to existing models. Empirical results show that MulFAGrid excels in various tasks, including 2D image fitting, 3D signed distance field reconstruction, and novel view synthesis, indicating its robust representation capabilities and lower generalization bounds. The study offers insights that could guide the design of future grid-based models in computer vision and machine learning.
The paper presents the Image Processing GNN (IPG) model, which addresses the limitations of traditional Super-Resolution (SR) methods that rely on rigid pixel aggregation through CNNs and attention mechanisms. By leveraging the flexibility of graph structures, IPG adapts to the unbalanced nature of SR tasks, where detail-rich areas require more reconstruction effort. The model introduces a degree-varying graph construction that assigns higher connectivity to detail-rich pixels and utilizes both local and global sampling strategies for efficient information aggregation. Experimental results demonstrate that IPG outperforms state-of-the-art SR models across several datasets, showcasing its effectiveness in producing high-resolution images while maintaining computational efficiency.
This paper presents a novel method for improving semantic correspondence estimation by integrating weak geometric understanding through spherical mapping, aimed at addressing the limitations of current self-supervised models in recognizing object symmetries and repeated parts. By leveraging a weak 3D prior and coarse viewpoint information, the proposed approach enhances the discriminative capability of learned representations without requiring extensive 3D supervision. The authors also introduce a new evaluation metric, Keypoint Average Precision (KAP), which better accounts for symmetry-related errors compared to traditional metrics. Experiments demonstrate that the method significantly outperforms existing techniques on various datasets, showcasing its effectiveness in distinguishing between similar object parts across different views.
This paper investigates how data transformations, specifically random pixel permutation (RPP), can significantly speed up the training of neural fields, a data representation paradigm that requires extensive optimization. The authors find that RPP accelerates the training process by removing easily-fitted patterns that typically slow down the later stages of training, allowing the network to focus on capturing high-frequency details more effectively. Through empirical studies across various datasets and architectures, the authors demonstrate that RPP consistently reduces the number of stochastic gradient descent (SGD) steps needed to achieve desired fidelity levels. Their findings reveal that while RPP initially leads to slower fitting of moderate PSNR levels, it enables rapid convergence to high PSNR levels, thus offering insights into optimizing neural field training by leveraging the optimization bias inherent in SGD.
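The transformation itself is easy to state in code: shuffle which pixel value sits at which coordinate using one fixed permutation, fit the neural field to the shuffled target, and invert the permutation on the field's output. The sketch below shows only this data transformation, under the assumption that pixel values (not coordinates) are permuted; the training loop is omitted.

```python
import numpy as np

def make_permuted_target(image):
    """image: (H, W, C). Returns the shuffled image and the inverse permutation."""
    H, W, C = image.shape
    flat = image.reshape(-1, C)
    perm = np.random.permutation(H * W)    # one fixed random pixel permutation
    inv_perm = np.argsort(perm)
    return flat[perm].reshape(H, W, C), inv_perm

def unpermute(pred_image, inv_perm):
    """Undo the permutation on the fitted field's output to recover the layout."""
    H, W, C = pred_image.shape
    return pred_image.reshape(-1, C)[inv_perm].reshape(H, W, C)
```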
The paper introduces Instruct-Imagen, a novel image generation model designed to handle heterogeneous tasks by utilizing multi-modal instructions that integrate various modalities such as text, edges, and styles. The model employs a two-stage training approach, first adapting a pre-trained text-to-image diffusion model using retrieval-augmented training to enhance its capacity to ground generation on multi-modal context, followed by fine-tuning it on diverse image generation tasks. Instruct-Imagen demonstrates strong capabilities in understanding complex multi-modal instructions, achieving superior performance compared to prior state-of-the-art models in both in-domain and zero-shot tasks, effectively generalizing to unseen and complex image generation challenges. The research highlights the importance of multi-modal instruction in improving the model’s adaptability and generation accuracy across various tasks.
This paper introduces InternVL, a large-scale vision-language foundation model that scales the vision encoder to 6 billion parameters and aligns it with a language middleware to enhance performance on various visual-linguistic tasks. The model employs a progressive alignment strategy utilizing web-scale image-text data for efficient training, demonstrating state-of-the-art results across 32 benchmarks, including zero-shot image classification, video classification, and multi-modal dialogue systems. InternVL bridges the gap between vision and large language models, showcasing its versatility and effectiveness in handling generic visual-linguistic tasks through a robust design that integrates contrastive and generative learning approaches.
The paper presents a novel framework, LDP (Language-driven Dual-Pixel Image Defocus Deblurring Network), that leverages the contrastive language-image pre-training framework (CLIP) to estimate blur maps from dual-pixel (DP) image pairs without requiring additional data. The authors design specific text prompts to enable CLIP to understand and quantify blur-related geometric information, facilitating accurate blur map estimation. This estimated blur map is then utilized in a deblurring network featuring a blur-prior attention mechanism and specially formulated loss functions to restore sharp images from the DP pairs. The proposed method demonstrates state-of-the-art performance across multiple benchmark datasets, highlighting the effectiveness of combining language-driven approaches with low-level vision tasks.
This paper presents a novel approach to learning Structure-from-Motion (SfM) using Graph Attention Networks (GATs), aiming to improve the initialization process traditionally reliant on iterative minimization techniques like Bundle Adjustment (BA). The proposed model processes 2D keypoints across multiple views to predict camera poses and 3D coordinates without the need for scene-specific training or fine-tuning, significantly enhancing inference speed and generalization to unseen scenes. Experimental results demonstrate that the GAT-based method outperforms existing learning-based techniques and approaches the performance of conventional SfM methods like COLMAP, while substantially reducing runtime. Additionally, the model shows resilience to outliers through data augmentation and outlier injection strategies, suggesting a promising direction for future research in efficient and robust 3D reconstruction.
This paper proposes a novel visual localization method called DeViLoc, which enhances the accuracy of camera pose estimation in challenging scenarios like nighttime and adverse weather by generating reliable semi-dense 2D-3D correspondences. Unlike existing techniques that depend heavily on predefined 3D feature points, DeViLoc utilizes a Point Inference Network (PIN) to regress observed and unobserved 2D keypoints into 3D coordinates. This method effectively aggregates matching information through a Confidence-based Point Aggregation (CPA) module, significantly improving performance in noisy conditions. Comprehensive evaluations show that DeViLoc outperforms state-of-the-art methods across multiple datasets, demonstrating its robustness and adaptability in practical applications.
This paper introduces a novel task called weakly-supervised Narration-based Video Object Segmentation (NVOS), which aims to segment object instances mentioned in the narration of egocentric videos without requiring spatial annotations during training. The proposed framework, named ROSA, utilizes vision-language models to establish pixel-level alignments between referred objects and segmentation mask proposals. The authors address the challenges posed by cluttered scenes and object occlusions by employing a Global-Local Contrastive Learning approach, which combines video-narration alignment with region-phrase similarities. To evaluate their method, they create a new benchmark dataset called VISOR-NVOS, consisting of detailed object-based narrations linked to existing segmentation masks. The results demonstrate that ROSA achieves state-of-the-art zero-shot grounding performance on egocentric video datasets, showcasing its effectiveness in fine-grained video-language understanding.
The paper introduces a novel task called reasoning segmentation, which requires generating a binary segmentation mask from an implicit and complex text query related to an image. The authors propose LISA (Large Language Instructed Segmentation Assistant), a model that leverages the capabilities of multimodal large language models (LLMs) to handle such tasks effectively. LISA incorporates an additional <SEG> token into the vocabulary and follows an embedding-as-mask paradigm, decoding that token's embedding into a segmentation mask, which equips the model with segmentation ability while preserving its reasoning and conversational capabilities.
The paper presents LTGC, a novel framework for long-tail recognition that leverages the knowledge of large language models (LLMs) to generate diverse content for tail categories, addressing challenges such as data scarcity and class imbalance. LTGC employs a two-step process where it first analyzes existing tail data to create a description list and then extends this list using LLMs to generate new, diverse tail-class descriptions. These descriptions are transformed into images using a text-to-image model, and an iterative evaluation module ensures the quality of generated images. The framework incorporates the BalanceMix module to effectively fine-tune the model with both generated and original data, significantly improving performance on long-tail benchmarks compared to existing methods. Experimental results demonstrate that LTGC outperforms state-of-the-art techniques, showcasing its effectiveness in enhancing long-tail recognition tasks.
The paper introduces MicKey, a novel pipeline for estimating metric relative pose between two 2D images by predicting 3D keypoint coordinates directly from the images without requiring depth measurements or knowledge of image overlap. By leveraging a differentiable pose estimation framework, MicKey establishes metric correspondences in camera space through an end-to-end learning strategy that utilizes only relative pose supervision. The approach outperforms state-of-the-art methods in the Map-Free Relocalization benchmark while requiring less supervision, proving its effectiveness in applications requiring precise pose estimation for augmented reality.
This paper introduces MemSAM, a novel model designed for segmenting echocardiography videos by adapting the Segment Anything Model (SAM) to address challenges unique to medical video data, such as speckle noise, ambiguous boundaries, and variability of objects across frames. MemSAM employs a temporal-aware and noise-resilient prompting scheme using a space-time memory that captures both spatial and temporal information to enhance segmentation accuracy. The model incorporates a memory reinforcement mechanism to improve memory quality before updates, thereby mitigating the effects of noise and artifacts. Evaluations on two publicly available datasets demonstrate that MemSAM achieves state-of-the-art performance with limited annotations, comparable to fully supervised approaches, showcasing its potential for automated echocardiographic assessments.
The paper introduces MetaCloak, a novel approach to protect user images from unauthorized subject-driven text-to-image diffusion-based synthesis, addressing vulnerabilities in existing methods that are ineffective against data transformations. By employing a meta-learning framework and a transformation sampling process, MetaCloak generates robust, model-agnostic perturbations that effectively distort the semantic integrity of personalized images. Extensive experiments on datasets like VGGFace2 and CelebA-HQ demonstrate that MetaCloak outperforms previous methods, successfully fooling online training services and providing a strong defense against unauthorized use while maintaining a high level of robustness under various data transformations.
The paper presents Mip-Splatting, a novel enhancement of 3D Gaussian Splatting (3DGS) aimed at addressing aliasing artifacts encountered during image rendering at varying sampling rates. By introducing a 3D smoothing filter that constrains the maximum frequency of 3D Gaussian primitives based on the input views and replacing the traditional 2D dilation filter with a 2D Mip filter to better simulate physical imaging processes, the authors effectively eliminate high-frequency artifacts and improve rendering fidelity. Experimental results demonstrate that Mip-Splatting significantly outperforms existing methods, particularly in out-of-distribution scenarios, thereby enhancing generalization to different camera poses and zoom levels.
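A brief note on why the 3D smoothing filter stays inside the Gaussian framework (generic notation, with s standing for the low-pass variance the paper ties to the maximal sampling rate of the training views): convolving a Gaussian primitive with an isotropic Gaussian low-pass filter yields another Gaussian with enlarged covariance, so band-limiting can be absorbed directly into the primitive's parameters.

```latex
\mathcal{G}_{\mu_k,\,\Sigma_k} \ast \mathcal{G}_{0,\,sI} \;=\; \mathcal{G}_{\mu_k,\,\Sigma_k + sI},
\qquad
\tilde{G}_k(\mathbf{x}) \;\propto\;
\exp\!\Big(-\tfrac{1}{2}\,(\mathbf{x}-\mu_k)^{\top}(\Sigma_k + sI)^{-1}(\mathbf{x}-\mu_k)\Big)
```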
This paper presents a novel approach to simplify Vision Transformers by selectively removing non-essential attention layers based on entropy analysis. The authors argue that certain attention layers in lower blocks of the model carry less information and can be integrated into subsequent MLP layers without performance degradation. By employing an entropy-based selection strategy, termed NOSE, the method identifies which attention layers to prune in order to minimize the impact on overall model performance. Experimental results demonstrate that the proposed approach can reduce network parameters by 13.7% and improve throughput by 20.5% while maintaining competitive accuracy on tasks like image classification, showcasing its potential for enhancing computational efficiency in Vision Transformers.
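One way to realize the entropy-based selection is sketched below: compute the Shannon entropy of each attention distribution, average it over heads, queries, and a batch of images, and rank layers from least to most informative. The ranking function and any thresholds here are illustrative, not the paper's exact NOSE procedure.

```python
import torch

def attention_entropy(attn):
    """attn: (B, heads, Q, K) attention rows summing to 1. Returns a scalar."""
    eps = 1e-12
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy per query
    return ent.mean()

def rank_layers(attn_maps_per_layer):
    """attn_maps_per_layer: list of (B, heads, Q, K) tensors, one per layer.
    Returns layer indices sorted from lowest to highest average entropy."""
    scores = [attention_entropy(a).item() for a in attn_maps_per_layer]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```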
The paper presents the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, designed to assess the capabilities of large multimodal models (LMMs) in handling complex, college-level tasks that require expert knowledge across six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The benchmark comprises 11,500 questions featuring 30 heterogeneous image types and interleaved text, emphasizing advanced reasoning and perception aligned with expert-level performance. Evaluations of several models, including GPT-4V and Gemini, reveal significant challenges, with accuracies around 56% and 59%, respectively, underscoring the gap between current AI capabilities and expert-level reasoning. The MMMU aims to stimulate further research towards achieving expert artificial general intelligence by highlighting the complexities of multimodal understanding and the necessity for deep domain knowledge.
This paper addresses the complexities of modeling multimodal social interactions, particularly in multi-party environments, where both verbal and non-verbal cues are crucial for understanding social dynamics. The authors introduce three new tasks—speaking target identification, pronoun coreference resolution, and mentioned player prediction—within the context of social deduction games, accompanied by extensive dataset annotations. They propose a novel baseline model that utilizes densely aligned language-visual representations, allowing for a synchronized analysis of verbal utterances and corresponding visual features. Experimental results demonstrate the effectiveness of this approach, showcasing significant improvements over existing methods by capturing the intricate interactions among multiple speakers and their gestures. The paper contributes to the field by providing new tasks, a robust dataset, and a baseline model, thus facilitating further research in multimodal social interaction analysis.
The paper presents MonoHair, a novel framework for high-fidelity 3D hair modeling from monocular videos, addressing limitations of existing methods that require strict capture conditions or rely heavily on learned prior data. The approach consists of two main stages: the first involves precise exterior hair geometry reconstruction using a Patch-based Multi-View Optimization (PMVO) method, which integrates information from multiple views without prior data dependence. The second stage infers the interior hair structure using a data-driven multi-view reconstruction technique, enhancing accuracy by aligning with 2D structural renderings derived from the exterior geometry. Experimental results demonstrate that MonoHair robustly reconstructs diverse hairstyles, including curly hair, and achieves state-of-the-art performance with significant efficiency improvements over previous methods.
The paper introduces MultiPly, a novel framework designed to reconstruct multiple individuals in 3D from monocular videos captured in real-world settings. The framework addresses significant challenges such as occlusions and close human interactions, which complicate the accurate 3D modeling of multiple subjects. MultiPly employs a layered neural representation to separate individual human models and background, utilizing layer-wise differentiable volume rendering to learn from video data. Additionally, it incorporates a hybrid instance segmentation approach and a confidence-guided optimization strategy to ensure high fidelity and temporal consistency in the reconstructions. Evaluation results demonstrate MultiPly's superiority over existing methods in various tasks, including human reconstruction and pose estimation, particularly in complex scenarios with severe occlusions.
The paper presents a novel approach called Heuristics-Guided Segmentation (HuGS) to enhance Neural Radiance Fields (NeRF) in non-static scenes, addressing the challenges posed by transient distractors like moving objects and shadows. The proposed method combines hand-crafted heuristics with advanced segmentation techniques to effectively differentiate static elements from transient ones, improving the quality of 3D scene reconstruction. By integrating Structure-from-Motion (SfM)-based heuristics and color residual heuristics, HuGS achieves accurate static vs. transient separations across diverse textures. Extensive experiments demonstrate its robustness and superiority over existing methods in mitigating artifacts in NeRF trained on non-static scenes, showcasing significant improvements in view synthesis quality.
Paper URL: https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_Neural_Lineage_CVPR_2024_paper.pdf
This paper introduces a novel task called neural lineage detection, which aims to identify the parent-child relationships between neural network models based on fine-tuning processes. Two methodologies are proposed: a learning-free approach that integrates an approximation of fine-tuning into similarity metrics for lineage detection and a learning-based method utilizing a transformer architecture for enhanced accuracy. Experimental results demonstrate that both methods outperform existing baseline techniques across various model architectures and tasks, including classification, segmentation, and detection, while also effectively tracing cross-generational lineage. The paper highlights the significance of understanding model relationships for applications in model reuse, intellectual property protection, and accountability in deep learning.
The paper "Neural Redshift: Random Networks are not Random Functions" investigates the generalization capabilities of neural networks (NNs) by examining untrained, random-weight networks to identify inductive biases independent of gradient descent. The authors find that architectures, particularly those using ReLU activations, exhibit strong biases toward low-complexity functions, which align with real-world data patterns. This phenomenon, termed "Neural Redshift," suggests that the effectiveness of NNs is not an inherent property but rather a result of suitable architectural choices. The study provides a fresh perspective on deep learning's success, emphasizing the importance of understanding the complexities of neural network architectures and their implications for training and generalization across various tasks.
This paper presents NoiseCLR, an unsupervised contrastive learning framework designed to discover interpretable directions in text-to-image diffusion models, specifically targeting Stable Diffusion. Unlike existing methods that rely on textual prompts or labeled data, NoiseCLR identifies semantically meaningful directions using a small set of unlabeled images from various domains such as faces, cats, and art. The approach facilitates highly disentangled image edits, allowing for simultaneous modifications within a single domain or across multiple domains without interference. Extensive experiments demonstrate that NoiseCLR outperforms existing diffusion-based and GAN-based image editing techniques, enhancing both control and transparency in the generative process while addressing potential biases inherent in large models.
This paper presents a theoretical framework for modeling opaque solids as volumetric entities using stochastic geometry. The authors establish conditions under which opaque solids can be represented via exponential volumetric transport and derive expressions for the volumetric attenuation coefficient based on probability distributions of underlying indicator functions. The study extends traditional volumetric representations to accommodate both isotropic and anisotropic scattering behaviors and introduces stochastic implicit surface representations. By rigorously deriving their model from first principles, the authors ensure compliance with physical constraints such as reciprocity and reversibility. Experimental results demonstrate significant improvements in 3D reconstruction tasks when employing their proposed volumetric representation compared to previous methods. The findings suggest a robust foundation for future research in volumetric modeling of opaque solids in computer graphics and applied physics.
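For reference, the exponential volumetric transport that the paper derives conditions for takes the standard transmittance form below (generic notation; the paper's contribution is deriving the attenuation coefficient from probability distributions over the solid's indicator function, which is not reproduced here).

```latex
T(\mathbf{x}, \boldsymbol{\omega}, s) \;=\;
\exp\!\Big(-\int_{0}^{s} \sigma_t\big(\mathbf{x} + t\,\boldsymbol{\omega}\big)\,\mathrm{d}t\Big)
```

Here T is the probability that light travels a distance s from point x along direction ω without attenuation, and σ_t is the volumetric attenuation coefficient.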
The paper introduces Panoptic Scene Completion (PSC), an advanced method for understanding 3D scenes by integrating geometry, semantics, and instance-level predictions from sparse input data. The proposed technique, PaSCo, employs a hybrid mask-based approach leveraging a multi-input multi-output (MIMO) strategy to enhance performance and provide uncertainty estimates at both voxel and instance levels, crucial for applications in robotics and autonomous driving. Experimental results demonstrate that PaSCo surpasses existing methods in both PSC accuracy and uncertainty estimation across three large-scale urban datasets, showcasing its effectiveness and efficiency in completing and interpreting 3D scenes.
The paper presents pixelSplat, a novel feed-forward model for 3D reconstruction that utilizes pairs of images to learn and generate 3D radiance fields defined by 3D Gaussian primitives. This approach achieves real-time and memory-efficient rendering while overcoming local minima issues associated with sparse representations by predicting a dense probability distribution for Gaussian means. The model employs a differentiable sampling technique that allows gradient backpropagation through the Gaussian representation, leading to significant performance improvements over existing methods, particularly in terms of rendering speed and resource efficiency. Extensive benchmarking on datasets like RealEstate10k and ACID demonstrates that pixelSplat outperforms state-of-the-art light field transformers, producing interpretable and editable 3D representations while drastically reducing training and inference costs.
The paper presents PlatoNeRF, a novel approach for 3D scene reconstruction from a single view using two-bounce signals captured by single-photon lidar. Unlike traditional NeRF methods that rely on multiple views, PlatoNeRF leverages time-of-flight data to accurately model occluded geometry without relying on data priors or controlled lighting conditions. The method combines the strengths of neural radiance fields and two-bounce lidar data, enabling it to reconstruct both visible and hidden geometry effectively. Experimental results demonstrate that PlatoNeRF outperforms existing methods in terms of accuracy and robustness, particularly under varying sensor constraints and scene properties, making it a promising solution for applications in autonomous systems and extended reality.
The paper introduces Point Transformer V3 (PTv3), a novel model designed to enhance point cloud processing by prioritizing simplicity and efficiency over complex design elements. PTv3 significantly expands the receptive field from 16 to 1024 points, achieving a 3.3 times increase in processing speed and a 10.2 times reduction in memory consumption compared to its predecessor, PTv2. Utilizing a serialization-based approach, PTv3 effectively organizes unstructured point clouds, allowing for improved performance in over 20 downstream tasks across both indoor and outdoor scenarios. The model demonstrates state-of-the-art results in semantic segmentation and object detection while maintaining low latency, making it suitable for real-time applications. Overall, PTv3 embodies a shift towards scalability in model design, emphasizing the importance of efficient processing in 3D perception tasks.
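Serialization can be illustrated with the Z-order (Morton) curve: voxel coordinates are bit-interleaved into a single code and points are sorted by it, so spatial neighbors tend to stay adjacent in the resulting 1D sequence. PTv3 supports several space-filling curves; only this generic variant is sketched, with an assumed voxel size.

```python
import numpy as np

def morton_code(grid_xyz, bits=10):
    """grid_xyz: (N, 3) non-negative integer voxel coordinates (< 2**bits)."""
    code = np.zeros(len(grid_xyz), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid_xyz[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)    # interleave bits: ...z1y1x1 z0y0x0
    return code

def serialize(points, voxel_size=0.05):
    """Sort raw points into a 1D sequence by their Morton code."""
    grid = np.floor(points / voxel_size).astype(np.int64)
    grid -= grid.min(axis=0)                 # shift to non-negative coordinates
    order = np.argsort(morton_code(grid))
    return points[order], order
```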
The paper discusses enhancements to online high-definition (HD) map estimation methods for autonomous vehicles (AVs) by incorporating uncertainty estimates, which are critical for improving trajectory prediction. Traditional mapping approaches lack confidence measures, causing potential errors in downstream tasks. The authors extend existing state-of-the-art methods to output uncertainty alongside map data, demonstrating that this integration leads to a 50% faster training convergence and a 15% improvement in prediction accuracy, as validated on the nuScenes dataset. The proposed framework captures various sources of uncertainty, such as occlusions and sensor range, and shows significant improvements in the performance of trajectory prediction models when leveraging this enhanced mapping information.
The paper presents Ranni, an innovative framework designed to enhance text-to-image (T2I) diffusion models' ability to interpret and respond to complex prompts. Ranni leverages a semantic panel, a structured middleware that organizes visual concepts parsed from input text using large language models (LLMs). This approach addresses common challenges in T2I synthesis, such as object quantity, attribute binding, and spatial relationships. By dividing the generation process into text-to-panel and panel-to-image tasks, Ranni improves textual controllability and facilitates intuitive image editing through user-friendly operations. The framework supports both manual and LLM-assisted editing, demonstrating significant advancements in prompt following accuracy, interactive generation, and continuous refinement of images based on user instructions. Overall, Ranni represents a notable step forward in creating flexible, chat-based image generation systems.
The paper presents Relightable Gaussian Codec Avatars, a novel approach for creating high-fidelity, relightable 3D head avatars that can be animated in real-time using 3D Gaussians and learnable radiance transfer. This method addresses the challenges of accurately modeling complex human head materials, including skin, hair, and eyes, by employing a unified appearance model that supports all-frequency reflections and diverse materials. The use of 3D Gaussians allows for detailed geometric representations, particularly of intricate structures like hair, while the learnable radiance transfer facilitates efficient relighting under various illumination conditions. The approach excels in real-time performance, demonstrating significant improvements over existing methods in terms of both visual fidelity and computational efficiency, particularly in applications such as gaming and telecommunication.
This paper introduces Marigold, a novel method for monocular depth estimation that repurposes the capabilities of diffusion-based image generators, specifically leveraging the Stable Diffusion model. By fine-tuning this model with synthetic data, Marigold achieves state-of-the-art performance in depth estimation across various datasets, even in zero-shot scenarios where it encounters unfamiliar content. The approach emphasizes utilizing the extensive visual knowledge embedded in generative models to enhance generalizability and accuracy in depth estimation tasks. The results demonstrate significant improvements in depth estimation quality, highlighting the potential of combining generative modeling techniques with depth estimation frameworks.
This paper addresses the challenges in surface normal estimation from single RGB images by proposing new inductive biases tailored for this task. The authors suggest utilizing per-pixel ray direction and modeling the relative rotational relationships between neighboring surface normals. Their method incorporates these biases into a deep learning framework, allowing for piecewise smooth predictions that maintain detail, even in complex, out-of-distribution images. The results demonstrate that their approach outperforms state-of-the-art methods, particularly in terms of generalization ability, despite being trained on a substantially smaller dataset. The model's architecture is designed to work with images of arbitrary resolution and aspect ratio, making it suitable for various computer vision applications.
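The per-pixel ray-direction bias is straightforward to compute for a pinhole camera: each pixel's viewing ray is the normalized back-projection of its homogeneous coordinate through the inverse intrinsics. The sketch below shows only this computation; how the directions are injected into the network follows the paper and is not shown.

```python
import numpy as np

def pixel_ray_directions(H, W, K):
    """K: (3, 3) pinhole intrinsics. Returns (H, W, 3) unit ray directions."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)   # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)             # homogeneous coords
    rays = pix @ np.linalg.inv(K).T                              # back-project K^{-1} [u, v, 1]^T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```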
The paper presents the Retrieval-Augmented Layout Transformer (RALF), a novel approach for content-aware layout generation that addresses the limitations of existing methods due to data scarcity. By incorporating retrieval augmentation, RALF enhances the layout generation process by retrieving similar layout examples based on input images and integrating these references into an autoregressive model. The model demonstrates superior performance in generating high-quality layouts across various tasks, achieving significant improvements over baseline methods with less training data required. Extensive experiments validate RALF's capability to produce aesthetically pleasing and contextually appropriate layouts, making it a promising tool for graphic design applications.
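The retrieval step itself can be as simple as a nearest-neighbour lookup in an image-feature space; the sketch below shows that idea with cosine similarity over precomputed features (the feature extractor and layout store are placeholders, not RALF's actual retrieval module).

```python
import numpy as np

def retrieve_layouts(query_feat, gallery_feats, gallery_layouts, k=3):
    """Return the k stored layouts whose canvas features are most similar
    (cosine similarity) to the query canvas; these references are then fed
    to the autoregressive layout generator."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    top = np.argsort(-(g @ q))[:k]
    return [gallery_layouts[i] for i in top]

rng = np.random.default_rng(0)
gallery_feats = rng.standard_normal((500, 512))
gallery_layouts = [f"layout_{i}" for i in range(500)]
refs = retrieve_layouts(rng.standard_normal(512), gallery_feats, gallery_layouts)
print(refs)
```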
This paper presents a novel approach to enhancing text-to-image (T2I) generation by introducing a rich human feedback dataset (RichHF-18K) composed of 18,000 annotated images. The dataset includes detailed annotations marking implausible regions and misaligned text prompts, alongside fine-grained scores for various quality aspects (plausibility, alignment, aesthetics, and overall quality). The authors develop a multimodal transformer model, Rich Automatic Human Feedback (RAHF), that predicts these detailed feedback annotations, and they show the predicted feedback can improve generation quality, for example by fine-tuning generators on high-scoring samples or by inpainting predicted problem regions. The study reveals that the predicted feedback enhances generative models like Muse, showcasing the potential of rich human feedback in refining T2I outputs and setting a foundation for future research in this area.
The paper presents RoHM, a robust method for 3D human motion reconstruction from monocular RGB(-D) videos, specifically designed to handle noisy and occluded inputs. Unlike previous methods that either directly regress 3D motion or utilize time-consuming optimization techniques, RoHM leverages diffusion models to iteratively denoise and infill motion data, achieving globally coherent motion representation. The approach is structured around two separate models for global trajectory and local motion, enhanced by a flexible conditioning module to capture their interdependencies. Extensive experiments demonstrate that RoHM significantly outperforms state-of-the-art techniques in accuracy and physical plausibility, while also being 30 times faster during inference. The method's versatility is validated across diverse datasets, making it a promising advancement in the field of human motion reconstruction.
This paper presents S2MAE, a specialized pre-trained model designed for spectral remote sensing (RS) data, addressing the inadequacies of existing models that primarily focus on RGB imagery. S2MAE utilizes a 3D transformer architecture with a 90% masking ratio to capture local spectral consistency and spatial invariance, effectively leveraging large unlabeled spectral datasets through progressive pretraining. The model's efficacy is validated across three downstream tasks, demonstrating superior performance in single and multi-label classification as well as change detection, significantly outperforming existing methods. Through extensive ablation studies, the research highlights the importance of a high masking ratio and the need for tailored masking strategies to enhance representation learning in spectral imagery.
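The core of such masked-autoencoder pretraining is randomly hiding most patch tokens and reconstructing them; the sketch below shows a generic 90%-ratio random masking step (a standard MAE-style routine, not S2MAE's exact code).

```python
import torch

def random_masking(tokens, mask_ratio=0.9):
    """MAE-style random masking: keep only a small visible subset of the
    patch tokens; the decoder later reconstructs the masked ones."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)   # 0 = kept, 1 = masked
    return visible, mask, ids_keep

tokens = torch.randn(2, 196, 768)     # e.g. spatio-spectral patch embeddings
visible, mask, ids = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))
```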
This paper introduces SAFDNet, a novel fully sparse adaptive feature diffusion network designed for LiDAR-based 3D object detection, which addresses the computational inefficiencies associated with dense feature maps in existing models. SAFDNet employs an adaptive feature diffusion strategy to mitigate the center feature missing problem prevalent in fully sparse detectors. Experimental results demonstrate that SAFDNet outperforms previous state-of-the-art methods, achieving superior accuracy on long-range detection tasks, particularly on the Argoverse2 dataset, while also maintaining faster inference speeds compared to hybrid and other sparse detectors. The architecture's design allows for straightforward adaptation to various scenarios, making it a promising approach for enhancing performance in autonomous driving applications.
The paper introduces SceneFun3D, a novel large-scale dataset that provides over 14,800 fine-grained functional interaction annotations across 710 high-resolution 3D indoor scenes. This dataset includes detailed motion parameters and diverse natural language task descriptions, aimed at enhancing 3D scene understanding beyond traditional semantic segmentation. The authors define three new tasks—functionality segmentation, task-driven affordance grounding, and 3D motion estimation—to evaluate model performance. Their findings indicate that existing methods struggle with accurately detecting and interacting with functional elements in real-world scenarios, highlighting the need for improved understanding of affordances and interaction dynamics in 3D environments.
This paper presents a novel approach to reconstructing a radiance field of a scene observed by a person using only the reflections from their eyes in a sequence of images taken from a stationary camera. The authors address challenges such as accurately estimating eye poses and separating the complex iris textures from scene reflections. They introduce a method that optimizes cornea poses and incorporates a regularization prior for iris texture to enhance the quality of scene reconstruction. The approach is validated through extensive experiments on both synthetic and real-world datasets, demonstrating its effectiveness in recovering detailed 3D scenes from eye reflections, even under various lighting conditions.
The paper presents SHERT, a novel framework for semantic human mesh reconstruction that effectively combines geometric detail and texture generation from monocular images. SHERT addresses challenges faced by existing methods, such as unstable results and low-quality meshes, by employing a pipeline that includes semantic- and normal-based sampling, self-supervised mesh completion and refinement, and a diffusion model for texture generation driven by both images and text prompts. The framework ensures high-quality triangle meshes, stable UV unwrapping, and the ability to animate and substitute different body parts, demonstrating superior performance compared to state-of-the-art techniques in both quantitative and qualitative experiments.
This paper introduces Recursive Specularity Factorization (RSF), a novel technique for low-light image enhancement that decomposes images into multiple additive specular components, allowing for controllable image relighting and improved enhancement tasks. By employing a model-driven RSFNet, the authors achieve zero-reference low-light enhancement without the need for paired training data, demonstrating superior performance on various benchmarks. The RSF method utilizes recursive estimation of sparsity thresholds to effectively separate specular factors, which can be used not only for low-light enhancement but also for other applications like dehazing, deraining, and deblurring. The results indicate that RSFNet outperforms existing state-of-the-art methods, showcasing high generalizability across diverse datasets.
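A toy version of the recursive peeling idea is sketched below: each pass thresholds the current residual and splits off a sparse bright component, so the image is re-expressed as a base layer plus additive specular factors. The fixed percentile threshold is an assumption for illustration; RSFNet estimates its sparsity thresholds in a model-driven, learned way.

```python
import numpy as np

def recursive_specular_factors(img, num_factors=4, q=90):
    """Recursively peel off sparse bright components above a percentile
    threshold. Illustrative only; not the paper's learned factorization."""
    residual = img.astype(np.float32).copy()
    factors = []
    for _ in range(num_factors):
        thresh = np.percentile(residual, q)
        factor = np.clip(residual - thresh, 0.0, None)  # sparse specular part
        factors.append(factor)
        residual = residual - factor
    return factors, residual   # img == residual + sum(factors)

img = np.random.rand(64, 64)
factors, base = recursive_specular_factors(img)
print(len(factors), np.allclose(img, base + sum(factors)))
```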
The paper "SpiderMatch" presents a novel approach for 3D shape matching that achieves global optimality and geometric consistency by utilizing a new representation called SpiderCurve, which is a self-intersecting curve tracing the surface of a 3D shape. The authors tackle the 3D shape matching problem by formulating it as an integer linear programming (ILP) problem, and they introduce constraints to maintain geometric consistency during the matching process. Their method is evaluated against existing state-of-the-art approaches and demonstrates competitive performance while ensuring that matches are both geometrically consistent and optimal. The experimental results indicate that their approach significantly outperforms previous methods, particularly in preserving neighborhood relationships between shape elements.
The paper introduces a novel framework called "steerers" designed for rotation equivariant keypoint descriptors, which enhances the robustness of learned image descriptors against large rotations while maintaining performance on upright images. Traditional learned descriptors struggle with rotation invariance, leading to either a loss of discriminative power or increased computational costs through test-time augmentation. The steerers function as linear transforms that adjust keypoint descriptors to simulate the effects of image rotation without needing to reprocess the images. The authors explore three optimization settings for steerers—fixing the descriptor, jointly optimizing both descriptor and steerer, and optimizing the descriptor while fixing the steerer—and demonstrate that their approach achieves state-of-the-art results on benchmarks like AIMS and Roto-360, while also performing competitively on non-rotated images in MegaDepth. The paper contributes to the theoretical understanding of steerers and their practical applications in improving image matching tasks in various domains, including 3D reconstruction and space applications.
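The mechanism is easy to state in code: a steerer is a matrix S such that the descriptor of a rotated image is approximately S applied to the original descriptor, so rotation is handled at matching time by powers of S rather than by re-describing rotated images. Below is a toy numpy sketch with a fixed block-diagonal 90-degree steerer and synthetic descriptors; the steerers in the paper are learned, not hand-built.

```python
import numpy as np

def steerer_90(dim):
    """Fixed block-diagonal steerer for 90-degree rotations (toy stand-in
    for a learned linear steerer; dim must be even)."""
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(dim // 2), block)

rng = np.random.default_rng(0)
dim, n = 8, 50
S = steerer_90(dim)
desc_a = rng.standard_normal((n, dim))
# Simulate descriptors of the same keypoints extracted from a 90deg-rotated image.
desc_b = desc_a @ S.T + 0.01 * rng.standard_normal((n, dim))

# At match time, steer desc_a by S^k (k = 0..3) and keep the best-scoring k,
# instead of re-running the descriptor network on rotated copies of the image.
best_k = max(range(4),
             key=lambda k: np.sum((desc_a @ np.linalg.matrix_power(S, k).T) * desc_b))
print(best_k)  # expected: 1
```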
The paper presents a novel method called Stratified Avatar Generation (SAGE) for reconstructing 3D full-body avatars from sparse observations, primarily using input from head-mounted devices that track only the head and hands. The authors address the challenges of accurately predicting lower body movements from limited data by employing a two-stage approach that first reconstructs the upper body and subsequently infers the lower body conditioned on the upper body’s output. Utilizing a disentangled body representation based on the Skinned Multi-Person Linear (SMPL) model, the method incorporates a latent diffusion model to represent and generate motion sequences more effectively. Extensive experiments demonstrate that SAGE outperforms existing state-of-the-art methods, particularly in lower-body motion estimation, showcasing its potential for enhancing immersive experiences in AR/VR applications.
The paper presents StyleAligned, a new method for ensuring style consistency in images generated by large-scale Text-to-Image (T2I) models, which traditionally struggle to maintain uniform style across outputs. StyleAligned utilizes minimal attention sharing during the diffusion process, allowing generated images to adhere to a reference style without requiring optimization or fine-tuning. The technique was evaluated against various styles and prompts, demonstrating superior style consistency and visual coherence compared to existing methods, while also maintaining high-quality image synthesis. By leveraging adaptive normalization in attention layers, StyleAligned effectively balances diversity and style adherence, paving the way for practical applications in creative domains.
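The attention-sharing step can be sketched in a few lines: while generating each image in a batch, its queries also attend to the keys and values of a designated reference image. The single-head toy below shows only that concatenation; the actual method additionally applies adaptive normalization to queries and keys, which is omitted here.

```python
import torch

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Target queries attend over the target's own keys/values concatenated
    with the reference image's, so the output inherits the reference style.
    Simplified single-head sketch, not the exact StyleAligned implementation."""
    k = torch.cat([k_tgt, k_ref], dim=1)   # (B, N_tgt + N_ref, D)
    v = torch.cat([v_tgt, v_ref], dim=1)
    attn = torch.softmax(q_tgt @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

B, N, D = 1, 16, 64
q, kt, vt = (torch.randn(B, N, D) for _ in range(3))
kr, vr = (torch.randn(B, N, D) for _ in range(2))
out = shared_attention(q, kt, vt, kr, vr)
print(out.shape)  # torch.Size([1, 16, 64])
```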
This paper presents a novel framework called Constrained Empirical Risk Minimization (CERM) for optimizing wavelets within deep learning architectures, specifically Convolutional Neural Networks (CNNs). The authors tackle the challenge of enforcing strict structural constraints on the network's convolutional filters to ensure they conform to wavelet properties, which is critical in applications like medical imaging, where accurate contour prediction is paramount. By using CERM, the filters are optimized to become task-specific wavelets during training, addressing limitations of traditional loss function-based constraints. Empirical evaluations demonstrate that the proposed wavelet networks significantly outperform baseline methods in contour prediction tasks on medical datasets, showcasing their efficacy in leveraging wavelet properties for enhanced performance in specialized applications.
This paper presents a novel method called Action Segmentation Optimal Transport (ASOT) for unsupervised action segmentation in long, untrimmed videos, focusing on achieving temporal consistency without requiring prior knowledge of action order. By formulating the task as a fused unbalanced Gromov-Wasserstein optimal transport problem, ASOT effectively decodes segmentations from a noisy affinity cost matrix between video frames and action classes. The method is evaluated across various datasets, including Breakfast and YouTube Instructions, demonstrating state-of-the-art results in unsupervised learning settings. ASOT also serves as a post-processing tool that enhances the performance of existing supervised methods. The approach addresses limitations of previous methods that enforce balanced assignments and rely on known action orderings, thus providing a more flexible and robust framework for action segmentation tasks.
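As a simplified stand-in for the fused unbalanced Gromov-Wasserstein formulation, the sketch below runs plain entropic (balanced) Sinkhorn between frames and action classes on a random cost matrix and reads a segmentation off the transport plan; it conveys only the decoding idea and omits the structural and unbalanced terms that make ASOT work.

```python
import numpy as np

def sinkhorn_segmentation(cost, n_iter=100, eps=0.1):
    """Entropic OT between video frames and action classes; the per-frame
    label is the argmax column of the resulting transport plan."""
    T, K = cost.shape
    kernel = np.exp(-cost / eps)
    a, b = np.ones(T) / T, np.ones(K) / K
    v = np.ones(K) / K
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = np.diag(u) @ kernel @ np.diag(v)
    return plan.argmax(axis=1)

cost = np.random.rand(200, 5)   # noisy frame-to-action affinity costs
labels = sinkhorn_segmentation(cost)
print(labels.shape, labels.min(), labels.max())
```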
This paper presents a novel approach for low-light image enhancement (LIE) using a new large-scale dataset called SDE, which contains over 30,000 spatially and temporally aligned pairs of images and events captured under varying illumination conditions. The authors developed a robotic system to ensure high precision in data collection and introduced an event-guided LIE framework named EvLight that integrates features from both images and events. Their method employs a signal-to-noise ratio (SNR)-guided strategy for selective feature fusion, enhancing robustness against illumination variations and noise. Experimental results demonstrate that EvLight significantly outperforms existing frame-based and event-guided methods, highlighting its effectiveness in improving low-light image quality.
This paper presents TANGLE, a novel framework that employs transcriptomics-guided slide representation learning to improve the processing and classification of giga-pixel whole-slide images (WSIs) in computational pathology. By leveraging gene expression profiles alongside histology slides, TANGLE utilizes multimodal pre-training to create robust slide embeddings that significantly enhance few-shot classification, prototype-based classification, and slide retrieval tasks across multiple datasets involving human and rat tissues. The study demonstrates that TANGLE outperforms existing self-supervised and supervised learning baselines, showcasing its potential for improved diagnostic capabilities in pathology.
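The multimodal pre-training objective is, in spirit, a CLIP-style alignment between a slide embedding and the expression profile of the same case; the sketch below shows such a symmetric contrastive loss on random embeddings, as a generic illustration rather than TANGLE's exact objective.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(slide_emb, expr_emb, temperature=0.07):
    """CLIP-style loss pairing each slide embedding with the expression
    embedding of the same case; off-diagonal pairs act as negatives."""
    s = F.normalize(slide_emb, dim=-1)
    e = F.normalize(expr_emb, dim=-1)
    logits = s @ e.t() / temperature
    labels = torch.arange(s.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```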
The paper presents a novel framework called Tri-Perspective View Decomposition (TPVD) to enhance depth completion, an essential task in autonomous driving that involves reconstructing dense depth maps from sparse measurements. Unlike traditional methods that primarily utilize 2D representations or directly incorporate 3D point clouds, TPVD decomposes the 3D point cloud into three distinct 2D views, effectively allowing for the densification of sparse depth inputs while preserving 3D geometric information. The framework employs a TPV Fusion mechanism for recurrent 2D-3D-2D feature aggregation and introduces a Distance-Aware Spherical Convolution for improved geometric consistency. The proposed method outperforms existing state-of-the-art approaches on several benchmark datasets, including KITTI, NYUv2, and SUN RGB-D, and contributes a new depth completion dataset, TOFDC, acquired using mobile time-of-flight sensors.
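The decomposition itself amounts to scattering the point cloud into three orthogonal 2D views that can be densified with 2D networks; a minimal occupancy-map version is sketched below (the real TPVD pipeline adds distance-aware spherical convolution and recurrent 2D-3D-2D fusion, which this omits).

```python
import numpy as np

def tri_perspective_views(points, grid=64, extent=10.0):
    """Scatter a 3D point cloud into three orthogonal 2D occupancy maps
    (top, front, side). Simplified illustration of view decomposition."""
    views = {}
    for name, (a, b) in {"top": (0, 1), "front": (0, 2), "side": (1, 2)}.items():
        img = np.zeros((grid, grid))
        idx = np.clip(((points[:, [a, b]] + extent) / (2 * extent) * grid).astype(int),
                      0, grid - 1)
        img[idx[:, 0], idx[:, 1]] = 1.0
        views[name] = img
    return views

pts = np.random.uniform(-10, 10, size=(5000, 3))
views = tri_perspective_views(pts)
print({k: v.sum() for k, v in views.items()})
```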
The paper presents "U NO," an unsupervised world model designed to predict 3D occupancy over time using unlabeled LiDAR data. Unlike traditional supervised approaches that rely on costly annotated data, U NO learns to forecast a continuous 4D occupancy field, capturing the geometry, dynamics, and semantics of environments critical for self-driving vehicles. The model demonstrates state-of-the-art performance in downstream tasks such as point cloud forecasting and birds-eye view semantic occupancy prediction, even outperforming fully supervised methods when labeled data is scarce. U NO's ability to generalize and effectively represent complex scenes enhances safety for self-driving applications, particularly for infrequent or vulnerable road users.
The paper presents URHand, a pioneering universal relightable hand model that effectively generalizes across various viewpoints, poses, illuminations, and identities using light-stage data. Unlike existing photorealistic models that require extensive identity-specific data, URHand allows for quick personalization from simple mobile phone scans. The model integrates a spatially varying linear lighting approach with a hybrid neural-physical rendering framework that enhances fidelity and generalizability. By addressing challenges in cross-identity training and maintaining photorealism during real-time rendering, URHand achieves significant improvements over prior methods, demonstrating its capability for rapid adaptation to new identities and dynamic lighting conditions.
This paper presents a novel method for generating multi-view optical illusions using off-the-shelf text-to-image diffusion models. The authors introduce the concept of "visual anagrams," which are images that transform their appearance through various operations such as flips, rotations, and jigsaw rearrangements. The proposed technique operates in a zero-shot manner, estimating noise from different views and combining these estimates to create images that maintain visual coherence across transformations. The study includes both theoretical analysis and empirical results, demonstrating the effectiveness and flexibility of the method in generating a range of classic and innovative optical illusions, while also identifying design considerations critical for optimizing illusion quality.
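The sampler's key step is easy to isolate: predict noise in each view, map each prediction back to the canonical orientation, and average. The sketch below uses 90-degree rotations as the views and a dummy noise predictor standing in for a real text-conditioned diffusion model; only the combination rule is meaningful here.

```python
import numpy as np

def dummy_denoiser(x, prompt):
    """Stand-in for a text-conditioned diffusion noise predictor (hypothetical)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return 0.1 * rng.standard_normal(x.shape) + 0.05 * x

def combined_noise_estimate(x, prompts):
    """Predict noise in each view, undo the view transform, and average.
    A real sampler plugs this estimate into DDIM/DDPM updates."""
    estimates = []
    for k, prompt in enumerate(prompts):
        view = np.rot90(x, k)                  # apply the view transform
        eps = dummy_denoiser(view, prompt)     # denoise in that view
        estimates.append(np.rot90(eps, -k))    # map back to the canonical frame
    return np.mean(estimates, axis=0)

x = np.random.randn(64, 64)
eps = combined_noise_estimate(x, ["an old man", "a waterfall"])
print(eps.shape)  # (64, 64)
```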
The paper introduces Visual Program Distillation (VPD), a framework that enhances vision-language models (VLMs) by leveraging large language models (LLMs) to generate and execute programs that solve complex visual reasoning tasks. VPD addresses the shortcomings of previous approaches by generating multiple candidate programs, executing them, and filtering for correctness to distill effective reasoning steps into VLMs. Experimental results demonstrate that models trained with VPD, specifically PaLI-X, outperform existing state-of-the-art VLMs on a variety of benchmarks, improving their abilities in counting, spatial reasoning, and consistency of answers. Additionally, VPD is shown to effectively adapt models to new tasks, even in the absence of labeled data, underscoring its potential for real-world applications.
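The generate-execute-filter loop at the heart of VPD can be sketched as follows, with placeholder candidate programs and a stub tool in place of the LLM-written code and real vision APIs; only programs whose executed answer matches the label are kept as distillation targets.

```python
def distill_traces(image, candidate_programs, label, tools):
    """Run candidate programs against vision tools and keep those whose
    answer matches the label. `candidate_programs` and `tools` are
    hypothetical stand-ins for LLM-generated code and real vision modules."""
    kept = []
    for prog in candidate_programs:
        scope = {"image": image, **tools}
        try:
            exec(prog, scope)                  # program must define `answer`
        except Exception:
            continue
        if scope.get("answer") == label:
            kept.append(prog)                  # correct program -> training target
    return kept

# Example: two candidates for "How many dogs are in the image?"
tools = {"count_objects": lambda image, name: 2 if name == "dog" else 0}
progs = [
    "answer = count_objects(image, 'dog')",
    "answer = count_objects(image, 'cat')",
]
print(len(distill_traces(None, progs, 2, tools)))  # 1
```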
The paper presents WALT3D, a novel framework for automatically generating realistic training data from time-lapse imagery to improve 2D and 3D object reconstruction under severe occlusions. It addresses the challenge of limited labeled datasets for occluded object understanding by utilizing off-the-shelf predictions as pseudo-ground-truth to create composite images that maintain realistic occlusion configurations. The method shows significant enhancements in both segmentation and shape reconstruction accuracy, particularly in urban environments where occlusions are common, and demonstrates scalability by eliminating the need for human labeling. Overall, WALT3D provides an efficient solution for training object reconstruction models in complex visual scenarios.