Generated on 2024-11-18 16:44:13 by PubSummarizer
The paper introduces the 360+x dataset, a pioneering multi-modal resource designed for comprehensive scene understanding from multiple perspectives, including panoramic, third-person, and egocentric views. It incorporates various data modalities such as video, multi-channel audio, directional binaural delay, location information, and textual descriptions, making it the first dataset to mimic human-like perception of environments. The authors conducted extensive benchmark analyses across five scene understanding tasks, revealing that performance improves significantly when the different viewpoints and modalities are integrated rather than used in isolation. The findings suggest that even self-supervised models trained on 360+x can outperform those trained with human annotations, underscoring the dataset's potential to advance research in scene understanding.
The paper introduces the Subspace-Constrained Tyler's Estimator (STE), a novel algorithm designed for robust subspace recovery in datasets plagued by outliers. Combining aspects of Tyler's M-estimator and fast median subspace techniques, STE effectively recovers low-dimensional subspaces even when the proportion of inliers is less than previously established theoretical thresholds. The authors validate STE through its application to Structure from Motion (SfM), focusing on robust fundamental matrix estimation and the removal of outlying cameras. Numerical experiments demonstrate STE's superior performance compared to existing methods, showcasing its potential to enhance robustness in computer vision tasks, particularly in 3D reconstruction scenarios.
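As context for how such an estimator operates, the sketch below implements the classic Tyler's M-estimator fixed-point iteration that STE builds on and reads the recovered subspace off the top eigenvectors of the resulting scatter matrix. It is a generic illustration, not the subspace-constrained iteration from the paper.

```python
import numpy as np

def tyler_m_estimator(X, n_iter=100, tol=1e-8):
    """X: (n_samples, d) centered data. Returns a (d, d) robust scatter estimate."""
    n, d = X.shape
    sigma = np.eye(d)
    for _ in range(n_iter):
        inv_sigma = np.linalg.inv(sigma)
        # Per-sample weights 1 / (x_i^T Sigma^{-1} x_i) down-weight outliers.
        w = 1.0 / np.einsum("ij,jk,ik->i", X, inv_sigma, X)
        sigma_new = (d / n) * (X * w[:, None]).T @ X
        sigma_new /= np.trace(sigma_new)          # remove the scale ambiguity
        if np.linalg.norm(sigma_new - sigma) < tol:
            return sigma_new
        sigma = sigma_new
    return sigma

def recover_subspace(X, dim):
    """Subspace spanned by the top-`dim` eigenvectors of the robust scatter."""
    eigvals, eigvecs = np.linalg.eigh(tyler_m_estimator(X))
    return eigvecs[:, np.argsort(eigvals)[::-1][:dim]]
```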
The paper presents "Alchemist," a novel method for manipulating material attributes such as roughness, metallicity, albedo, and transparency in real images using a modified text-to-image diffusion model. By addressing the scarcity of datasets with controlled material properties, the authors created a synthetic dataset featuring physically-based materials and fine-tuned a diffusion model on this data. The model allows for precise editing of material properties while preserving other image characteristics, offering alternatives to traditional rendering techniques that typically require extensive auxiliary information. Results demonstrate the model's effectiveness in editing real-world images and extend its application to Neural Radiance Fields (NeRF), showcasing its potential for various commercial applications in image editing and beyond.
This paper presents a novel N-point linear solver for line-based motion estimation using event cameras, which excel in high-speed and low-light conditions compared to traditional frame-based cameras. The authors introduce a new line parametrization that reduces the degrees of freedom from four to three, enabling a more efficient and numerically stable linear solver that can handle both minimal and overdetermined systems with more than five events. The proposed method showcases significant improvements in runtime—over 600 times faster than previous polynomial solvers—while maintaining high numerical stability and the ability to characterize degenerate cases. Additionally, a new velocity averaging scheme is introduced for efficiently fusing observations from multiple lines, enhancing the overall performance in both synthetic and real-world experiments, thereby demonstrating its suitability for modern mobile vision applications.
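The summary does not reproduce the paper's constraint matrix, but the linear-solver core of such methods is standard: stack one linear constraint per event and take the right singular vector associated with the smallest singular value. The sketch below shows only that generic step; `build_constraint_row` is a hypothetical placeholder for the paper's 3-DoF line parametrization.

```python
import numpy as np

def solve_homogeneous(A):
    """Unit vector x minimizing ||A x||: the right singular vector of the
    smallest singular value (handles both minimal and overdetermined A)."""
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

def estimate_from_events(events, build_constraint_row):
    # One row per event; with more events than unknowns this becomes a
    # least-squares fit, which is why the solver scales past the minimal case.
    A = np.stack([build_constraint_row(e) for e in events])
    return solve_homogeneous(A)
```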
This paper presents an analysis and improvement of the training dynamics of diffusion models, specifically focusing on the ADM architecture. The authors identify several issues leading to uneven training, such as uncontrolled drift in the magnitudes of network activations and weights. They propose modifications that standardize these magnitudes without altering the architecture's overall structure, resulting in enhanced performance, including a record FID score of 1.81 for ImageNet-512 synthesis. Additionally, they introduce a post-hoc method for adjusting the EMA profile after training, enabling precise tuning and revealing significant interactions between EMA settings and network configurations. The findings suggest that the improved architecture and EMA techniques can facilitate more effective training and quality control in generative image synthesis.
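To make the "standardized magnitudes" idea concrete, here is a minimal PyTorch sketch of a linear layer that re-normalizes each output unit's weight vector on every forward pass, so weight and activation magnitudes cannot drift during training. This is an illustrative simplification in the spirit of the paper's changes, not the exact EDM2 layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudePreservingLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # Each output unit's weight vector is forced to unit norm, so for
        # roughly unit-variance, uncorrelated inputs the output variance
        # stays near one regardless of how the raw parameters evolve.
        w = F.normalize(self.weight, dim=1)
        return F.linear(x, w)
```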
The paper introduces DisenDiff, a novel personalized text-to-image (T2I) model that enhances the generation of customized images by effectively capturing and disentangling multiple concepts from a single reference image. It addresses limitations of existing methods, which often compromise visual consistency and fail to separate concepts adequately. The authors propose an attention calibration mechanism that includes learnable modifiers for different concepts, along with constraints to improve attention mapping and reduce concept interference. Through extensive qualitative and quantitative evaluations, the proposed method outperforms current state-of-the-art techniques, demonstrating superior visual fidelity and editing flexibility while also being compatible with existing image enhancement frameworks like LoRA.
The paper presents a novel approach to Event Stream Super-Resolution (ESR) through the development of a Bilateral Event Mining and Complementary Network (BMCNet). This method distinguishes between positive and negative events in event streams, utilizing a two-stream architecture to process each event type individually while facilitating their interaction via a Bilateral Information Exchange (BIE) module. The BMCNet effectively captures and exchanges complementary spatial and temporal information, significantly improving performance in ESR by over 11% compared to previous state-of-the-art methods. Additionally, the proposed framework enhances downstream tasks such as object recognition and video reconstruction, demonstrating its versatility and effectiveness in processing event camera data.
This paper presents BIOCLIP, a vision foundation model that leverages a newly curated dataset, TREEOFLIFE-10M, which contains over 10 million images across 454,000 taxa, aimed at enhancing the application of computer vision in biological research and conservation. The authors argue that existing models are often tailored for specific tasks and lack the adaptability needed for general organismal biology questions. BIOCLIP employs a contrastive learning approach to learn hierarchical representations aligned with the biological taxonomy, demonstrating substantial improvements over existing models in both zero-shot and few-shot classification tasks. The results indicate that BIOCLIP not only excels in identifying known species but also generalizes effectively to unseen taxa, significantly lowering the barriers for biologists to utilize AI in their work. The paper highlights the importance of dataset diversity and the hierarchical structure of taxonomic labels in achieving strong performance in biological image classification.
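The hierarchical-label idea can be illustrated with a standard CLIP-style objective in which the text side is the flattened taxonomy string. The sketch below shows that symmetric contrastive loss over hypothetical encoder outputs; it is not the released BIOCLIP code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs,
    where each text is a flattened taxonomy string such as
    'Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corax'."""
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image->text and text->image matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```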
This paper investigates the decision-making mechanisms of visual recognition networks, specifically comparing Transformers and CNNs, using two novel methodologies: sub-explanation counting and cross-testing. The authors find that Transformers and ConvNeXt models exhibit greater compositionality, meaning they integrate multiple image parts for decisions, while traditional CNNs and distilled Transformers demonstrate disjunctive behaviors, relying on fewer parts for predictions. Key factors influencing these behaviors include the type of normalization used, with batch normalization leading to less compositionality compared to layer and group normalization. Additionally, cross-testing reveals that different network architectures utilize distinct visual features for classification, providing insights into their decision-making processes and suggesting directions for future model design.
This paper presents the correlation-aware multi-layer perceptron (CorrMLP), a novel approach for deformable medical image registration that aims to address the limitations of traditional transformers and convolutional neural networks (CNNs). While transformers have been effective in capturing long-range dependencies, their high computational demands restrict their application at full image resolutions, hampering fine-grained registration. In contrast, the CorrMLP utilizes a correlation-aware multi-window MLP block within a coarse-to-fine architecture, enabling efficient processing at full resolution and capturing local correlations vital for accurate registration. Extensive experiments on various medical datasets demonstrate that CorrMLP surpasses state-of-the-art methods in registration accuracy and transformation smoothness, highlighting the potential of MLPs in medical image registration tasks.
The paper presents CroSel, a novel approach to partial-label learning (PLL) that tackles the challenge of label ambiguity by selecting confident pseudo labels from a candidate set. CroSel utilizes a cross-selection strategy where two deep models exchange and refine their label predictions based on historical outputs, aiming to accurately identify true labels amidst noise. Additionally, it introduces a consistency regularization term called co-mix to mitigate sample waste and improve label selection accuracy. Empirical results demonstrate CroSel's effectiveness, achieving state-of-the-art performance on benchmark datasets, highlighting its ability to maintain high precision in label selection even under varying noise conditions.
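A simplified sketch of the cross-selection step is shown below: one model's recent predictions are restricted to the candidate label set, and a pseudo label is kept only when those predictions are confident and stable, to be used as supervision for the other model. Thresholds, the exact selection criteria, and the co-mix regularizer are simplified, and the names are illustrative rather than taken from the paper's code.

```python
import torch

def select_pseudo_labels(history_probs, candidate_mask, conf_thresh=0.9):
    """
    history_probs: (T, N, C) softmax outputs of one model over the last T epochs.
    candidate_mask: (N, C) binary mask of each sample's candidate label set.
    Returns (selected_idx, pseudo_labels) used to supervise the other model.
    """
    mean_probs = history_probs.mean(dim=0) * candidate_mask     # restrict to candidates
    conf, pred = mean_probs.max(dim=1)
    stable = (history_probs.argmax(dim=2) == pred).all(dim=0)   # same argmax every epoch
    keep = stable & (conf > conf_thresh)
    return keep.nonzero(as_tuple=True)[0], pred[keep]
```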
The paper introduces DART (Doppler-Aided Radar Tomography), a novel approach for synthesizing radar range-Doppler images using a data-driven, Neural Radiance Field-inspired method. By incorporating radar-specific physics into an implicit rendering pipeline, DART enables the synthesis of accurate radar images from various viewpoints without explicit scene modeling. The authors constructed a custom data collection platform and a novel radar dataset to validate DART's efficacy against existing methods, demonstrating that it consistently outperforms state-of-the-art techniques in generating high-quality tomographic images. The method leverages the Doppler effect to enhance the resolution of radar measurements and presents a framework for realistic radar simulations that could significantly benefit applications in localization, mapping, and recognition.
This paper introduces a novel evaluation metric for image downscaling algorithms, called Image Downscaling Assessment by Rate-Distortion (IDA-RD), which quantifies the distortion incurred during the downscaling process by leveraging rate-distortion theory. Unlike traditional image-based quality measures, IDA-RD employs a process-based approach that views downscaling and super-resolution as encoding and decoding operations, respectively. The authors demonstrate that effective downscaling algorithms preserve more detail, leading to less distortion when images are upscaled. They address the challenges of measuring distortion through the use of recent advancements in deep generative models, specifically Generative Adversarial Networks (GANs) and Normalizing Flows, enabling the evaluation of downscaled images without requiring ground truth low-resolution images. Extensive experiments validate the effectiveness of IDA-RD across various synthetic and real-world downscaling methods, highlighting its potential to fill a significant gap in image downscaling research.
This paper introduces Set Difference Captioning, a novel task aimed at automatically describing the differences between two sets of images, and proposes VisDiff, a method for this task. VisDiff is a two-stage approach involving a proposer that generates candidate descriptions from the image sets and a ranker that evaluates these descriptions for salience. The authors present VisDiffBench, a benchmark dataset with 187 paired image sets to evaluate the method's performance. The results demonstrate VisDiff's effectiveness in identifying nuanced differences across various domains, such as model comparisons and dataset analysis, underscoring its potential as a tool for generating human-interpretable insights in computer vision and machine learning applications.
This paper presents a novel method called Diffusion-FOF for reconstructing 3D models of clothed humans from single-view images, addressing challenges such as varying body shapes, poses, and detailed textures. The method involves predicting a back-view image using a style consistency constraint, extracting multi-scale features, and employing a diffusion-based Fourier occupancy field (FOF) model in the wavelet domain to enhance geometric accuracy. The approach effectively integrates information from both the reference and estimated back-view images, culminating in the generation of a textured human model. Experimental results demonstrate that this method surpasses existing state-of-the-art techniques in both geometric and texture reconstruction performance.
The paper introduces DiffusionLight, a novel technique for estimating lighting from a single input image by inpainting a chrome ball using a pre-trained diffusion model (Stable Diffusion XL). Traditional methods often rely on HDR panorama datasets, which limit their effectiveness in real-world scenarios due to dataset diversity constraints. In contrast, this approach leverages the extensive training of diffusion models on billions of images, enhancing light estimation in uncontrolled environments. Key innovations include an iterative inpainting algorithm to ensure high-quality chrome ball generation and a LoRA fine-tuning technique for exposure bracketing, allowing the production of HDR chrome balls. The method demonstrates superior performance against existing techniques across various benchmarks and generalizes well to in-the-wild images, revealing significant advantages in lighting estimation tasks.
This paper presents a novel approach to the inverse rendering problem, which aims to recover an object's material properties and the surrounding illumination using unintended shadows cast by unobserved occluders, such as the camera operator. The authors utilize differentiable Monte Carlo ray tracing to jointly estimate spatially-varying materials, environment illumination, and the shapes of occluders that inadvertently cast shadows. By leveraging these shadows as additional signals, the method improves the conditioning of the inverse rendering problem, enabling more accurate recovery of high-frequency illumination and material details, even in challenging scenarios with diffuse materials. The effectiveness of the approach is demonstrated through experiments on both synthetic and real-world captured data, indicating its potential for enhancing the quality of material and lighting estimations in realistic imaging conditions.
The paper introduces Ego-Exo4D, a large-scale, multimodal, and multiview video dataset designed to enhance the understanding of skilled human activities from both egocentric (first-person) and exocentric (third-person) perspectives. Captured from 740 participants across 13 cities, the dataset includes 1,286 hours of video featuring various activities like sports, music, and cooking, complemented by extensive annotations such as audio, eye gaze, and 3D point clouds. It aims to facilitate research in areas like skill learning, proficiency estimation, and cross-view translation through a set of benchmark tasks. The open-sourced resources are intended to foster advancements in AI's comprehension of human skills and promote novel applications in domains such as augmented reality and robotics.
The paper presents EgoGen, a novel synthetic data generation system designed for egocentric perception tasks, particularly in augmented reality applications. EgoGen addresses the challenge of simulating natural human movements from the perspective of head-mounted devices by utilizing a generative human motion synthesis model that incorporates egocentric visual inputs. This model employs collision-avoiding motion primitives and a two-stage reinforcement learning approach to create realistic and diverse human motions in dynamic environments. The system generates high-quality synthetic data with accurate ground truth annotations, enhancing performance in key tasks such as mapping, localization, camera tracking, and human mesh recovery from egocentric views. By providing a scalable and effective solution for creating egocentric training data, EgoGen aims to advance research in egocentric computer vision.
This paper presents EGTR (Extracting Graph from Transformer), a lightweight one-stage model for Scene Graph Generation (SGG) that efficiently extracts relationships between objects from the self-attention layers of the DETR (DEtection TRansformer) decoder. Unlike traditional two-stage models, EGTR leverages the inherent relationships learned during object detection, utilizing a novel adaptive smoothing technique to enhance multi-task learning for both object detection and relation extraction. Additionally, it introduces a connectivity prediction task to aid relation prediction. Experimental results on the Visual Genome and Open Images V6 datasets demonstrate that EGTR achieves superior object detection performance and comparable triplet detection accuracy while maintaining reduced model complexity and faster inference speeds.
The paper presents EscherNet, a novel multi-view conditioned diffusion model that facilitates scalable view synthesis by generating consistent target views from arbitrary camera poses based on a flexible number of reference views. EscherNet employs a unique camera positional encoding (CaPE) to enhance camera control and ensure consistency across generated views. Demonstrating remarkable scalability, it can produce over 100 target views simultaneously on a consumer-grade GPU while achieving state-of-the-art performance compared to existing models. By decoupling from scene-specific optimizations and enabling zero-shot novel view synthesis, EscherNet unifies single and multi-image 3D reconstruction tasks, paving the way for advancements in 3D vision architectures.
The paper presents EvDiG, a novel method for separating direct and global illumination components in images using a hybrid system of RGB and event cameras. By leveraging the high temporal resolution of event cameras, the proposed approach efficiently captures rapid illumination changes caused by moving shadows, significantly reducing data acquisition time. The method employs a two-stage neural network, EvSepNet, to refine coarse separation results and restore color information, addressing challenges such as noise and color loss inherent in event data. Experimental results demonstrate that EvDiG outperforms state-of-the-art methods in both indoor and outdoor scenes, achieving high-quality separation comparable to multi-frame techniques while maintaining a capture time equivalent to single-frame methods.
The paper introduces EventPS, a novel approach to real-time photometric stereo using event cameras, which significantly enhances data efficiency and speed compared to traditional frame-based methods. By leveraging the high temporal resolution and low bandwidth characteristics of event cameras, EventPS estimates surface normals through radiance changes induced by a continuously rotating light source. This method offers a robust solution for both Lambertian and non-Lambertian surfaces by integrating optimization-based and deep-learning techniques. Experimental results demonstrate that EventPS operates at over 30 frames per second while requiring only about 31% of the bandwidth of frame-based counterparts, making it suitable for high-speed and time-sensitive applications.
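For background, classical Lambertian photometric stereo recovers normals by least squares from intensities observed under known light directions; EventPS replaces those intensity measurements with event-triggered radiance changes under a rotating light, which is not reproduced in this generic sketch.

```python
import numpy as np

def lambertian_normals(intensities, light_dirs):
    """
    intensities: (M, P) pixel intensities for M known lights and P pixels.
    light_dirs:  (M, 3) unit light directions.
    Returns (P, 3) unit surface normals (Lambertian model I = L n).
    """
    # Least-squares solve L @ N = I for the albedo-scaled normals N of shape (3, P).
    N, _, _, _ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    normals = N / (np.linalg.norm(N, axis=0, keepdims=True) + 1e-12)
    return normals.T
```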
This paper investigates the visual shortcomings of multimodal large language models (LLMs), specifically focusing on their reliance on the CLIP model for visual understanding. The authors identify systematic failures in visual question answering capabilities across various state-of-the-art LLMs, including GPT-4V, using a newly constructed benchmark called the Multimodal Visual Patterns (MMVP). The study reveals that these models struggle with basic visual details, often performing worse than random guessing. The authors propose a Mixture of Features (MoF) approach that integrates vision-centric representations to enhance visual grounding abilities, demonstrating that improved visual understanding can be achieved without sacrificing instruction-following capabilities. The findings underscore the importance of developing more robust visual representation learning methods and suggest that current scaling efforts are insufficient to address fundamental limitations in visual perception among LLMs.
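A minimal sketch of one way to realize a mixture of features is shown below: tokens from a CLIP-style encoder and a vision-only encoder (e.g. a DINOv2-style model) are projected to the language model's width and interleaved along the sequence dimension. Dimensions, encoders, and the module name are placeholders rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    def __init__(self, clip_dim, dino_dim, llm_dim):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_dino = nn.Linear(dino_dim, llm_dim)

    def forward(self, clip_tokens, dino_tokens):
        """clip_tokens: (B, N, clip_dim), dino_tokens: (B, N, dino_dim)."""
        a = self.proj_clip(clip_tokens)
        b = self.proj_dino(dino_tokens)
        # Interleave token-wise: [a1, b1, a2, b2, ...] -> (B, 2N, llm_dim)
        return torch.stack([a, b], dim=2).flatten(1, 2)
```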
The paper introduces FineParser, an advanced framework for fine-grained spatio-temporal action parsing aimed at enhancing action quality assessment (AQA) in human-centric contexts, particularly in sports like diving. Traditional AQA methods struggle with credibility and interpretability due to their coarse understanding of actions, which FineParser addresses by integrating four key components: a spatial action parser, a temporal action parser, a static visual encoder, and fine-grained contrastive regression. This model focuses on human-centric foreground action representations, allowing for precise assessments by minimizing the relevance of distracting backgrounds. The authors also present the FineDiving-HM dataset, featuring detailed human-centric action masks to foster improved evaluation of action quality. Extensive experiments demonstrate that FineParser significantly outperforms existing methods, showcasing its potential as a baseline for future AQA tasks requiring fine-grained action understanding.
The paper presents Florence-2, an advanced vision foundation model designed for diverse computer vision and vision-language tasks through a unified, prompt-based representation. Unlike existing models, Florence-2 excels in performing various tasks with simple textual instructions by leveraging a large-scale dataset, FLD-5B, which comprises 5.4 billion annotations across 126 million images. This dataset was generated through an innovative iterative strategy combining automated image annotation and model refinement. The model utilizes a sequence-to-sequence architecture, allowing it to address complex tasks involving spatial hierarchy and semantic granularity without requiring task-specific modifications. Extensive evaluations demonstrate Florence-2's strong performance, achieving state-of-the-art results in zero-shot settings and after fine-tuning, showcasing its effectiveness as a versatile foundation model in the realm of artificial intelligence.
This paper presents FlowIE, a novel image enhancement framework that leverages conditioned rectified flow to efficiently enhance images affected by various degradations. Traditional methods often struggle with robustness and speed, especially under complex real-world conditions, whereas FlowIE constructs a many-to-one transport mapping that significantly reduces inference time—up to tenfold compared to existing diffusion-based approaches. By accurately estimating straight-line paths from low-quality to high-quality images, FlowIE effectively utilizes the rich generative knowledge from pre-trained diffusion models. Additionally, the introduction of mean value sampling enhances path estimation accuracy, leading to high-quality enhancement across tasks such as blind face restoration and super-resolution. Extensive experiments demonstrate FlowIE's competitive performance and versatility, establishing it as a promising solution for image restoration challenges.
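The few-step behavior of a rectified-flow formulation can be sketched as follows: training pairs lie on the straight interpolation x_t = (1 - t)·x0 + t·x1 with regression target x1 - x0, and inference integrates the predicted velocity with a handful of Euler steps. `velocity_net` is a hypothetical stand-in for FlowIE's conditioned network.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, x0, num_steps=4):
    """x0: degraded input image tensor; returns the enhanced estimate."""
    x = x0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = velocity_net(x, t)     # predicted velocity d x / d t at time t
        x = x + v * dt             # Euler step along the (assumed) straight path
    return x
```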
The paper introduces FMA-Net, a novel framework for joint video super-resolution and deblurring (VSRDB), which employs flow-guided dynamic filtering (FGDF) and iterative feature refinement with multi-attention (FRMA). The FGDF allows for precise estimation of motion-aware degradation and restoration kernels, enhancing the model's ability to handle large motions effectively. The FRMA iteratively refines features through a coarse-to-fine approach, utilizing a new temporal anchor loss to stabilize training. Extensive experiments demonstrate that FMA-Net outperforms state-of-the-art methods in both quantitative and qualitative assessments across various datasets, providing significant improvements in video restoration quality.
The paper presents FreeU, a novel method designed to enhance the performance of diffusion U-Nets, which are widely used in generative models for image and video synthesis. By analyzing the denoising process, the authors identify that the backbone of the U-Net is primarily responsible for denoising, while skip connections introduce high-frequency features. FreeU strategically re-weights these contributions through two scaling factors, significantly improving generation quality without requiring additional training or increasing computational costs. Extensive experiments demonstrate that FreeU can be easily integrated into existing diffusion models like Stable Diffusion and ModelScope, leading to superior image and video quality, thus showcasing its practical applicability in enhancing generative tasks.
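The core re-weighting is simple enough to sketch directly: amplify the backbone feature map and damp the skip features before they are fused in each decoder stage. The published method applies the backbone factor more selectively and modulates skip features in the Fourier domain, so the snippet below is a deliberately reduced illustration.

```python
import torch

def freeu_rescale(backbone_feat, skip_feat, b=1.2, s=0.9):
    """backbone_feat, skip_feat: (B, C, H, W) inputs to a U-Net decoder stage."""
    backbone_feat = backbone_feat * b   # strengthen the denoising backbone
    skip_feat = skip_feat * s           # damp high-frequency skip features
    return torch.cat([backbone_feat, skip_feat], dim=1)
```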
This paper presents a novel framework, From-SAM-to-CAMs (S2C), for Weakly Supervised Semantic Segmentation (WSSS), which enhances the quality of Class Activation Maps (CAMs) by leveraging the Segment Anything Model (SAM) during training rather than just inference. The S2C framework consists of two main components: SAM-Segment Contrasting (SSC) and a CAM-based Prompting Module (CPM). SSC uses SAM's segmentation capabilities to create prototypes that guide feature learning in the classifier, while CPM refines CAMs into class-specific segmentation masks, aggregating these into a unified self-supervision mechanism. The proposed method significantly outperforms existing WSSS approaches across multiple benchmarks, demonstrating a robust ability to produce high-quality semantic segmentation maps.
This paper presents a novel approach for generating realistic animations from a single RGB image by modeling scene motion through a learned generative prior using spectral volumes. The method captures dense, long-term pixel trajectories in the Fourier domain, allowing for the transformation of still images into seamlessly looping videos and interactive simulations that respond to user inputs. The authors employ a frequency-coordinated latent diffusion model to predict spectral volumes, which are then converted into a motion texture for rendering future frames. The results demonstrate significant improvements in animation quality and coherence compared to previous methods, facilitating applications such as slow-motion effects and dynamic image interactions.
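To illustrate the spectral-volume representation, the sketch below converts per-pixel Fourier coefficients into displacement trajectories with an inverse real FFT over time. Array shapes and normalization are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def spectral_volume_to_trajectories(coeffs, num_frames):
    """
    coeffs: (H, W, K, 2) complex array holding K low-frequency Fourier terms
            for the x- and y-displacement of every pixel.
    Returns (T, H, W, 2) real displacement fields for T = num_frames.
    """
    # Inverse real FFT over the time axis; missing high frequencies are
    # implicitly zero-padded, giving a smooth, seamlessly looping trajectory.
    traj = np.fft.irfft(coeffs, n=num_frames, axis=2)   # (H, W, T, 2)
    return np.transpose(traj, (2, 0, 1, 3))
```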
The paper presents GPLD3D, a novel latent diffusion model aimed at enhancing the geometric feasibility and physical stability of generated 3D shapes. Traditional generative models often fail to accurately represent critical shape properties, leading to disconnected and unstable synthetic shapes. GPLD3D addresses these issues by incorporating a quality checker that evaluates the geometric feasibility and physical stability of shapes during the diffusion process. This quality checker employs learned scoring functions to assess shapes, allowing for a principled adjustment of trade-off parameters in the model. Comprehensive experiments on the ShapeNet-v2 dataset demonstrate that GPLD3D significantly outperforms existing state-of-the-art shape generators in both qualitative and quantitative metrics, showcasing its effectiveness in producing high-quality synthetic 3D shapes.
This paper addresses the lack of a systematic analysis of grid-based models used for neural field representation by introducing a theoretical framework centered around grid tangent kernels (GTK). The authors demonstrate that GTKs are intrinsic properties that dictate the approximation and generalization behaviors of these models. Building on this theory, they present a novel grid-based model called the Multiplicative Fourier Adaptive Grid (MulFAGrid), which achieves superior generalization performance compared to existing models. Empirical results show that MulFAGrid excels in various tasks, including 2D image fitting, 3D signed distance field reconstruction, and novel view synthesis, indicating its robust representation capabilities and lower generalization bounds. The study offers insights that could guide the design of future grid-based models in computer vision and machine learning.
The paper presents the Image Processing GNN (IPG) model, which addresses the limitations of traditional Super-Resolution (SR) methods that rely on rigid pixel aggregation through CNNs and attention mechanisms. By leveraging the flexibility of graph structures, IPG adapts to the unbalanced nature of SR tasks, where detail-rich areas require more reconstruction effort. The model introduces a degree-varying graph construction that assigns higher connectivity to detail-rich pixels and utilizes both local and global sampling strategies for efficient information aggregation. Experimental results demonstrate that IPG outperforms state-of-the-art SR models across several datasets, showcasing its effectiveness in producing high-resolution images while maintaining computational efficiency.
This paper presents a novel method for improving semantic correspondence estimation by integrating weak geometric understanding through spherical mapping, aimed at addressing the limitations of current self-supervised models in recognizing object symmetries and repeated parts. By leveraging a weak 3D prior and coarse viewpoint information, the proposed approach enhances the discriminative capability of learned representations without requiring extensive 3D supervision. The authors also introduce a new evaluation metric, Keypoint Average Precision (KAP), which better accounts for symmetry-related errors compared to traditional metrics. Experiments demonstrate that the method significantly outperforms existing techniques on various datasets, showcasing its effectiveness in distinguishing between similar object parts across different views.
This paper investigates how data transformations, specifically random pixel permutation (RPP), can significantly speed up the training of neural fields, a data representation paradigm that requires extensive optimization. The authors find that RPP accelerates the training process by removing easily-fitted patterns that typically slow down the later stages of training, allowing the network to focus on capturing high-frequency details more effectively. Through empirical studies across various datasets and architectures, the authors demonstrate that RPP consistently reduces the number of stochastic gradient descent (SGD) steps needed to achieve desired fidelity levels. Their findings reveal that while RPP initially leads to slower fitting of moderate PSNR levels, it enables rapid convergence to high PSNR levels, thus offering insights into optimizing neural field training by leveraging the optimization bias inherent in SGD.
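The transformation itself is easy to state in code: shuffle which pixel value sits at which coordinate using one fixed permutation, fit the neural field to the shuffled target, and invert the permutation on the field's output. The sketch below shows only this data transformation, under the assumption that pixel values (not coordinates) are permuted; the training loop is omitted.

```python
import numpy as np

def make_permuted_target(image):
    """image: (H, W, C). Returns the shuffled image and the inverse permutation."""
    H, W, C = image.shape
    flat = image.reshape(-1, C)
    perm = np.random.permutation(H * W)    # one fixed random pixel permutation
    inv_perm = np.argsort(perm)
    return flat[perm].reshape(H, W, C), inv_perm

def unpermute(pred_image, inv_perm):
    """Undo the permutation on the fitted field's output to recover the layout."""
    H, W, C = pred_image.shape
    return pred_image.reshape(-1, C)[inv_perm].reshape(H, W, C)
```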
The paper introduces Instruct-Imagen, a novel image generation model designed to handle heterogeneous tasks by utilizing multi-modal instructions that integrate various modalities such as text, edges, and styles. The model employs a two-stage training approach, first adapting a pre-trained text-to-image diffusion model using retrieval-augmented training to enhance its capacity to ground generation on multi-modal context, followed by fine-tuning it on diverse image generation tasks. Instruct-Imagen demonstrates strong capabilities in understanding complex multi-modal instructions, achieving superior performance compared to prior state-of-the-art models in both in-domain and zero-shot tasks, effectively generalizing to unseen and complex image generation challenges. The research highlights the importance of multi-modal instruction in improving the model’s adaptability and generation accuracy across various tasks.
This paper introduces InternVL, a large-scale vision-language foundation model that scales the vision encoder to 6 billion parameters and aligns it with a language middleware to enhance performance on various visual-linguistic tasks. The model employs a progressive alignment strategy utilizing web-scale image-text data for efficient training, demonstrating state-of-the-art results across 32 benchmarks, including zero-shot image classification, video classification, and multi-modal dialogue systems. InternVL bridges the gap between vision and large language models, showcasing its versatility and effectiveness in handling generic visual-linguistic tasks through a robust design that integrates contrastive and generative learning approaches.
The paper presents a novel framework, LDP (Language-driven Dual-Pixel Image Defocus Deblurring Network), that leverages the contrastive language-image pre-training framework (CLIP) to estimate blur maps from dual-pixel (DP) image pairs without requiring additional data. The authors design specific text prompts to enable CLIP to understand and quantify blur-related geometric information, facilitating accurate blur map estimation. This estimated blur map is then utilized in a deblurring network featuring a blur-prior attention mechanism and specially formulated loss functions to restore sharp images from the DP pairs. The proposed method demonstrates state-of-the-art performance across multiple benchmark datasets, highlighting the effectiveness of combining language-driven approaches with low-level vision tasks.
This paper presents a novel approach to learning Structure-from-Motion (SfM) using Graph Attention Networks (GATs), aiming to improve the initialization process traditionally reliant on iterative minimization techniques like Bundle Adjustment (BA). The proposed model processes 2D keypoints across multiple views to predict camera poses and 3D coordinates without the need for scene-specific training or fine-tuning, significantly enhancing inference speed and generalization to unseen scenes. Experimental results demonstrate that the GAT-based method outperforms existing learning-based techniques and approaches the performance of conventional SfM methods like COLMAP, while substantially reducing runtime. Additionally, the model shows resilience to outliers through data augmentation and outlier injection strategies, suggesting a promising direction for future research in efficient and robust 3D reconstruction.
This paper proposes a novel visual localization method called DeViLoc, which enhances the accuracy of camera pose estimation in challenging scenarios like nighttime and adverse weather by generating reliable semi-dense 2D-3D correspondences. Unlike existing techniques that depend heavily on predefined 3D feature points, DeViLoc utilizes a Point Inference Network (PIN) to regress observed and unobserved 2D keypoints into 3D coordinates. This method effectively aggregates matching information through a Confidence-based Point Aggregation (CPA) module, significantly improving performance in noisy conditions. Comprehensive evaluations show that DeViLoc outperforms state-of-the-art methods across multiple datasets, demonstrating its robustness and adaptability in practical applications.
This paper introduces a novel task called weakly-supervised Narration-based Video Object Segmentation (NVOS), which aims to segment object instances mentioned in the narration of egocentric videos without requiring spatial annotations during training. The proposed framework, named ROSA, utilizes vision-language models to establish pixel-level alignments between referred objects and segmentation mask proposals. The authors address the challenges posed by cluttered scenes and object occlusions by employing a Global-Local Contrastive Learning approach, which combines video-narration alignment with region-phrase similarities. To evaluate their method, they create a new benchmark dataset called VISOR-NVOS, consisting of detailed object-based narrations linked to existing segmentation masks. The results demonstrate that ROSA achieves state-of-the-art zero-shot grounding performance on egocentric video datasets, showcasing its effectiveness in fine-grained video-language understanding.
The paper introduces a novel task called reasoning segmentation, which requires generating a binary segmentation mask from an implicit and complex text query related to an image. The authors propose LISA (Large Language Instructed Segmentation Assistant), a model that leverages the capabilities of multimodal large language models (LLMs) to handle such tasks effectively. LISA incorporates an additional <SEG> token into the vocabulary and follows an embedding-as-mask paradigm, decoding that token's embedding into a segmentation mask, which equips the model with segmentation ability while preserving its reasoning and conversational capabilities.
The paper presents LTGC, a novel framework for long-tail recognition that leverages the knowledge of large language models (LLMs) to generate diverse content for tail categories, addressing challenges such as data scarcity and class imbalance. LTGC employs a two-step process where it first analyzes existing tail data to create a description list and then extends this list using LLMs to generate new, diverse tail-class descriptions. These descriptions are transformed into images using a text-to-image model, and an iterative evaluation module ensures the quality of generated images. The framework incorporates the BalanceMix module to effectively fine-tune the model with both generated and original data, significantly improving performance on long-tail benchmarks compared to existing methods. Experimental results demonstrate that LTGC outperforms state-of-the-art techniques, showcasing its effectiveness in enhancing long-tail recognition tasks.
The paper introduces MicKey, a novel pipeline for estimating metric relative pose between two 2D images by predicting 3D keypoint coordinates directly from the images without requiring depth measurements or knowledge of image overlap. By leveraging a differentiable pose estimation framework, MicKey establishes metric correspondences in camera space through an end-to-end learning strategy that utilizes only relative pose supervision. The approach outperforms state-of-the-art methods in the Map-Free Relocalization benchmark while requiring less supervision, proving its effectiveness in applications requiring precise pose estimation for augmented reality.
This paper introduces MemSAM, a novel model designed for segmenting echocardiography videos by adapting the Segment Anything Model (SAM) to address challenges unique to medical video data, such as speckle noise, ambiguous boundaries, and variability of objects across frames. MemSAM employs a temporal-aware and noise-resilient prompting scheme using a space-time memory that captures both spatial and temporal information to enhance segmentation accuracy. The model incorporates a memory reinforcement mechanism to improve memory quality before updates, thereby mitigating the effects of noise and artifacts. Evaluations on two publicly available datasets demonstrate that MemSAM achieves state-of-the-art performance with limited annotations, comparable to fully supervised approaches, showcasing its potential for automated echocardiographic assessments.
The paper introduces MetaCloak, a novel approach to protect user images from unauthorized subject-driven text-to-image diffusion-based synthesis, addressing vulnerabilities in existing methods that are ineffective against data transformations. By employing a meta-learning framework and a transformation sampling process, MetaCloak generates robust, model-agnostic perturbations that effectively distort the semantic integrity of personalized images. Extensive experiments on datasets like VGGFace2 and CelebA-HQ demonstrate that MetaCloak outperforms previous methods, successfully fooling online training services and providing a strong defense against unauthorized use while maintaining a high level of robustness under various data transformations.
The paper presents Mip-Splatting, a novel enhancement of 3D Gaussian Splatting (3DGS) aimed at addressing aliasing artifacts encountered during image rendering at varying sampling rates. By introducing a 3D smoothing filter that constrains the maximum frequency of 3D Gaussian primitives based on the input views and replacing the traditional 2D dilation filter with a 2D Mip filter to better simulate physical imaging processes, the authors effectively eliminate high-frequency artifacts and improve rendering fidelity. Experimental results demonstrate that Mip-Splatting significantly outperforms existing methods, particularly in out-of-distribution scenarios, thereby enhancing generalization to different camera poses and zoom levels.
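A brief note on why the 3D smoothing filter stays inside the Gaussian framework (generic notation, with s standing for the low-pass variance the paper ties to the maximal sampling rate of the training views): convolving a Gaussian primitive with an isotropic Gaussian low-pass filter yields another Gaussian with enlarged covariance, so band-limiting can be absorbed directly into the primitive's parameters.

```latex
\mathcal{G}_{\mu_k,\,\Sigma_k} \ast \mathcal{G}_{0,\,sI} \;=\; \mathcal{G}_{\mu_k,\,\Sigma_k + sI},
\qquad
\tilde{G}_k(\mathbf{x}) \;\propto\;
\exp\!\Big(-\tfrac{1}{2}\,(\mathbf{x}-\mu_k)^{\top}(\Sigma_k + sI)^{-1}(\mathbf{x}-\mu_k)\Big)
```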
This paper presents a novel approach to simplify Vision Transformers by selectively removing non-essential attention layers based on entropy analysis. The authors argue that certain attention layers in lower blocks of the model carry less information and can be integrated into subsequent MLP layers without performance degradation. By employing an entropy-based selection strategy, termed NOSE, the method identifies which attention layers to prune in order to minimize the impact on overall model performance. Experimental results demonstrate that the proposed approach can reduce network parameters by 13.7% and improve throughput by 20.5% while maintaining competitive accuracy on tasks like image classification, showcasing its potential for enhancing computational efficiency in Vision Transformers.
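One way to realize the entropy-based selection is sketched below: compute the Shannon entropy of each attention distribution, average it over heads, queries, and a batch of images, and rank layers from least to most informative. The ranking function and any thresholds here are illustrative, not the paper's exact NOSE procedure.

```python
import torch

def attention_entropy(attn):
    """attn: (B, heads, Q, K) attention rows summing to 1. Returns a scalar."""
    eps = 1e-12
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy per query
    return ent.mean()

def rank_layers(attn_maps_per_layer):
    """attn_maps_per_layer: list of (B, heads, Q, K) tensors, one per layer.
    Returns layer indices sorted from lowest to highest average entropy."""
    scores = [attention_entropy(a).item() for a in attn_maps_per_layer]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```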
The paper presents the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, designed to assess the capabilities of large multimodal models (LMMs) in handling complex, college-level tasks that require expert knowledge across six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The benchmark comprises 11,500 questions featuring 30 heterogeneous image types and interleaved text, emphasizing advanced reasoning and perception aligned with expert-level performance. Evaluations of several models, including GPT-4V and Gemini, reveal significant challenges, with accuracies around 56% and 59%, respectively, underscoring the gap between current AI capabilities and expert-level reasoning. The MMMU aims to stimulate further research towards achieving expert artificial general intelligence by highlighting the complexities of multimodal understanding and the necessity for deep domain knowledge.
This paper addresses the complexities of modeling multimodal social interactions, particularly in multi-party environments, where both verbal and non-verbal cues are crucial for understanding social dynamics. The authors introduce three new tasks—speaking target identification, pronoun coreference resolution, and mentioned player prediction—within the context of social deduction games, accompanied by extensive dataset annotations. They propose a novel baseline model that utilizes densely aligned language-visual representations, allowing for a synchronized analysis of verbal utterances and corresponding visual features. Experimental results demonstrate the effectiveness of this approach, showcasing significant improvements over existing methods by capturing the intricate interactions among multiple speakers and their gestures. The paper contributes to the field by providing new tasks, a robust dataset, and a baseline model, thus facilitating further research in multimodal social interaction analysis.
The paper presents MonoHair, a novel framework for high-fidelity 3D hair modeling from monocular videos, addressing limitations of existing methods that require strict capture conditions or rely heavily on learned prior data. The approach consists of two main stages: the first involves precise exterior hair geometry reconstruction using a Patch-based Multi-View Optimization (PMVO) method, which integrates information from multiple views without prior data dependence. The second stage infers the interior hair structure using a data-driven multi-view reconstruction technique, enhancing accuracy by aligning with 2D structural renderings derived from the exterior geometry. Experimental results demonstrate that MonoHair robustly reconstructs diverse hairstyles, including curly hair, and achieves state-of-the-art performance with significant efficiency improvements over previous methods.
The paper introduces MultiPly, a novel framework designed to reconstruct multiple individuals in 3D from monocular videos captured in real-world settings. The framework addresses significant challenges such as occlusions and close human interactions, which complicate the accurate 3D modeling of multiple subjects. MultiPly employs a layered neural representation to separate individual human models and background, utilizing layer-wise differentiable volume rendering to learn from video data. Additionally, it incorporates a hybrid instance segmentation approach and a confidence-guided optimization strategy to ensure high fidelity and temporal consistency in the reconstructions. Evaluation results demonstrate MultiPly's superiority over existing methods in various tasks, including human reconstruction and pose estimation, particularly in complex scenarios with severe occlusions.
The paper presents a novel approach called Heuristics-Guided Segmentation (HuGS) to enhance Neural Radiance Fields (NeRF) in non-static scenes, addressing the challenges posed by transient distractors like moving objects and shadows. The proposed method combines hand-crafted heuristics with advanced segmentation techniques to effectively differentiate static elements from transient ones, improving the quality of 3D scene reconstruction. By integrating Structure-from-Motion (SfM)-based heuristics and color residual heuristics, HuGS achieves accurate static vs. transient separations across diverse textures. Extensive experiments demonstrate its robustness and superiority over existing methods in mitigating artifacts in NeRF trained on non-static scenes, showcasing significant improvements in view synthesis quality.
Paper URL: https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_Neural_Lineage_CVPR_2024_paper.pdf
This paper introduces a novel task called neural lineage detection, which aims to identify the parent-child relationships between neural network models based on fine-tuning processes. Two methodologies are proposed: a learning-free approach that integrates an approximation of fine-tuning into similarity metrics for lineage detection and a learning-based method utilizing a transformer architecture for enhanced accuracy. Experimental results demonstrate that both methods outperform existing baseline techniques across various model architectures and tasks, including classification, segmentation, and detection, while also effectively tracing cross-generational lineage. The paper highlights the significance of understanding model relationships for applications in model reuse, intellectual property protection, and accountability in deep learning.
The paper "Neural Redshift: Random Networks are not Random Functions" investigates the generalization capabilities of neural networks (NNs) by examining untrained, random-weight networks to identify inductive biases independent of gradient descent. The authors find that architectures, particularly those using ReLU activations, exhibit strong biases toward low-complexity functions, which align with real-world data patterns. This phenomenon, termed "Neural Redshift," suggests that the effectiveness of NNs is not an inherent property but rather a result of suitable architectural choices. The study provides a fresh perspective on deep learning's success, emphasizing the importance of understanding the complexities of neural network architectures and their implications for training and generalization across various tasks.
This paper presents NoiseCLR, an unsupervised contrastive learning framework designed to discover interpretable directions in text-to-image diffusion models, specifically targeting Stable Diffusion. Unlike existing methods that rely on textual prompts or labeled data, NoiseCLR identifies semantically meaningful directions using a small set of unlabeled images from various domains such as faces, cats, and art. The approach facilitates highly disentangled image edits, allowing for simultaneous modifications within a single domain or across multiple domains without interference. Extensive experiments demonstrate that NoiseCLR outperforms existing diffusion-based and GAN-based image editing techniques, enhancing both control and transparency in the generative process while addressing potential biases inherent in large models.
This paper presents a theoretical framework for modeling opaque solids as volumetric entities using stochastic geometry. The authors establish conditions under which opaque solids can be represented via exponential volumetric transport and derive expressions for the volumetric attenuation coefficient based on probability distributions of underlying indicator functions. The study extends traditional volumetric representations to accommodate both isotropic and anisotropic scattering behaviors and introduces stochastic implicit surface representations. By rigorously deriving their model from first principles, the authors ensure compliance with physical constraints such as reciprocity and reversibility. Experimental results demonstrate significant improvements in 3D reconstruction tasks when employing their proposed volumetric representation compared to previous methods. The findings suggest a robust foundation for future research in volumetric modeling of opaque solids in computer graphics and applied physics.
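For reference, the exponential volumetric transport that the paper derives conditions for takes the standard transmittance form below (generic notation; the paper's contribution is deriving the attenuation coefficient from probability distributions over the solid's indicator function, which is not reproduced here).

```latex
T(\mathbf{x}, \boldsymbol{\omega}, s) \;=\;
\exp\!\Big(-\int_{0}^{s} \sigma_t\big(\mathbf{x} + t\,\boldsymbol{\omega}\big)\,\mathrm{d}t\Big)
```

Here T is the probability that light travels a distance s from point x along direction ω without attenuation, and σ_t is the volumetric attenuation coefficient.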
The paper introduces Panoptic Scene Completion (PSC), an advanced method for understanding 3D scenes by integrating geometry, semantics, and instance-level predictions from sparse input data. The proposed technique, PaSCo, employs a hybrid mask-based approach leveraging a multi-input multi-output (MIMO) strategy to enhance performance and provide uncertainty estimates at both voxel and instance levels, crucial for applications in robotics and autonomous driving. Experimental results demonstrate that PaSCo surpasses existing methods in both PSC accuracy and uncertainty estimation across three large-scale urban datasets, showcasing its effectiveness and efficiency in completing and interpreting 3D scenes.
The paper presents pixelSplat, a novel feed-forward model for 3D reconstruction that utilizes pairs of images to learn and generate 3D radiance fields defined by 3D Gaussian primitives. This approach achieves real-time and memory-efficient rendering while overcoming local minima issues associated with sparse representations by predicting a dense probability distribution for Gaussian means. The model employs a differentiable sampling technique that allows gradient backpropagation through the Gaussian representation, leading to significant performance improvements over existing methods, particularly in terms of rendering speed and resource efficiency. Extensive benchmarking on datasets like RealEstate10k and ACID demonstrates that pixelSplat outperforms state-of-the-art light field transformers, producing interpretable and editable 3D representations while drastically reducing training and inference costs.
The paper presents PlatoNeRF, a novel approach for 3D scene reconstruction from a single view using two-bounce signals captured by single-photon lidar. Unlike traditional NeRF methods that rely on multiple views, PlatoNeRF leverages time-of-flight data to accurately model occluded geometry without relying on data priors or controlled lighting conditions. The method combines the strengths of neural radiance fields and two-bounce lidar data, enabling it to reconstruct both visible and hidden geometry effectively. Experimental results demonstrate that PlatoNeRF outperforms existing methods in terms of accuracy and robustness, particularly under varying sensor constraints and scene properties, making it a promising solution for applications in autonomous systems and extended reality.
The paper introduces Point Transformer V3 (PTv3), a novel model designed to enhance point cloud processing by prioritizing simplicity and efficiency over complex design elements. PTv3 significantly expands the receptive field from 16 to 1024 points, achieving a 3.3 times increase in processing speed and a 10.2 times reduction in memory consumption compared to its predecessor, PTv2. Utilizing a serialization-based approach, PTv3 effectively organizes unstructured point clouds, allowing for improved performance in over 20 downstream tasks across both indoor and outdoor scenarios. The model demonstrates state-of-the-art results in semantic segmentation and object detection while maintaining low latency, making it suitable for real-time applications. Overall, PTv3 embodies a shift towards scalability in model design, emphasizing the importance of efficient processing in 3D perception tasks.
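Serialization can be illustrated with the Z-order (Morton) curve: voxel coordinates are bit-interleaved into a single code and points are sorted by it, so spatial neighbors tend to stay adjacent in the resulting 1D sequence. PTv3 supports several space-filling curves; only this generic variant is sketched, with an assumed voxel size.

```python
import numpy as np

def morton_code(grid_xyz, bits=10):
    """grid_xyz: (N, 3) non-negative integer voxel coordinates (< 2**bits)."""
    code = np.zeros(len(grid_xyz), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid_xyz[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)    # interleave bits: ...z1y1x1 z0y0x0
    return code

def serialize(points, voxel_size=0.05):
    """Sort raw points into a 1D sequence by their Morton code."""
    grid = np.floor(points / voxel_size).astype(np.int64)
    grid -= grid.min(axis=0)                 # shift to non-negative coordinates
    order = np.argsort(morton_code(grid))
    return points[order], order
```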
The paper discusses enhancements to online high-definition (HD) map estimation methods for autonomous vehicles (AVs) by incorporating uncertainty estimates, which are critical for improving trajectory prediction. Traditional mapping approaches lack confidence measures, causing potential errors in downstream tasks. The authors extend existing state-of-the-art methods to output uncertainty alongside map data, demonstrating that this integration leads to a 50% faster training convergence and a 15% improvement in prediction accuracy, as validated on the nuScenes dataset. The proposed framework captures various sources of uncertainty, such as occlusions and sensor range, and shows significant improvements in the performance of trajectory prediction models when leveraging this enhanced mapping information.
The paper presents Ranni, an innovative framework designed to enhance text-to-image (T2I) diffusion models' ability to interpret and respond to complex prompts. Ranni leverages a semantic panel, a structured middleware that organizes visual concepts parsed from input text using large language models (LLMs). This approach addresses common challenges in T2I synthesis, such as object quantity, attribute binding, and spatial relationships. By dividing the generation process into text-to-panel and panel-to-image tasks, Ranni improves textual controllability and facilitates intuitive image editing through user-friendly operations. The framework supports both manual and LLM-assisted editing, demonstrating significant advancements in prompt following accuracy, interactive generation, and continuous refinement of images based on user instructions. Overall, Ranni represents a notable step forward in creating flexible, chat-based image generation systems.
The paper presents Relightable Gaussian Codec Avatars, a novel approach for creating high-fidelity, relightable 3D head avatars that can be animated in real-time using 3D Gaussians and learnable radiance transfer. This method addresses the challenges of accurately modeling complex human head materials, including skin, hair, and eyes, by employing a unified appearance model that supports all-frequency reflections and diverse materials. The use of 3D Gaussians allows for detailed geometric representations, particularly of intricate structures like hair, while the learnable radiance transfer facilitates efficient relighting under various illumination conditions. The approach excels in real-time performance, demonstrating significant improvements over existing methods in terms of both visual fidelity and computational efficiency, particularly in applications such as gaming and telecommunication.
This paper introduces Marigold, a novel method for monocular depth estimation that repurposes the capabilities of diffusion-based image generators, specifically leveraging the Stable Diffusion model. By fine-tuning this model with synthetic data, Marigold achieves state-of-the-art performance in depth estimation across various datasets, even in zero-shot scenarios where it encounters unfamiliar content. The approach emphasizes utilizing the extensive visual knowledge embedded in generative models to enhance generalizability and accuracy in depth estimation tasks. The results demonstrate significant improvements in depth estimation quality, highlighting the potential of combining generative modeling techniques with depth estimation frameworks.
This paper addresses the challenges in surface normal estimation from single RGB images by proposing new inductive biases tailored for this task. The authors suggest utilizing per-pixel ray direction and modeling the relative rotational relationships between neighboring surface normals. Their method incorporates these biases into a deep learning framework, allowing for piecewise smooth predictions that maintain detail, even in complex, out-of-distribution images. The results demonstrate that their approach outperforms state-of-the-art methods, particularly in terms of generalization ability, despite being trained on a substantially smaller dataset. The model's architecture is designed to work with images of arbitrary resolution and aspect ratio, making it suitable for various computer vision applications.
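The per-pixel ray-direction bias is straightforward to compute for a pinhole camera: each pixel's viewing ray is the normalized back-projection of its homogeneous coordinate through the inverse intrinsics. The sketch below shows only this computation; how the directions are injected into the network follows the paper and is not shown.

```python
import numpy as np

def pixel_ray_directions(H, W, K):
    """K: (3, 3) pinhole intrinsics. Returns (H, W, 3) unit ray directions."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)   # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)             # homogeneous coords
    rays = pix @ np.linalg.inv(K).T                              # back-project K^{-1} [u, v, 1]^T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```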
The paper presents the Retrieval-Augmented Layout Transformer (RALF), a novel approach for content-aware layout generation that addresses the limitations of existing methods due to data scarcity. By incorporating retrieval augmentation, RALF enhances the layout generation process by retrieving similar layout examples based on input images and integrating these references into an autoregressive model. The model demonstrates superior performance in generating high-quality layouts across various tasks, achieving significant improvements over baseline methods with less training data required. Extensive experiments validate RALF's capability to produce aesthetically pleasing and contextually appropriate layouts, making it a promising tool for graphic design applications.
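The retrieval step itself can be as simple as a nearest-neighbour lookup in an image-feature space; the sketch below shows that idea with cosine similarity over precomputed features (the feature extractor and layout store are placeholders, not RALF's actual retrieval module).

```python
import numpy as np

def retrieve_layouts(query_feat, gallery_feats, gallery_layouts, k=3):
    """Return the k stored layouts whose canvas features are most similar
    (cosine similarity) to the query canvas; these references are then fed
    to the autoregressive layout generator."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    top = np.argsort(-(g @ q))[:k]
    return [gallery_layouts[i] for i in top]

rng = np.random.default_rng(0)
gallery_feats = rng.standard_normal((500, 512))
gallery_layouts = [f"layout_{i}" for i in range(500)]
refs = retrieve_layouts(rng.standard_normal(512), gallery_feats, gallery_layouts)
print(refs)
```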
This paper presents a novel approach to enhancing text-to-image (T2I) generation by introducing a rich human feedback dataset (RichHF-18K) composed of 18,000 annotated images. The dataset includes detailed annotations marking implausible regions and misaligned text prompts, alongside fine-grained scores for various quality aspects (plausibility, alignment, aesthetics, and overall quality). The authors develop a multimodal transformer model, Rich Automatic Human Feedback (RAHF), that predicts these detailed feedback annotations, and they show the predicted feedback can improve generation quality, for example by fine-tuning generators on high-scoring samples or by inpainting predicted problem regions. The study reveals that the predicted feedback enhances generative models like Muse, showcasing the potential of rich human feedback in refining T2I outputs and setting a foundation for future research in this area.
The paper presents RoHM, a robust method for 3D human motion reconstruction from monocular RGB(-D) videos, specifically designed to handle noisy and occluded inputs. Unlike previous methods that either directly regress 3D motion or utilize time-consuming optimization techniques, RoHM leverages diffusion models to iteratively denoise and infill motion data, achieving globally coherent motion representation. The approach is structured around two separate models for global trajectory and local motion, enhanced by a flexible conditioning module to capture their interdependencies. Extensive experiments demonstrate that RoHM significantly outperforms state-of-the-art techniques in accuracy and physical plausibility, while also being 30 times faster during inference. The method's versatility is validated across diverse datasets, making it a promising advancement in the field of human motion reconstruction.
This paper presents S2MAE, a specialized pre-trained model designed for spectral remote sensing (RS) data, addressing the inadequacies of existing models that primarily focus on RGB imagery. S2MAE utilizes a 3D transformer architecture with a 90% masking ratio to capture local spectral consistency and spatial invariance, effectively leveraging large unlabeled spectral datasets through progressive pretraining. The model's efficacy is validated across three downstream tasks, demonstrating superior performance in single and multi-label classification as well as change detection, significantly outperforming existing methods. Through extensive ablation studies, the research highlights the importance of a high masking ratio and the need for tailored masking strategies to enhance representation learning in spectral imagery.
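The core of such masked-autoencoder pretraining is randomly hiding most patch tokens and reconstructing them; the sketch below shows a generic 90%-ratio random masking step (a standard MAE-style routine, not S2MAE's exact code).

```python
import torch

def random_masking(tokens, mask_ratio=0.9):
    """MAE-style random masking: keep only a small visible subset of the
    patch tokens; the decoder later reconstructs the masked ones."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)   # 0 = kept, 1 = masked
    return visible, mask, ids_keep

tokens = torch.randn(2, 196, 768)     # e.g. spatio-spectral patch embeddings
visible, mask, ids = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))
```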
This paper introduces SAFDNet, a novel fully sparse adaptive feature diffusion network designed for LiDAR-based 3D object detection, which addresses the computational inefficiencies associated with dense feature maps in existing models. SAFDNet employs an adaptive feature diffusion strategy to mitigate the center feature missing problem prevalent in fully sparse detectors. Experimental results demonstrate that SAFDNet outperforms previous state-of-the-art methods, achieving superior accuracy on long-range detection tasks, particularly on the Argoverse2 dataset, while also maintaining faster inference speeds compared to hybrid and other sparse detectors. The architecture's design allows for straightforward adaptation to various scenarios, making it a promising approach for enhancing performance in autonomous driving applications.
The paper introduces SceneFun3D, a novel large-scale dataset that provides over 14,800 fine-grained functional interaction annotations across 710 high-resolution 3D indoor scenes. This dataset includes detailed motion parameters and diverse natural language task descriptions, aimed at enhancing 3D scene understanding beyond traditional semantic segmentation. The authors define three new tasks—functionality segmentation, task-driven affordance grounding, and 3D motion estimation—to evaluate model performance. Their findings indicate that existing methods struggle with accurately detecting and interacting with functional elements in real-world scenarios, highlighting the need for improved understanding of affordances and interaction dynamics in 3D environments.
This paper presents a novel approach to reconstructing a radiance field of a scene observed by a person using only the reflections from their eyes in a sequence of images taken from a stationary camera. The authors address challenges such as accurately estimating eye poses and separating the complex iris textures from scene reflections. They introduce a method that optimizes cornea poses and incorporates a regularization prior for iris texture to enhance the quality of scene reconstruction. The approach is validated through extensive experiments on both synthetic and real-world datasets, demonstrating its effectiveness in recovering detailed 3D scenes from eye reflections, even under various lighting conditions.
The paper presents SHERT, a novel framework for semantic human mesh reconstruction that effectively combines geometric detail and texture generation from monocular images. SHERT addresses challenges faced by existing methods, such as unstable results and low-quality meshes, by employing a pipeline that includes semantic- and normal-based sampling, self-supervised mesh completion and refinement, and a diffusion model for texture generation driven by both images and text prompts. The framework ensures high-quality triangle meshes, stable UV unwrapping, and the ability to animate and substitute different body parts, demonstrating superior performance compared to state-of-the-art techniques in both quantitative and qualitative experiments.
This paper introduces Recursive Specularity Factorization (RSF), a novel technique for low-light image enhancement that decomposes images into multiple additive specular components, allowing for controllable image relighting and improved enhancement tasks. By employing a model-driven RSFNet, the authors achieve zero-reference low-light enhancement without the need for paired training data, demonstrating superior performance on various benchmarks. The RSF method utilizes recursive estimation of sparsity thresholds to effectively separate specular factors, which can be used not only for low-light enhancement but also for other applications like dehazing, deraining, and deblurring. The results indicate that RSFNet outperforms existing state-of-the-art methods, showcasing high generalizability across diverse datasets.
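A toy version of the recursive peeling idea is sketched below: each pass thresholds the current residual and splits off a sparse bright component, so the image is re-expressed as a base layer plus additive specular factors. The fixed percentile threshold is an assumption for illustration; RSFNet estimates its sparsity thresholds in a model-driven, learned way.

```python
import numpy as np

def recursive_specular_factors(img, num_factors=4, q=90):
    """Recursively peel off sparse bright components above a percentile
    threshold. Illustrative only; not the paper's learned factorization."""
    residual = img.astype(np.float32).copy()
    factors = []
    for _ in range(num_factors):
        thresh = np.percentile(residual, q)
        factor = np.clip(residual - thresh, 0.0, None)  # sparse specular part
        factors.append(factor)
        residual = residual - factor
    return factors, residual   # img == residual + sum(factors)

img = np.random.rand(64, 64)
factors, base = recursive_specular_factors(img)
print(len(factors), np.allclose(img, base + sum(factors)))
```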
The paper "SpiderMatch" presents a novel approach for 3D shape matching that achieves global optimality and geometric consistency by utilizing a new representation called SpiderCurve, which is a self-intersecting curve tracing the surface of a 3D shape. The authors tackle the 3D shape matching problem by formulating it as an integer linear programming (ILP) problem, and they introduce constraints to maintain geometric consistency during the matching process. Their method is evaluated against existing state-of-the-art approaches and demonstrates competitive performance while ensuring that matches are both geometrically consistent and optimal. The experimental results indicate that their approach significantly outperforms previous methods, particularly in preserving neighborhood relationships between shape elements.
The paper introduces a novel framework called "steerers" designed for rotation equivariant keypoint descriptors, which enhances the robustness of learned image descriptors against large rotations while maintaining performance on upright images. Traditional learned descriptors struggle with rotation invariance, leading to either a loss of discriminative power or increased computational costs through test-time augmentation. The steerers function as linear transforms that adjust keypoint descriptors to simulate the effects of image rotation without needing to reprocess the images. The authors explore three optimization settings for steerers—fixing the descriptor, jointly optimizing both descriptor and steerer, and optimizing the descriptor while fixing the steerer—and demonstrate that their approach achieves state-of-the-art results on benchmarks like AIMS and Roto-360, while also performing competitively on non-rotated images in MegaDepth. The paper contributes to the theoretical understanding of steerers and their practical applications in improving image matching tasks in various domains, including 3D reconstruction and space applications.
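The mechanism is easy to state in code: a steerer is a matrix S such that the descriptor of a rotated image is approximately S applied to the original descriptor, so rotation is handled at matching time by powers of S rather than by re-describing rotated images. Below is a toy numpy sketch with a fixed block-diagonal 90-degree steerer and synthetic descriptors; the steerers in the paper are learned, not hand-built.

```python
import numpy as np

def steerer_90(dim):
    """Fixed block-diagonal steerer for 90-degree rotations (toy stand-in
    for a learned linear steerer; dim must be even)."""
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(dim // 2), block)

rng = np.random.default_rng(0)
dim, n = 8, 50
S = steerer_90(dim)
desc_a = rng.standard_normal((n, dim))
# Simulate descriptors of the same keypoints extracted from a 90deg-rotated image.
desc_b = desc_a @ S.T + 0.01 * rng.standard_normal((n, dim))

# At match time, steer desc_a by S^k (k = 0..3) and keep the best-scoring k,
# instead of re-running the descriptor network on rotated copies of the image.
best_k = max(range(4),
             key=lambda k: np.sum((desc_a @ np.linalg.matrix_power(S, k).T) * desc_b))
print(best_k)  # expected: 1
```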
The paper presents a novel method called Stratified Avatar Generation (SAGE) for reconstructing 3D full-body avatars from sparse observations, primarily using input from head-mounted devices that track only the head and hands. The authors address the challenges of accurately predicting lower body movements from limited data by employing a two-stage approach that first reconstructs the upper body and subsequently infers the lower body conditioned on the upper body’s output. Utilizing a disentangled body representation based on the Skinned Multi-Person Linear (SMPL) model, the method incorporates a latent diffusion model to represent and generate motion sequences more effectively. Extensive experiments demonstrate that SAGE outperforms existing state-of-the-art methods, particularly in lower-body motion estimation, showcasing its potential for enhancing immersive experiences in AR/VR applications.
The paper presents StyleAligned, a new method for ensuring style consistency in images generated by large-scale Text-to-Image (T2I) models, which traditionally struggle to maintain uniform style across outputs. StyleAligned utilizes minimal attention sharing during the diffusion process, allowing generated images to adhere to a reference style without requiring optimization or fine-tuning. The technique was evaluated against various styles and prompts, demonstrating superior style consistency and visual coherence compared to existing methods, while also maintaining high-quality image synthesis. By leveraging adaptive normalization in attention layers, StyleAligned effectively balances diversity and style adherence, paving the way for practical applications in creative domains.
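The attention-sharing step can be sketched in a few lines: while generating each image in a batch, its queries also attend to the keys and values of a designated reference image. The single-head toy below shows only that concatenation; the actual method additionally applies adaptive normalization to queries and keys, which is omitted here.

```python
import torch

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Target queries attend over the target's own keys/values concatenated
    with the reference image's, so the output inherits the reference style.
    Simplified single-head sketch, not the exact StyleAligned implementation."""
    k = torch.cat([k_tgt, k_ref], dim=1)   # (B, N_tgt + N_ref, D)
    v = torch.cat([v_tgt, v_ref], dim=1)
    attn = torch.softmax(q_tgt @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

B, N, D = 1, 16, 64
q, kt, vt = (torch.randn(B, N, D) for _ in range(3))
kr, vr = (torch.randn(B, N, D) for _ in range(2))
out = shared_attention(q, kt, vt, kr, vr)
print(out.shape)  # torch.Size([1, 16, 64])
```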
This paper presents a novel framework called Constrained Empirical Risk Minimization (CERM) for optimizing wavelets within deep learning architectures, specifically Convolutional Neural Networks (CNNs). The authors tackle the challenge of enforcing strict structural constraints on the network's convolutional filters to ensure they conform to wavelet properties, which is critical in applications like medical imaging, where accurate contour prediction is paramount. By using CERM, the filters are optimized to become task-specific wavelets during training, addressing limitations of traditional loss function-based constraints. Empirical evaluations demonstrate that the proposed wavelet networks significantly outperform baseline methods in contour prediction tasks on medical datasets, showcasing their efficacy in leveraging wavelet properties for enhanced performance in specialized applications.
This paper presents a novel method called Action Segmentation Optimal Transport (ASOT) for unsupervised action segmentation in long, untrimmed videos, focusing on achieving temporal consistency without requiring prior knowledge of action order. By formulating the task as a fused unbalanced Gromov-Wasserstein optimal transport problem, ASOT effectively decodes segmentations from a noisy affinity cost matrix between video frames and action classes. The method is evaluated across various datasets, including Breakfast and YouTube Instructions, demonstrating state-of-the-art results in unsupervised learning settings. ASOT also serves as a post-processing tool that enhances the performance of existing supervised methods. The approach addresses limitations of previous methods that enforce balanced assignments and rely on known action orderings, thus providing a more flexible and robust framework for action segmentation tasks.
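As a simplified stand-in for the fused unbalanced Gromov-Wasserstein formulation, the sketch below runs plain entropic (balanced) Sinkhorn between frames and action classes on a random cost matrix and reads a segmentation off the transport plan; it conveys only the decoding idea and omits the structural and unbalanced terms that make ASOT work.

```python
import numpy as np

def sinkhorn_segmentation(cost, n_iter=100, eps=0.1):
    """Entropic OT between video frames and action classes; the per-frame
    label is the argmax column of the resulting transport plan."""
    T, K = cost.shape
    kernel = np.exp(-cost / eps)
    a, b = np.ones(T) / T, np.ones(K) / K
    v = np.ones(K) / K
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = np.diag(u) @ kernel @ np.diag(v)
    return plan.argmax(axis=1)

cost = np.random.rand(200, 5)   # noisy frame-to-action affinity costs
labels = sinkhorn_segmentation(cost)
print(labels.shape, labels.min(), labels.max())
```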
This paper presents a novel approach for low-light image enhancement (LIE) using a new large-scale dataset called SDE, which contains over 30,000 spatially and temporally aligned pairs of images and events captured under varying illumination conditions. The authors developed a robotic system to ensure high precision in data collection and introduced an event-guided LIE framework named EvLight that integrates features from both images and events. Their method employs a signal-to-noise ratio (SNR)-guided strategy for selective feature fusion, enhancing robustness against illumination variations and noise. Experimental results demonstrate that EvLight significantly outperforms existing frame-based and event-guided methods, highlighting its effectiveness in improving low-light image quality.
This paper presents TANGLE, a novel framework that employs transcriptomics-guided slide representation learning to improve the processing and classification of giga-pixel whole-slide images (WSIs) in computational pathology. By leveraging gene expression profiles alongside histology slides, TANGLE utilizes multimodal pre-training to create robust slide embeddings that significantly enhance few-shot classification, prototype-based classification, and slide retrieval tasks across multiple datasets involving human and rat tissues. The study demonstrates that TANGLE outperforms existing self-supervised and supervised learning baselines, showcasing its potential for improved diagnostic capabilities in pathology.
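The multimodal pre-training objective is, in spirit, a CLIP-style alignment between a slide embedding and the expression profile of the same case; the sketch below shows such a symmetric contrastive loss on random embeddings, as a generic illustration rather than TANGLE's exact objective.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(slide_emb, expr_emb, temperature=0.07):
    """CLIP-style loss pairing each slide embedding with the expression
    embedding of the same case; off-diagonal pairs act as negatives."""
    s = F.normalize(slide_emb, dim=-1)
    e = F.normalize(expr_emb, dim=-1)
    logits = s @ e.t() / temperature
    labels = torch.arange(s.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```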
The paper presents a novel framework called Tri-Perspective View Decomposition (TPVD) to enhance depth completion, an essential task in autonomous driving that involves reconstructing dense depth maps from sparse measurements. Unlike traditional methods that primarily utilize 2D representations or directly incorporate 3D point clouds, TPVD decomposes the 3D point cloud into three distinct 2D views, effectively allowing for the densification of sparse depth inputs while preserving 3D geometric information. The framework employs a TPV Fusion mechanism for recurrent 2D-3D-2D feature aggregation and introduces a Distance-Aware Spherical Convolution for improved geometric consistency. The proposed method outperforms existing state-of-the-art approaches on several benchmark datasets, including KITTI, NYUv2, and SUN RGB-D, and contributes a new depth completion dataset, TOFDC, acquired using mobile time-of-flight sensors.
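The decomposition itself amounts to scattering the point cloud into three orthogonal 2D views that can be densified with 2D networks; a minimal occupancy-map version is sketched below (the real TPVD pipeline adds distance-aware spherical convolution and recurrent 2D-3D-2D fusion, which this omits).

```python
import numpy as np

def tri_perspective_views(points, grid=64, extent=10.0):
    """Scatter a 3D point cloud into three orthogonal 2D occupancy maps
    (top, front, side). Simplified illustration of view decomposition."""
    views = {}
    for name, (a, b) in {"top": (0, 1), "front": (0, 2), "side": (1, 2)}.items():
        img = np.zeros((grid, grid))
        idx = np.clip(((points[:, [a, b]] + extent) / (2 * extent) * grid).astype(int),
                      0, grid - 1)
        img[idx[:, 0], idx[:, 1]] = 1.0
        views[name] = img
    return views

pts = np.random.uniform(-10, 10, size=(5000, 3))
views = tri_perspective_views(pts)
print({k: v.sum() for k, v in views.items()})
```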
The paper presents "U NO," an unsupervised world model designed to predict 3D occupancy over time using unlabeled LiDAR data. Unlike traditional supervised approaches that rely on costly annotated data, U NO learns to forecast a continuous 4D occupancy field, capturing the geometry, dynamics, and semantics of environments critical for self-driving vehicles. The model demonstrates state-of-the-art performance in downstream tasks such as point cloud forecasting and birds-eye view semantic occupancy prediction, even outperforming fully supervised methods when labeled data is scarce. U NO's ability to generalize and effectively represent complex scenes enhances safety for self-driving applications, particularly for infrequent or vulnerable road users.
The paper presents URHand, a pioneering universal relightable hand model that effectively generalizes across various viewpoints, poses, illuminations, and identities using light-stage data. Unlike existing photorealistic models that require extensive identity-specific data, URHand allows for quick personalization from simple mobile phone scans. The model integrates a spatially varying linear lighting approach with a hybrid neural-physical rendering framework that enhances fidelity and generalizability. By addressing challenges in cross-identity training and maintaining photorealism during real-time rendering, URHand achieves significant improvements over prior methods, demonstrating its capability for rapid adaptation to new identities and dynamic lighting conditions.
This paper presents a novel method for generating multi-view optical illusions using off-the-shelf text-to-image diffusion models. The authors introduce the concept of "visual anagrams," which are images that transform their appearance through various operations such as flips, rotations, and jigsaw rearrangements. The proposed technique operates in a zero-shot manner, estimating noise from different views and combining these estimates to create images that maintain visual coherence across transformations. The study includes both theoretical analysis and empirical results, demonstrating the effectiveness and flexibility of the method in generating a range of classic and innovative optical illusions, while also identifying design considerations critical for optimizing illusion quality.
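The sampler's key step is easy to isolate: predict noise in each view, map each prediction back to the canonical orientation, and average. The sketch below uses 90-degree rotations as the views and a dummy noise predictor standing in for a real text-conditioned diffusion model; only the combination rule is meaningful here.

```python
import numpy as np

def dummy_denoiser(x, prompt):
    """Stand-in for a text-conditioned diffusion noise predictor (hypothetical)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return 0.1 * rng.standard_normal(x.shape) + 0.05 * x

def combined_noise_estimate(x, prompts):
    """Predict noise in each view, undo the view transform, and average.
    A real sampler plugs this estimate into DDIM/DDPM updates."""
    estimates = []
    for k, prompt in enumerate(prompts):
        view = np.rot90(x, k)                  # apply the view transform
        eps = dummy_denoiser(view, prompt)     # denoise in that view
        estimates.append(np.rot90(eps, -k))    # map back to the canonical frame
    return np.mean(estimates, axis=0)

x = np.random.randn(64, 64)
eps = combined_noise_estimate(x, ["an old man", "a waterfall"])
print(eps.shape)  # (64, 64)
```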
The paper introduces Visual Program Distillation (VPD), a framework that enhances vision-language models (VLMs) by leveraging large language models (LLMs) to generate and execute programs that solve complex visual reasoning tasks. VPD addresses the shortcomings of previous approaches by generating multiple candidate programs, executing them, and filtering for correctness to distill effective reasoning steps into VLMs. Experimental results demonstrate that models trained with VPD, specifically PaLI-X, outperform existing state-of-the-art VLMs on a variety of benchmarks, improving their abilities in counting, spatial reasoning, and consistency of answers. Additionally, VPD is shown to effectively adapt models to new tasks, even in the absence of labeled data, underscoring its potential for real-world applications.
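The generate-execute-filter loop at the heart of VPD can be sketched as follows, with placeholder candidate programs and a stub tool in place of the LLM-written code and real vision APIs; only programs whose executed answer matches the label are kept as distillation targets.

```python
def distill_traces(image, candidate_programs, label, tools):
    """Run candidate programs against vision tools and keep those whose
    answer matches the label. `candidate_programs` and `tools` are
    hypothetical stand-ins for LLM-generated code and real vision modules."""
    kept = []
    for prog in candidate_programs:
        scope = {"image": image, **tools}
        try:
            exec(prog, scope)                  # program must define `answer`
        except Exception:
            continue
        if scope.get("answer") == label:
            kept.append(prog)                  # correct program -> training target
    return kept

# Example: two candidates for "How many dogs are in the image?"
tools = {"count_objects": lambda image, name: 2 if name == "dog" else 0}
progs = [
    "answer = count_objects(image, 'dog')",
    "answer = count_objects(image, 'cat')",
]
print(len(distill_traces(None, progs, 2, tools)))  # 1
```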
The paper presents WALT3D, a novel framework for automatically generating realistic training data from time-lapse imagery to improve 2D and 3D object reconstruction under severe occlusions. It addresses the challenge of limited labeled datasets for occluded object understanding by utilizing off-the-shelf predictions as pseudo-ground-truth to create composite images that maintain realistic occlusion configurations. The method shows significant enhancements in both segmentation and shape reconstruction accuracy, particularly in urban environments where occlusions are common, and demonstrates scalability by eliminating the need for human labeling. Overall, WALT3D provides an efficient solution for training object reconstruction models in complex visual scenarios.