CVPR Technical Program Features Presentations on the Latest AI and Computer Vision Research

LOS ALAMITOS, Calif., 16 May 2024 – Co-sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), the 2024 Computer Vision and Pattern Recognition (CVPR) Conference is the preeminent event for research and development (R&D) in the hot topic areas of computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and related fields. Over the past decade, these areas have seen significant growth, and the emphasis on this sector by the science and engineering community has fueled an increasingly competitive technical program.

This year, the CVPR Program Committee received 11,532 paper submissions—a 26% increase over 2023—but only 2,719 were accepted, resulting in an acceptance rate of just 23.6%. Of those accepted papers, only 3.3% were slotted for oral presentations based on nominations from the area chairs and senior area chairs overseeing the program.

“CVPR is not only the premier conference in computer vision, but it’s also among the highest-impact publication venues in all of science,” said David Crandall, Professor of Computer Science at Indiana University, Bloomington, Ind., U.S.A., and CVPR 2024 Program Co-Chair. “Having one’s paper accepted to CVPR is already a major achievement, and then having it selected as an oral presentation is a very rare honor that reflects its high quality and potential impact.”

Taking place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., CVPR offers oral presentations that speak to both fundamental and applied research in areas as diverse as healthcare applications, robotics, consumer electronics, autonomous vehicles, and more. Examples include:

- Pathology: Transcriptomics-guided Slide Representation Learning in Computational Pathology* – Training computer systems for pathology requires a multi-modal approach for efficiency and accuracy. New work from a multi-disciplinary team at Harvard University (Cambridge, Mass., U.S.A.), the Massachusetts Institute of Technology (MIT; Cambridge, Mass., U.S.A.), Emory University (Atlanta, Ga., U.S.A.) and others employs modality-specific encoders. When applied to liver, breast, and lung samples from two different species, the approach demonstrated significantly better performance than current baselines.
- Robotics: SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes – Creating realistic interactions in 3D scenes has been troublesome from a technology perspective because it has been difficult to manipulate objects in the scene context. Research from ETH Zürich (Zürich, Switzerland), Google (Mountain View, Calif., U.S.A.), Technical University of Munich (TUM; Munich, Germany), and Microsoft (Redmond, Wash., U.S.A.) has begun bridging that divide by creating a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. This work, as the paper concludes, has the potential to “stimulate advancements in embodied AI, robotics, and realistic human-scene interaction modeling.”
- Virtual Reality: URHand: Universal Relightable Hands – Teams from Codec Avatars Lab at Meta (Menlo Park, Calif., U.S.A.) and Nanyang Technological University (Singapore) unveil a hand model that generalizes to novel viewpoints, poses, identities, and illuminations, which enables quick personalization from a phone scan. The resulting images make for a more realistic experience of reaching, grabbing, and interacting in a virtual environment.
- Human Avatars: Semantic Human Mesh Reconstruction with Textures – Working to create realistic human models, teams at Nanjing University (Nanjing, China) and Texas A&M University (College Station, Texas, U.S.A.) designed a method of 3-D human mesh reconstruction that is capable of producing high-fidelity and robust semantic renderings that outperform state-of-the-art methods. The paper concludes, “This approach bridges existing monocular reconstruction work and downstream industrial applications, and we believe it can promote the development of human avatars.”
- Text-to-Image Systems: Ranni: Taming Text-to-Image Diffusion for Accurate Instruction – Existing text-to-image models can misinterpret more difficult prompts, but now, new research from Alibaba Group (Hangzhou, Zhejiang, China) and Ant Group (Hangzhou, Zhejiang, China) has made strides in addressing that issue via a middleware layer. This approach, which they have dubbed Ranni, supports the text-to-image generator in better following instructions. As the paper sums up, “Ranni shows potential as a flexible chat-based image creation system, where any existing diffusion model can be incorporated as the generator for interactive generation.”
- Autonomous Driving: Producing and Leveraging Online Map Uncertainty in Trajectory Prediction – To enable autonomous driving, vehicles must be pre-trained on the geographic region and potential pitfalls. High-definition (HD) maps have become a standard part of a vehicle’s technology stack, but current approaches to those maps are siloed in their programming. Now, work from a research team from the University of Toronto (Toronto, Ontario, Canada), Vector Institute (Toronto, Ontario, Canada), NVIDIA Research (Santa Clara, Calif., U.S.A.), and Stanford University (Palo Alto, Calif., U.S.A.) enhances current methodologies by incorporating uncertainty, resulting in up to 50% faster training convergence and up to 15% better prediction performance.

“As the field’s leading event, CVPR introduces the latest research in all areas of computer vision,” said Crandall. “In addition to the oral paper presentations, there will be thousands of posters, dozens of workshops and tutorials, several keynotes and panels, and countless opportunities for learning and networking. You really have to attend the conference to get the full scope of what’s next for computer vision and AI technology.”

Digital copies of all final technical papers* will be available on the conference website by the week of 10 June to allow attendees to prepare their schedules. To register for CVPR 2024 as a member of the press and/or request more information on a specific paper, visit https://cvpr.thecvf.com/Conferences/2024/MediaPass or email media@computer.org. For more information on the conference, visit https://cvpr.thecvf.com/.

*Papers linked in this press release refer to pre-print publications. Final, citable papers will be available just prior to the conference.

About CVPR 2024

The Computer Vision and Pattern Recognition Conference (CVPR) is the preeminent computer vision event for new research in support of artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more. Sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), CVPR delivers the important advances in all areas of computer vision and pattern recognition and the various fields and industries they impact. With a first-in-class technical program, including tutorials and workshops, a leading-edge expo, and robust networking opportunities, CVPR, which is annually attended by more than 10,000 scientists and engineers, creates a one-of-a-kind opportunity for networking, recruiting, inspiration, and motivation.

CVPR 2024 takes place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., and participants may also access sessions virtually. For more information about CVPR 2024, visit cvpr.thecvf.com.

About the Computer Vision Foundation

The Computer Vision Foundation (CVF) is a non-profit organization whose purpose is to foster and support research on all aspects of computer vision. Together with the IEEE Computer Society, it co-sponsors the two largest computer vision conferences, CVPR and the International Conference on Computer Vision (ICCV). Visit thecvf.com for more information.

About the IEEE Computer Society

Engaging computer engineers, scientists, academia, and industry professionals from all areas and levels of computing, the IEEE Computer Society (CS) serves as the world’s largest and most established professional organization of its type. IEEE CS sets the standard for the education and engagement that fuels continued global technological advancement. Through conferences, publications, and programs that inspire dialogue, debate, and collaboration, IEEE CS empowers, shapes, and guides the future of not only its 375,000+ community members, but the greater industry, enabling new opportunities to better serve our world. Visit computer.org for more information.
