NJIT and NYU Researchers Advance AI Audio Captioning for Deaf and Hard-of-Hearing with NSF Grant
New Jersey Institute of Technology (NJIT) and NYU researchers are joining forces on a newly funded digital accessibility effort to make online video content more inclusive for deaf and hard-of-hearing (DHH) viewers.
Backed by $800,000 in National Science Foundation funding over the next three years, NJIT professors Mark Cartwright and Sooyeon Lee are helping develop AI systems that automatically identify and caption meaningful non-speech sounds in online videos, from off-screen cues such as footsteps and door slams to background music.
Their latest project, in partnership with NYU Assistant Professor Magdalena Fuentes, aims to address a persistent gap in current accessibility tools.
While automated speech recognition technology has made dialogue captions nearly ubiquitous on platforms like YouTube, non-speech audio remains largely uncaptioned, appearing in just 4% of popular videos, according to the team’s research so far.
For over a billion people worldwide with hearing loss, this gap means missing out on critical context and cues.
“Non-speech sounds in video — alerts, music cues, and ambient effects — that convey important information are often inaccessible to D/deaf and hard-of-hearing viewers,” said Lee, NJIT assistant professor of informatics. “This not only impairs their immediate understanding of content but also contributes to long-term exclusion from learning opportunities, communication and civic engagement — areas that increasingly rely on video in everyday life. We’re excited about our project’s potential to close this critical accessibility gap.”
“Current automated audio captioning models aim to describe every sound they hear, just like a person would. However, caption users don’t actually want descriptions of all sounds, and furthermore, how something should be described depends on the situation and the user’s needs,” added Cartwright, also an assistant professor of informatics. “The biggest challenge will be effectively leveraging sonic, visual and narrative contextual information, along with user preferences, to determine which sounds should be captioned and how to describe them.”
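To illustrate the kind of decision Cartwright describes, the sketch below shows, in Python, how detected non-speech sound events might be filtered and phrased according to an individual viewer's preferences. It is a hypothetical illustration only, not the team's actual system: the SoundEvent and ViewerPreferences structures, the select_captions function, and the sound labels and thresholds are all assumptions made for the example.

```python
# Hypothetical illustration only -- not the NJIT/NYU team's actual pipeline.
# Shows how detected non-speech sound events might be filtered and phrased
# according to an individual viewer's preferences.
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str          # e.g. "footsteps", "door slam", "background music"
    start: float        # onset time in seconds
    confidence: float   # detector confidence, 0.0-1.0
    on_screen: bool     # whether the sound source is visible in the frame


@dataclass
class ViewerPreferences:
    include_music: bool = True       # caption background music?
    include_ambient: bool = False    # caption ambient sounds (rain, traffic)?
    min_confidence: float = 0.5      # ignore low-confidence detections
    verbosity: str = "brief"         # "brief" labels or "descriptive" phrases


# Assumed label sets and phrasings, purely for the example.
AMBIENT_LABELS = {"rain", "wind", "traffic"}
DESCRIPTIVE_PHRASES = {
    "footsteps": "slow footsteps approaching",
    "door slam": "a door slams shut",
    "background music": "tense background music",
}


def select_captions(events, prefs):
    """Return (time, caption) pairs for the sounds this viewer wants captioned."""
    captions = []
    for ev in events:
        if ev.confidence < prefs.min_confidence:
            continue
        if ev.label == "background music" and not prefs.include_music:
            continue
        if ev.label in AMBIENT_LABELS and not prefs.include_ambient:
            continue
        if prefs.verbosity == "descriptive":
            text = DESCRIPTIVE_PHRASES.get(ev.label, ev.label)
        else:
            text = ev.label
        # Off-screen sources are flagged because the viewer cannot see them.
        if not ev.on_screen:
            text = f"off-screen: {text}"
        captions.append((ev.start, f"[{text}]"))
    return captions


if __name__ == "__main__":
    events = [
        SoundEvent("footsteps", 12.4, 0.92, on_screen=False),
        SoundEvent("background music", 0.0, 0.88, on_screen=False),
        SoundEvent("rain", 30.1, 0.70, on_screen=True),
    ]
    prefs = ViewerPreferences(include_music=False, verbosity="descriptive")
    for start, text in select_captions(events, prefs):
        print(f"{start:6.1f}s  {text}")
```

In this toy run, the off-screen footsteps are captioned descriptively while the background music and ambient rain are dropped, reflecting one viewer's stated preferences; the real research question the team describes is how to learn such decisions from sonic, visual and narrative context rather than hand-written rules.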
The team is designing their AI system to adapt caption style and detail to individual viewer preferences, a need revealed through extensive community engagement and user research.
Lee, who specializes in human-centered computing, has led a survey of 168 deaf and hard-of-hearing individuals, uncovering just how varied user needs can be.
“One of the most revealing moments came when a Deaf interviewee told us, ‘We want to have choices. ... That’s the dream!’ Through our research, we found that preferences around non-speech sound captioning — what types of sounds should be included and how they should be described — are incredibly diverse,” explained Lee. “A static best-practice guide would result in a one-size-fits-none solution. These insights fundamentally shaped our design approach.
“We aim to build an AI-powered system that adapts to individual preferences by accounting for multiple influencing factors — supporting personalization and making non-speech sound information more accessible, meaningful and inclusive for all DHH viewers.”
Cartwright and Lee are conducting interviews and collaborative design workshops with deaf and hard-of-hearing participants, as well as with content creators, to ensure the technology is accurate and helpful in real-world contexts.
To help drive industry adoption, the team is partnering with Adobe, Google/YouTube and New York Public Radio. The researchers also plan to make their software and datasets publicly available to support broader advances in accessibility.
However, Cartwright and Lee say the project’s promise extends beyond accessibility for the deaf and hard-of-hearing.
“This technology will also benefit individuals with varied accessibility needs — such as those experiencing hearing decline, neurodiverse users, individuals engaging with content in situations or environments where sound is unavailable or impractical, as well as non-native speakers,” Lee said.
At NYU, Fuentes will head the project’s AI research, focusing on integrating multiple types of contextual information into the captioning models. At NJIT, Lee is leading user research and participatory design efforts to understand stakeholders’ needs and develop user-facing solutions. Cartwright will manage the overall project, oversee data collection and lead development of the interactive and adaptive features of the captioning system.
“NJIT’s contribution to the partnership is our expertise in human-centered computing and accessibility, which are both crucial for creating effective captioning solutions for deaf and hard-of-hearing people,” noted Cartwright.
Initial AI prototypes are expected within two years, with a full-featured platform targeted for 2028.
“We expect that our platform will serve as a proof of concept and new vision for what captioning can be,” added Cartwright. “Through partnerships with companies and community groups, we hope parts of our approach will be adopted in real products and industry standards, making online video more accessible for everyone.”

