AXSChat Podcast

Unraveling the World of Audio Descriptions and Artificial Intelligence

Antonio Santos, Debra Ruh, Neil Milliken talk with Dan Sommer

What happens when a professional singer and language teacher pivots into the world of accessibility? Meet Dan Sommer, CEO and Founder of Empire Caption Solutions, who shares his fascinating journey and how he's working to balance the quality of human captioners with the scalability of technology. Discover how his unique background has shaped his approach to accessibility and the vital role of AI and speech-to-text tools in creating meaningful audio descriptions.

In this captivating episode, we explore the complexities of audio descriptions and the challenges involved in creating them, as well as the platforms that are excelling in this area. Find out how the progress of ASR (automatic speech recognition) over time has impacted the captioning process and how emerging technology can be used to build on top of quality products. Join us as we dive into the future of accessibility services and discuss the intersection of accessibility, technology, and language with our insightful guest, Dan Sommer.

Support the show

Follow axschat on social media.
Bluesky:
Antonio https://bsky.app/profile/akwyz.com

Debra https://bsky.app/profile/debraruh.bsky.social

Neil https://bsky.app/profile/neilmilliken.bsky.social

axschat https://bsky.app/profile/axschat.bsky.social


LinkedIn
https://www.linkedin.com/in/antoniovieirasantos/
https://www.linkedin.com/company/axschat/

Vimeo
https://vimeo.com/akwyz

Twitter
https://twitter.com/axschat
https://twitter.com/AkwyZ
https://twitter.com/neilmilliken
https://twitter.com/debraruh

Neil Milliken:

Hello and welcome to AXSChat. I'm delighted that today we're joined by Dan Sommer, who is the CEO of Empire Caption Solutions. It's just Antonio and I today; Debra is off on some personal business, but we're delighted to have Dan with us. So, Dan, really pleased to have you here. We were talking a little bit before we came on air about your background, and I think it's interesting. We all come to accessibility through a fairly circuitous route, so can you tell us a little bit about yourself, how you came to be working in this space and what it is that you're doing right now?

Dan Sommer:

Sure, thanks. Thanks so much for having me. It's really great to be here. Yeah, we all come to accessibility from different routes and different paths. I started out as a professional singer in New York and also taught German, English and Latin pronunciation to singers and choirs and coaches and things like that, and it was really through there that I got an interest in technology. It was around the time when Khan Academy was starting up, so the flipped-classroom thing was going on, and I became interested in tech through that. And then, you know, I became a TypeWell transcriber, which is meaning-for-meaning, and that's real time.

Dan Sommer:

It's usually in high school, undergrad, one-on-one settings where individual accommodations are needed, and from there I became more interested in accessibility, video accessibility. I got more into closed captioning for pre-recorded video. The schedule was a little bit better; you could get it done by the end of the day and feel good. And from there, you know, the video accessibility component became really quite fascinating: things like closed captions, subtitling or translation, and then, you know, learning more about audio description. So it's been a really great way to kind of keep my curiosity going and, you know, try to help people where I can and figure out how to utilize technology. I've been doing that for about the past 10 years or so, and we're in a really exciting time. And yeah, that's kind of how I got into it.

Neil Milliken:

So, yeah, I think I also fell into accessibility. I fell into sort of playing around with speech recognition systems about 20 years ago and fell in love with the topic. At that point it was a really optimistic time in tech, you know, and I have to say I'm a lot more guarded and cautious about where technology is going now than I was then. I was hugely excited by all tech and its potential to revolutionize inclusion. I still think it can, and we'll definitely talk about that. But I'm much more guarded because of some of the things that have happened over recent years, where we've really not thought things through too well before unleashing very powerful technologies on society. So, yeah, I have a passion for using tech for inclusion.

Neil Milliken:

I think that there's so much to be excited about in the realms of speech and language. As someone that has been shouting at computers for decades with varying degrees of success, you know, the accuracy of speech recognition systems now is just phenomenal. It's not perfect, right? You know, we are supported by MyClearText; they're human captioners, and they'll quickly tell you, you know, that humans still trump machines in many circumstances.

Neil Milliken:

I've been witness to how much better this stuff has got and how much more flexible it's got, with multiple speakers and with no training, because I spent an awful long time reading, you know, "Alice fell down the rabbit hole" to IBM ViaVoice, and the various training scripts for Dragon NaturallySpeaking. So, yeah, I'm very excited about the potential for speech interfaces and quality captions everywhere, because, whilst humans are still better, there aren't enough qualified humans to do a good job of it. So, you know, there's a balance. So tell us a bit about how you've been working, because I know we were talking about this as well, to play with the technology and balance that sort of quality of humans with the scalability of tech.

Dan Sommer:

Yeah, it's been so interesting, like you say, to see the progress over time. And what really got me hooked was when IBM Watson came out and the whole Jeopardy thing, you know, and the level of the speech-to-text. At that point, I think it was around 2016, I was like, wow, this is amazing, it's better than ever before. And already then I was trying to see how we can use ASR, you know, and maybe edit the ASR. And what we found pretty quickly was that there were just so many different settings where captions need to be of a certain quality, and the ASR really was only good in a handful of them. And so we kind of left that for a while, and then tried to identify, or break up, the captioning process into the parts where people are really well suited. So, you know, transcribers really love transcribing, but they don't always love the syncing portion or the breaking up of captions, that part of it. And so we really tried to dive in and figure out what are the parts that are needed, and then build as much technology around that so that the logistics of file handling and everything work well. And yeah, what we're seeing now is, I think, the ASR, we were talking a little bit about Whisper before, you know, and seeing that kind of got my imagination going again on how we can better identify this time. I think you mentioned earlier, we maybe move ahead, or we're very tempted to move ahead, and we're not always super cautious about making sure that it's going to be the correct tool for the purpose.
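
For readers curious about the mechanics: the "breaking up of captions" step is essentially grouping timed words into readable cues. Below is a minimal sketch of one way to automate it, assuming word timings from an ASR engine or forced aligner; the sample words, the 42-character limit and the SRT output are illustrative choices, not Empire Caption Solutions' actual workflow.

```python
# A sketch of the "breaking up of captions" step: group timed words
# into caption cues capped at a readable line length, then print SRT.
# The word list below is a hypothetical stand-in for real ASR/aligner output.

MAX_CHARS = 42  # a common guideline for one caption line

def words_to_cues(words, max_chars=MAX_CHARS):
    """Group (text, start, end) word tuples into (start, end, text) cues."""
    cues, current, length = [], [], 0
    for text, start, end in words:
        if current and length + 1 + len(text) > max_chars:
            cues.append((current[0][1], current[-1][2],
                         " ".join(w[0] for w in current)))
            current, length = [], 0
        current.append((text, start, end))
        length += len(text) + (1 if length else 0)
    if current:
        cues.append((current[0][1], current[-1][2],
                     " ".join(w[0] for w in current)))
    return cues

def srt_time(t):
    """Format seconds as an SRT timestamp, e.g. 00:00:01,400."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{round((t % 1) * 1000):03}"

words = [("Hello", 0.0, 0.4), ("and", 0.5, 0.6), ("welcome", 0.7, 1.1),
         ("to", 1.2, 1.3), ("AXSChat.", 1.4, 2.0)]
for i, (start, end, text) in enumerate(words_to_cues(words), 1):
    print(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
```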

Dan Sommer:

Really understanding whether something is for general access or individual access or something in between is super important. And also just kind of figuring out now: where are human beings going? Where are human captioners going, in either real time or post-production? How can we leverage that good work that they do? Because, like you said, humans are really still the best at handling certain things, like multiple speakers, crosstalk, accents, and additionally, just making sure that things like environmental sounds are taken into consideration: a car horn, an alarm beeping. I had a student once who was in a lab, and even if the ASR was perfect, it wouldn't have picked up the fire alarm that was going on in the background. And so, as great as the speech-to-text is becoming, you know, it's still kind of lacking in certain areas. But I think the challenge right now, and the biggest conversation, is to really identify, in a very narrow sense, where are those places where it can be used?

Dan Sommer:

Because, like you said, there are not enough human captioners to go around for everything, but we want to make sure that we're giving people a quality product, you know, that's going to be usable. And sometimes ASR is great: in meetings where it's just one or two people, you know, that's a great setting. But in classrooms, where there's discussion going on, you know, you still need a human. So I think the biggest thing we're doing right now is just kind of taking an inventory of all the kinds of requests that come in, all the settings that we're working in, and trying to redefine, you know, what tools can be used where, effectively.

Antonio Santos:

So then, considering all the changes that have taken place over the last couple of months, how do you see the future of accessibility services evolving?

Dan Sommer:

Yeah, that's a great question. If I had a full answer... you know, I think we're all looking for that answer of how it's going to go in the future. But I see one place in particular: I'm wondering if leveraging human transcription might be better suited for things like auto-translation or summarization tools. We're talking a lot about chatbots and things, and the new ChatGPT, and being able to give it custom data. So again, it's kind of identifying where that human transcript or those human captions are going to be best suited, but also, how can we use this new emerging technology to build on top of quality products? And that might be, you know, for schools or classrooms, having the transcript go into a bot so that the student can interact with it in a more meaningful way after class. You know, can it be translated to other languages so more people have access, and the quality of that translation, or the auto-translation, is better?
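
As an aside, here is a hedged sketch of that idea: layering a chat model on top of a finished human transcript to produce a post-class study aid. It uses the OpenAI Python client (v1.x); the model name, prompt and file name are illustrative assumptions, not tools endorsed in the episode.

```python
# A hedged sketch: layer a chat model on top of a finished human transcript
# to generate a post-class study aid. Uses the OpenAI Python client (v1.x);
# the model name, prompt and file name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def lecture_summary(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model could stand in here
        messages=[
            {"role": "system",
             "content": ("Summarize this lecture transcript as five bullet "
                         "points a student can review after class.")},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

with open("lecture_transcript.txt") as f:
    print(lecture_summary(f.read()))
```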

Dan Sommer:

Yeah, in terms of providers: can providers leverage these tools to train themselves better, to learn how to work in contexts and subject matters that they are unfamiliar with, without having to go in cold, or without having to kind of start at the bottom at the cost of the consumer? Is there more prep that providers can do? Are providers going to need to expand their skill sets to include more accessibility services? So yeah, I don't quite know exactly how everything's going to transform, you know, in terms of consumers or providers, but I think there are a lot of interesting conversations to have. And one of the great things about us all being in accessibility is we're really used to having conversations about needs and wants and desires, and so we're really well poised to engage in this right now and help move things forward in a way that's going to work for everybody, and make sure we don't fall into too many, you know, pitfalls.

Antonio Santos:

Now, if you look at, you know, the business model of companies working in this space, they are trying to balance what is the right pricing model, no? We have people who are not able to afford this. Even if I look today at some of the technology that is on the market, some people are not able to afford the captions for a 60-minute call; they might do it with a lot of manual work instead. How are we able to find a balance between every company that is doing this making money, and at the same time allowing everyone to somehow be able to use these tools in the right way?

Dan Sommer:

Yeah, that's always a question, like budget, you know, and making sure that it's affordable. And yeah, human captions still cost quite a bit. Even when we are able to utilize the tech, someone needs to look over it and refine it, whether that's a professional or someone within the organization, an employee or something.

Dan Sommer:

I think one of the things that might be helpful, as we learn about how ASR works, or how the ASR tools work in different contexts, is making sure that the speakers and the participants are engaging in a way that's going to optimize the ASR.

Dan Sommer:

So: making sure that participants don't talk over each other, making sure you have a good microphone, you know, quality internet. There are certain things that we can do to prepare and make sure that the ASR is going to work best. And so if you are in one of those settings where you really need to use ASR, you know, just being aware of those considerations so that you're getting the most out of it, and it's going to be, you know, most effective for everybody. And then, you know, I think a lot of times we're thrown into positions where, or worse, we have a certain platform and we have to use a certain ASR; we don't have a choice. Certain platforms use their own, others give you options, you know. So, you know, taking the time to experiment a little, and then provide guidance to the people you're going to be working with or interacting with on that exchange.

Neil Milliken:

Okay, thank you. So, yes, I mean, a lot of people, especially in enterprises, are locked into a technology stack, and that means you're locked into choices about speech recognition, built-in assistive features, et cetera. So what your CIO has chosen will actually have an impact on what features you can use. I'm really interested in another aspect of what you do, which is not captions: audio description. I think that this is an area that's not so well understood. People understand captions.

Neil Milliken:

You know, Gen Z and, to a certain extent, you know, millennials and people like me often use captions, right? Loads of people use captions, not necessarily from an accessibility point of view, but from a convenience point of view, from a lifestyle point of view. People understand captions, I think, pretty well now. But audio description is, you know, not as well understood, and certainly not as well integrated into video platforms. And, being totally honest, as a dyslexic person, producing audio descriptions and image descriptions puts me through my own disability pain barrier, because I'm having to create content and use words and all the rest of it, and that sometimes can be challenging or take energy. So I'm really interested to hear about how you're applying technology to create audio description, and then where you're applying it and in what context.

Dan Sommer:

Yeah, absolutely. So, like you said, audio description is really not terribly well understood, especially online. I didn't really know too much about it until a few years ago. It actually started in theaters and operas and plays and galleries and museums, as a way for people who are blind or have low vision to engage with the visual components of what's happening on stage, or in the painting, or in the gallery, and things like that. And then that kind of, I'd call it theatrical, audio description started taking hold with online entertainment, so TV shows and films and things like that. And so most people who have seen it know it through things like Netflix and the Netflix series, because they're all very good at making sure there's audio description. But when we look to more kind of casual or non-theatrical settings online, WCAG has a lot of requirements about it. It actually appears three times, and it's super confusing if you look at it.

Dan Sommer:

So, most of the time, what we see is called standard audio description, and that's where the film or the video stays the same duration, and in the silences, between the dialogue, you hear the narration of what's going on on the screen. But with things like educational videos, you know, or lectures or talking-head videos, there's often not a gap in the dialogue to describe what's on screen, and if the professor or the person talking hasn't really said what's on the slide, someone's missing a whole bunch of information. And so extended audio description is where you have to go in and create a whole other version, by pausing and freezing the frame and then inserting the narration. And the whole process has been very complicated, because it involves many people, unlike captioning, where you can do it with one person if you want, or you can break it up. It's very rare to find someone who is equipped to do the script writing, the narration and then the video editing. So those have been kind of the challenges, just the logistical challenges. But where we can see some really useful advances with AI and audio description is actually using Whisper.
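
For readers who want to see the mechanics of that freeze-and-insert editing, here is a minimal sketch of extended audio description using the moviepy library (1.x API). The file names and pause points are hypothetical placeholders; this is an illustration, not Dan's tooling.

```python
# A minimal sketch of "extended" audio description with moviepy (1.x API):
# pause the video on a frozen frame while a narration clip plays, then resume.
# File names and pause points are hypothetical placeholders.
from moviepy.editor import (AudioFileClip, ImageClip, VideoFileClip,
                            concatenate_videoclips)

video = VideoFileClip("lecture.mp4")

# (pause_at_seconds, narration_file) pairs written by a describer
insertions = [(12.0, "ad_slide1.mp3"), (95.5, "ad_diagram.mp3")]

parts, cursor = [], 0.0
for pause_at, narration_path in insertions:
    parts.append(video.subclip(cursor, pause_at))      # play up to the pause
    narration = AudioFileClip(narration_path)
    frozen = (ImageClip(video.get_frame(pause_at))     # freeze the frame...
              .set_duration(narration.duration)
              .set_audio(narration))                   # ...and narrate over it
    parts.append(frozen)
    cursor = pause_at
parts.append(video.subclip(cursor))                    # the rest of the video

concatenate_videoclips(parts).write_videofile(
    "lecture_extended_ad.mp4", fps=video.fps)
```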

Dan Sommer:

There's a research project going on at Oxford where they took Whisper and reviewed 7,000-something, you know, videos with audio description, and they were able to tag where the speakers were, tag where the audio description was, and combine them. What I found so interesting was that they used the context from the Whisper transcripts to inform where there could be audio description narration, and then provided notes on what the key components are and what's going to be helpful to include. And that aspect of audio description creation has also been very time-consuming: going through, you know, three or four times to get a sense of what's going to be most meaningful, where you can put it, where you should be putting in audio description narration. Using context, you can speed up this process quite a bit. It's not meant to replace the process or automate it completely, but it can do about 50 to 60% of the heavy lifting. And so that's been an interesting place where AI and speech-to-text, paired with, you know, being able to come up with context and understand context, has been super helpful. It's very promising; it's still kind of in the research phase. Then there are more accessible media players, media players that support audio description.
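
In the spirit of that research, and only as an illustration rather than the Oxford team's code, a short sketch can run Whisper and use its segment timestamps to flag dialogue gaps long enough to hold a standard audio description cue. The file name and the 2.5-second threshold below are assumptions.

```python
# An illustration in the spirit of the research Dan describes (not the
# Oxford team's code): run Whisper, then use segment timestamps to flag
# dialogue gaps long enough for a standard audio description cue.
# The file name and the 2.5-second threshold are assumptions.
import whisper

model = whisper.load_model("base")
result = model.transcribe("episode.mp4")
segments = result["segments"]

MIN_GAP = 2.5  # seconds of silence considered usable for narration

for prev, nxt in zip(segments, segments[1:]):
    gap = nxt["start"] - prev["end"]
    if gap >= MIN_GAP:
        print(f'{prev["end"]:7.1f}s -> {nxt["start"]:7.1f}s '
              f'({gap:.1f}s free) after: "{prev["text"].strip()}"')
```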

Dan Sommer:

That's been another challenge, just the technology aspect, and in many ways it's pretty straightforward: it's overlaying audio in different places and having it play at the right time. But, you know, we now have tools.

Neil Milliken:

Yeah, it does sound really good, that research project, because understanding context is really important. You know, it's all very well to describe what a room looks like, but it's the context that really makes it meaningful. And the other challenge, I think you just alluded to it, is that lots of the platforms don't support AD. So you've got on-demand players like iPlayer and Netflix and so on, and they support audio description, but a lot of the online platforms don't. Or even if they do support it, it's challenging for individuals to be able to upload it, or to know how to create that separate audio description track, or, you know, whether they're doing extended audio description, or whether they want people to be able to toggle it on and off. So I think there's still quite a bit of work to be done there. Are there particular players or platforms that you think are doing this well right now?

Dan Sommer:

Um, it is a challenge. And yeah, I think YouTube recently allowed for a secondary video to be uploaded, while Wistia allows for a secondary audio track to be uploaded. And this is also, I think, part of the issue: there are no standards on what the final output is. So if you're going to load it to YouTube, that's great, but then you have to create a whole new video. If you're going to load it to Wistia, you just need to upload a secondary audio track, you know, similar to dubbing, you know, or voiceovers. I believe Vimeo also supports it. There are a lot of players that support it one way or the other.

Dan Sommer:

But I think the biggest challenge is what you mentioned: there are really no standards for the workflow, and there's really not enough training for video online. There's tons and tons of training for theater and live settings and entertainment. You know, Netflix and Amazon, all of them have their own internal workflows, but they're also not compatible with each other. So I think part of the issue is just standardizing that, giving people more tools so that they can do things by themselves and make audio description outputs more easily. And that's also something I've been playing around with and working with some colleagues on: how we can support that, and what tools we need to create to do that. So it's a combination of style guides, best practices, and then just the technology and the workflow to support that, and giving people training on the different aspects of that.

Antonio Santos:

Now, that part, at the moment, is in the way our platforms are working. Today, if I want to upload a video and have it available on LinkedIn, on Twitter, on Facebook, whatever, I almost need to have 15 videos, correct? One for each. And it's not possible to manage all that; you know you are going to make mistakes somewhere for sure. And even if you are trying to make your videos as accessible as possible, it's just not practical; you basically can't do it. And then, on top of that, you have something else: every platform wants to keep users on their platform, correct? So if you post a link from a platform that is accessible on another, the algorithm of that platform is not going to give you the few extra points it definitely gives if you're posting natively. So it's a very complicated situation that, in fact, is not helping those who need to consume the content.

Dan Sommer:

Yeah, that's a great point. And yeah, I think, I don't know how much this is considered, but like what you said, you have to have different versions of it and you have to send people to different places, and that is kind of an SEO nightmare. If you are trying to monetize your content and you have to have separate videos, it doesn't give you a real good overview of who's engaging, and you have to look in a lot of different places. So I think that's also why YouTube created that, because, even though it's a secondary video, you can click a button and it's still associated with that video. And same with Wistia and the other platforms I mentioned. But yeah, a lot of times, if the platform doesn't support it, you're creating a whole bunch of videos, and it's just a big mess. It's difficult to manage.

Neil Milliken:

Yeah, and that then becomes an issue and a reason for people to push back, because we like to say, well, accessibility is not difficult, all we need to do is this: create the 15 videos and the 16 different transcripts and upload them to these 43 different sites. Yep. Even if you're the most committed enterprise in the world with the biggest, deepest pockets, it's still going to be difficult to do all of that stuff. You're still going to be challenged for resources. So I think we need to be mindful, before we go and chastise people for not being fully accessible on some of these things, that actually, dependent on platform, or choices, or time, or their own knowledge or abilities, what we're asking for may be beyond their capabilities or their affordability, or they just don't have all of the time to do this.

Neil Milliken:

Now, as an accessibility person, me making excuses for people not being accessible sticks in the throat a little, but we also need to be pragmatic, and we need to be realistic when we're advocating for this stuff. So that's where standardization comes in, and you need that sort of agreed workflow or agreed way of doing things. Now, standards take a long time, and standards bodies are interesting beasts, and also, often those standards aren't necessarily easy for lay people to understand. So if we want mass media and citizen journalism and all of those good things that the internet's enabled to be accessible, then we really need to rethink some of this stuff, and maybe some of the new technologies that are emerging right now can help us with that. You wanted to say something?

Dan Sommer:

Yeah, exactly. I think providing audio description right now is a challenge in many ways. Even if you have all the tools and the skills, it's time-consuming, and if you don't, it's very expensive. And what I think is interesting is, kind of looking back at the history of closed captioning, we can learn a lot. It started back in the 70s; Julia Child's cooking show was the first to have open captions, in 1972 or '73. And from there, it wasn't until the early 80s that there was standardization of the equipment used for broadcast television and broadcast captioning. And then, as time went on and the internet became a thing, captions were still very expensive to produce. At a certain point they were spending $600 on captions for a one-hour TV program, which is a small drop in the budget for a television show; but when you're a professor who has to make your own lectures and things, $600 an hour is cost-prohibitive. And so over, I think, the past two decades or so, we've standardized a lot of captioning on the web.

Dan Sommer:

We've provided more tools to make it accessible to non-professionals, and I think we can learn a lot from that process in captioning as we go forward with audio description. And, you know, I think more people are willing to have a conversation about standards. Everyone is, of course, going to have their opinion, and that's kind of what takes the longest to sort through. But I think there's just so much that we can learn from what's happened in the past with captioning as we move forward with audio description, and that makes me really hopeful. And I hope that we can maybe do that in a shorter time than the few decades it took to standardize captioning. And we still disagree on that; you know, there are lots of thoughts on how that should be done. So I'm hopeful.

Neil Milliken:

Okay, I think that, yeah, generally, with the acceleration of the adoption of technologies and processes and everything else, so am I. So thank you so much, Dan. It's been a fascinating conversation, and I really look forward to continuing it on social media. I need to thank Amazon and MyClearText for keeping us on air and keeping the captions accessible. So thank you once again; it's been great. Thank you.

Dan Sommer:

Thanks for having me.
