Combating Workplace Video Fatigue with AI
The COVID-19 pandemic has accelerated the transition of enterprise video conferencing from a “nice-to-have” to a necessity. According to a Zoom blog post, the platform skyrocketed to 300 million daily meeting participants in April 2020, up from 10 million in December 2019. Video content is central to everything enterprises do today, from meetings and team-building happy hours to virtual events, training sessions, and more. So much is being watched that it has also led to massive video fatigue, reminding us all that there can be too much of a good thing.
Although the pandemic may have been the spark that accelerated video conferencing in the workplace, this shift is only beginning, as hybrid work environments are likely here to stay. A recent survey from Metrigy found that more than 57% of enterprises expect to increase their deployments of room videoconferencing. As reliance on video accelerates, and fatigue grows with it, the situation calls for solutions that let employees engage with content far more efficiently.
Beyond that, enterprises have an abundance of recorded video content and no simple way to surface or recall its insights later to inform strategy, proposals, whitepapers, and other common forms of enterprise collateral. Fortunately, artificial intelligence (AI) is being applied to solve this burgeoning challenge.
How Can Video Engagement Improve?
The standard metric for video engagement is how long someone spends watching a video. However, that is no longer a valid way to measure success, because it's predicated on the notions that an entire video is of value to the viewer and that there is unlimited time in our daily lives to consume all the relevant video content.
Clearly, this is not a realistic metric. Instead, enterprises should be focused on how quickly employees can locate the most relevant topics from video, extract the information, and apply it to their workflow, saving time and increasing productivity. We do this naturally for most other forms of content.
Consider articles, whitepapers, surveys, or research reports. For each of these, we quickly learn to find the valuable portions of the content and extract them for various purposes. That task is exponentially more difficult with long-form video. This is a real concern: research from Wundamail found employees were three times more likely to deliver on actions agreed in writing than on those agreed over video, because they failed to remember key information after ending a video call.
AI provides an opportunity to overcome this challenge by analyzing audio and visual cues to identify key moments in video, making it easier to index, search and recall only the important moments. Moreover, what's important to one viewer may not matter to another; each viewer evaluates moments differently, and an AI should understand these nuances and distinctions. Audio transcription has helped people navigate video, but it is devoid of context and is ideal only for note taking and following up on action items. Training a machine to understand video is incredibly complex, resting on an orchestration of triggers that is difficult even with current advancements in AI.
Many analyses must happen in tandem to surface important video moments. Visual and audio factors such as topics, speakers, speaker volume and talk-time, body language, animations, and visual aids are just some of the key ways machine learning can begin to identify important moments in video content.
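To make the idea of analyses happening in tandem concrete, here is a minimal sketch of how several weak per-segment signals might be blended into one importance score. The signal names, weights, and `Segment` structure are purely illustrative assumptions, not CLIPr's actual method; a production system would learn the weights from labeled data rather than hand-tune them.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A span of a recording with a few illustrative signals."""
    start_s: float
    end_s: float
    topic_shift: float      # 0-1: how strongly the topic changes here
    speaker_energy: float   # 0-1: normalized volume/emphasis
    visual_aid: bool        # a slide or screen share is visible
    reaction_count: int     # reactions from other participants

def importance(seg: Segment) -> float:
    """Blend several weak signals into a single importance score.

    The weights are hypothetical; in practice they would be learned.
    """
    return round(0.4 * seg.topic_shift
                 + 0.3 * seg.speaker_energy
                 + (0.2 if seg.visual_aid else 0.0)
                 + 0.1 * min(seg.reaction_count, 3) / 3, 3)

def top_moments(segments, k=3):
    """Return the k highest-scoring segments, most important first."""
    return sorted(segments, key=importance, reverse=True)[:k]
```

The point of the sketch is the shape of the problem: no single cue is decisive, so the model must combine many of them before it can rank moments at all.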
For now, AI can easily analyze key words and phrases, but a big bag of words doesn't help with actions or context. Within those larger buckets there are additional variables, such as terminology and the distinction between multiple voices (e.g., a meeting with many people using similar or dissimilar accents). Interestingly, for AI to understand the importance of a moment, it often must first analyze the response or reaction to it. For example, if a speaker makes an assertion and the response is “that's an excellent point,” the AI can flag that statement as important.
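The reaction-based idea above can be sketched in a few lines: scan transcript turns and, when a reply reads like a strong positive reaction, flag the preceding statement. The cue phrases below are hypothetical examples, and a real system would learn such cues rather than hard-code them.

```python
import re

# Hypothetical acknowledgment cues; a real system would learn these.
REACTION_PATTERNS = [
    r"\bexcellent point\b",
    r"\bgreat point\b",
    r"\bexactly\b",
]

def flag_important_statements(turns):
    """Given (speaker, text) transcript turns, return the indices of
    turns whose *following* turn reads like a strong positive
    reaction -- a crude proxy for importance."""
    important = []
    for i in range(len(turns) - 1):
        _, reply = turns[i + 1]
        if any(re.search(p, reply, re.IGNORECASE) for p in REACTION_PATTERNS):
            important.append(i)
    return important
```

Notice that the statement itself is never inspected; only the reaction to it is, which mirrors the article's observation that importance often lives in the response.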
Challenges with Indexing Enterprise Video
Much like the audio and video analysis process, the type of video content impacts the accuracy of AI. The easiest meeting structures to index are narrated solo or panel sessions, such as those you would find at a virtual conference. These typically have an established deck, a controlled flow, and scripted narrative and questions, which provide substantial “clues” to the machine learning model for indexing the content properly: names on screen, the speakers' faces, canned transitions from one subject to another, often only a few voices to distinguish, and sometimes a marker around participants' names when they speak.
Conversely, a free-form weekly sales call might involve many more people and fewer indicators, leading to greater indexing challenges: multiple people talking at the same time, fewer visual aids, and no set script or flow to the conversation. For these unstructured meetings, human intervention through supervised machine learning becomes necessary to maximize accuracy. A model can then be trained to detect unique mannerisms, cultural language, and differences in body language.
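A toy version of that human-in-the-loop step might look like the following: people label a handful of transcript segments as important or not, and the system learns which words tend to appear in the important ones. This frequency-difference scorer is an illustrative stand-in, not the supervised pipeline the article describes; a real system would use a proper classifier over far richer features.

```python
from collections import Counter

def train_keyword_weights(labeled_segments):
    """Learn word weights from human-labeled transcript segments.

    labeled_segments: list of (text, is_important) pairs.
    Returns word -> weight, where positive weights mark words that
    appear more often in segments a human flagged as important.
    """
    pos, neg = Counter(), Counter()
    for text, label in labeled_segments:
        (pos if label else neg).update(text.lower().split())
    return {w: pos[w] - neg[w] for w in set(pos) | set(neg)}

def score(text, weights):
    """Score an unseen segment with the learned weights."""
    return sum(weights.get(w, 0) for w in text.lower().split())
```

Even this toy shows why plug-and-play fails: the learned weights are specific to one team's vocabulary, so each work culture needs its own labeled examples.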
For verticals with unique jargon, AWS already offers AI stacks that understand vertical ontologies, such as medicine, which can help address this challenge. This need for training within each unique work culture is why an ML-based platform for video content is necessary and a one-off, plug-and-play solution won't cut it.
Even before the pandemic, we knew that video would continue to be a primary asset, with Cisco estimating that 82% of all internet traffic would be video by 2022. With so much of the world relying on video, harnessing AI to make it more easily searchable and actionable becomes one of the most important ways to improve productivity and success in the workplace.
The concept of video fatigue may at first appear to be an isolated pandemic-related phenomenon, but it is only the beginning of a larger problem if we can't find ways to manage video more effectively moving forward. Achieving this goal with refinement and full automation is a marathon, not a sprint, and the work needs to begin within the enterprise today.
About the author: Humphrey Chen is the CEO and Co-Founder of CLIPr, a video analysis and management (VAM) platform using AI and machine learning to help users quickly identify, organize, search, interact with, and share the important moments within video content. He is a corporatized entrepreneur who has bought, advised, and built start-ups across a multitude of technology-based industries throughout his career. Prior to CLIPr, Humphrey was Head of Key Initiatives for the Amazon Computer Vision APIs, Chief Product Officer at VidMob, and led the New Technologies division at Verizon Wireless during the launch of 4G LTE networks. Chen currently serves on the Board of Advisors for Noom, DialPad, GrayMeta, and VidMob. He has always had a passion for making new and meaningful things happen.