Automated Algorithmic Description Project

posted Dec 6, 2011, 1:34 PM by Ender Tekin

Putting Humans In the Loop

The Smith-Kettlewell Video Description Research and Development Center is working on a number of technologies aimed at improving access to online educational video. One of our approaches uses computer vision and digital signal processing to automatically infer important visual information about what is happening in a video. We call this the Automated Algorithmic Description (AAD) project. This effort builds on important prior research in this field, such as the “Audio-Visual Content Extraction & Interaction” project of the E-Inclusion Research Network.

Luckily, some of the most fundamental types of descriptive information lend themselves well to this automated approach. Using computer-vision techniques, we expect to be able to automatically detect scene changes, determine which actors are present, and read on-screen text. We may even be able to recognize certain types of human actions and specific places. By analyzing the soundtrack, we can distinguish between music, environmental sounds, and dialog, which provides important information for scene identification as well as for the appropriate placement of audio description.

The aim of the AAD project is to find ways to extract such information reliably. Once we have the information, it can be used in a number of different ways, including presenting it via refreshable Braille or synthetic speech directly to a blind viewer, or to assist a sighted narrator in the description process.

Our work is focusing on three key categories:

1. Scene categorization – “Where are we, and when does the scene change?”
2. Actor detection and identification – “Who is on-screen?”
3. On-screen text recognition – “What does the text on the screen say?”

The Active Learning Approach

When a person learns to do something new, he or she does so through an interactive process of instruction, trial and error, and feedback from others. It turns out that this is an effective way to teach computers to do things as well.

Most machine learning approaches require large amounts of example data provided in advance. The algorithm learns by comparing all of the examples in the database and trying to find features that let it predict which category a given example belongs to. This means that humans need to go through enormous numbers of video examples in advance and indicate what the computer should know about each clip. Building these large example databases is very time-consuming and laborious.

By contrast, Active Learning is a machine learning paradigm that does not require large amounts of up-front labeled data. It uses human feedback about critical examples to improve the accuracy of the overall algorithm. By querying a human only about some scenes, rather than requiring a human to provide all labels in advance, it significantly reduces the amount of labeling effort required from users.
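To make the idea concrete, here is a minimal sketch of one common Active Learning query strategy, uncertainty sampling, applied to a toy version of the "Is this a scene change?" question. It is an illustration under stated assumptions, not the AAD project's implementation: each candidate frame is assumed to be summarized by a single hypothetical "change score", the classifier is a simple threshold, and the names fit_threshold, most_uncertain, active_learn, and ask_human are invented for this example.

```python
def fit_threshold(labeled):
    """Fit a toy classifier: the midpoint between the highest-scoring
    'not a scene change' example and the lowest-scoring 'scene change' one.
    Assumes at least one example of each kind has been labeled."""
    negatives = [score for score, is_change in labeled if not is_change]
    positives = [score for score, is_change in labeled if is_change]
    return (max(negatives) + min(positives)) / 2.0


def most_uncertain(pool, threshold):
    """Uncertainty sampling: pick the unlabeled score closest to the
    current decision threshold, where the classifier is least sure."""
    return min(pool, key=lambda score: abs(score - threshold))


def active_learn(pool, seed_labels, ask_human, budget):
    """Ask the human about at most `budget` examples, always choosing
    the one the current classifier is least certain about."""
    labeled = list(seed_labels)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        query = most_uncertain(pool, threshold)
        pool.remove(query)
        labeled.append((query, ask_human(query)))  # "Is this a scene change?"
    return fit_threshold(labeled), len(labeled) - len(seed_labels)


if __name__ == "__main__":
    # Hypothetical data: 99 candidate frames with change scores 0.01..0.99.
    candidates = [i / 100 for i in range(1, 100)]
    # Two seed labels: one clear non-change and one clear change.
    seeds = [(0.0, False), (1.0, True)]

    # Stand-in for the human describer; in practice this would be a prompt.
    def simulated_describer(score):
        return score > 0.62

    threshold, questions = active_learn(candidates, seeds, simulated_describer, 10)
    print(f"learned threshold {threshold:.3f} after {questions} questions")
```

In this toy setting the learner locates the decision boundary after asking about only a handful of the 99 candidates, whereas the conventional approach would have a human label all of them up front. A real system would work with richer features and a probabilistic classifier, but the query loop has the same shape.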

The Active Learning approach will allow a human describer to supplement and inform the automated methods by responding to specific questions from the system, such as “Is this a scene change?” or “Is this region an example of on-screen text?” The user can then focus their effort on providing corrections and nudging the algorithm in the right direction, instead of laboriously hand-labeling every event of interest. This approach will enable our automated tools to continually improve their ability to accurately find these essential pieces of visual information. An example of this approach appears in the paper “Corrective Feedback and Persistent Learning for Information Extraction” by Culotta et al., and a good survey of the Active Learning literature can be found at this link.

We are now developing an interface that can allow the automated system to effectively communicate with its human teachers. We want to minimize the effort for the teachers, while maximizing the effect of every label that they provide or correct. The interface will present the automatically inferred results to the user in an intuitive way, while providing an efficient method for the user to correct errors and supply missing labels.

The active learning approach has the potential to significantly improve the efficiency with which automated algorithmic description techniques learn. We expect this to lead to reductions in effort and cost which will, in turn, lead to better access to educational video materials. Stay tuned for more information about this project in the coming months.