
What Do Choreography and Orchestration Have to do with Video Description Technology?

posted Jan 10, 2012, 2:54 PM by Owen Edwards   [ updated Jan 10, 2012, 2:59 PM ]

The VDRDC’s COVA project aims to find new ways of getting well-synchronized video description for movies and television using smartphones or mobile devices. COVA stands for Choreographed and Orchestrated Video Annotation. Its purpose is to allow blind and visually-impaired viewers to have personal access to description from a mobile device without needing the projector, TV, or DVD player to have any special modifications or setup. Although some theaters have transmitters to broadcast description to special receivers, Choreographed and Orchestrated Video Annotation would allow any theater to provide description without the need for special equipment.

The dream of COVA will be made possible by a revolution taking place in the world of mainstream mobile entertainment technology. The TV industry is aggressively developing technology for second-screen viewing, where a viewer uses a laptop, smartphone, or tablet to look up or share information about what they are currently watching on TV (the “first screen”). Several companies are interested in automating this dual-screen experience, comparing it to the “GPS of TV” and asking: “How can your smartphone, iPad, or Android tablet automatically know what movie you are watching on TV, without you having to tell it?” Such a technology allows viewers to identify and “check in” to a TV show they’re viewing, much as Foursquare and Facebook allow “check-ins” at physical locations based on GPS. With a second-screen device that can automatically detect what the viewer is watching, the viewer can share their TV preferences and discover new shows through friends. It can also allow the viewer to receive programming information and special offers from broadcasters and advertisers.

These second-screen technologies use a variety of techniques for identifying what is being viewed, and could potentially identify how far into the show or movie you are. Yahoo!’s IntoNow focuses on identifying TV shows, using only audio captured from the program via the device’s microphone. VideoSurf has a similar app for identifying TV shows and online video, which uses both audio and video. Nielsen (of the Nielsen ratings) has Media-Sync, which embeds an inaudible watermark into the program’s audio at the time of broadcast.
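To make the idea of finding “how far into the show” concrete, here is a minimal sketch in Python. It does not describe how IntoNow, VideoSurf, or Media-Sync actually work. It assumes a hypothetical setup in which the program’s audio has been reduced to one coarse energy value per second, and it simply slides a short captured snippet along that reference to find the best match. The function name and the sample data are invented for illustration.

```python
def best_offset(reference, snippet):
    """Return the index in `reference` where `snippet` matches best.

    Both arguments are lists of numbers (hypothetical per-second audio
    energy values). The match score is the sum of squared differences,
    so a lower score means a closer match.
    """
    if not snippet or len(snippet) > len(reference):
        raise ValueError("snippet must be non-empty and no longer than reference")

    best_index = 0
    best_score = float("inf")
    for start in range(len(reference) - len(snippet) + 1):
        window = reference[start:start + len(snippet)]
        score = sum((r - s) ** 2 for r, s in zip(window, snippet))
        if score < best_score:
            best_index, best_score = start, score
    return best_index


# Hypothetical reference track: one energy value per second of the program.
reference = [0.1, 0.4, 0.9, 0.3, 0.2, 0.8, 0.7, 0.1, 0.5, 0.6]
# Snippet "heard" by the microphone, with a little noise added.
snippet = [0.82, 0.68, 0.12]

offset_seconds = best_offset(reference, snippet)
```

A real system would need a far more robust signature than raw energy values, since room noise, volume, and microphone quality all vary, but the matching step follows the same basic shape.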

Technologies like these could be used to provide highly customized description for individuals, using the smartphone or tablet as a second audio player rather than a second screen. The second-screen techniques can be used to identify video content and synchronize audio descriptions with TV shows, educational videos, and movies. The mobile device would obtain descriptions for the given content by integrating directly with the DVX server technology being developed here at Smith-Kettlewell.


This is the “choreographed” part of COVA – the mobile device plays back the descriptive audio based on its own detection and synchronization with the video source, much as a choreographed dancer follows the rhythm of a piece of music. We are also looking at orchestrated media, a term coined by BBC Research & Development in the UK. With orchestrated media, the set-top box, computer, DVD player, TV, or movie theater showing the content broadcasts additional synchronization information via infrared, Wi-Fi, or Bluetooth. This allows second-screen devices to accurately synchronize with the video playback, in the same way a conductor keeps tight time for an orchestra. Some consumer devices are already capable of broadcasting some kind of synchronization information. For example, 3D TVs transmit a synchronization signal to the 3D glasses to indicate whether a left-eye or right-eye frame is being displayed.

Unlike the choreographed approach, orchestrated media does require the TV set, computer, or projector to include special technology. Luckily, in part because of the second-screen revolution, these modifications are being driven by mainstream requirements. For example, the Nielsen Media-Sync technology that embeds the audio watermark at the broadcast station has nothing to do with accessibility. It will ostensibly provide value-added services for broadcasters and their advertisers. However, there is no reason why COVA could not simply piggyback on this existing technology without needing any special accessibility infrastructure.

Similarly, BBC R&D has been working on a technology termed Universal Control, which offers a rich wireless protocol for control and synchronization of media programming. Not only would it connect smartphones and tablets to TVs and set-top boxes as advanced remote controls, but it would also supply information about what you are watching, and the exact timing within the program, back to the controller. This would allow sophisticated second-device description with greater reliability, detection speed, and simplicity than choreographed, audio-based identification techniques.
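As a rough illustration of why explicit timing information makes synchronization simpler, here is a small Python sketch. It is not the Universal Control protocol, whose details are not covered here. The `SyncMessage` fields and the `current_position` helper are hypothetical, and the sketch assumes the device somehow knows the difference between its own clock and the sender’s.

```python
from dataclasses import dataclass


@dataclass
class SyncMessage:
    """Hypothetical timing message sent by the set-top box."""
    program_id: str        # what is being watched
    media_time: float      # seconds into the program when the message was sent
    sent_at: float         # sender's clock reading, in seconds
    playing: bool = True   # False while playback is paused


def current_position(msg, now, clock_offset=0.0):
    """Estimate the playback position at local time `now`.

    `clock_offset` is the assumed difference (local clock minus sender
    clock), which a real system would need to measure.
    """
    if not msg.playing:
        return msg.media_time
    elapsed = (now - clock_offset) - msg.sent_at
    return msg.media_time + max(elapsed, 0.0)


msg = SyncMessage(program_id="example-episode", media_time=120.0, sent_at=1000.0)
position = current_position(msg, now=1002.5)
```

With a message like this, the second device never has to listen to the program at all: it only has to add the time elapsed since the message to the reported position.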

Again, the beauty of this approach is that it is being driven by mainstream forces, so it is likely to be broadly available. A recent demo of Universal Control by BBC R&D featured a toy “Dalek” robot being activated wirelessly by a signal from a set-top box showing an episode of the British science fiction show “Doctor Who.” The toy moves around whenever a “Dalek” robot is moving on the screen. This was a powerful demonstration of how external devices could easily synchronize with video information, and is exactly the kind of orchestrated system that the VDRDC is interested in building on. (Amusingly, the demo gained wide attention in the British press, which was under the mistaken impression that the BBC would soon start marketing these toys!)


A significant advantage of the Orchestrated Media system over our basic DVX player (where the player itself adds the descriptive audio to the video presentation) is that it allows multiple blind or visually-impaired viewers to select their viewing experience according to their individual needs. A blind user might listen to a descriptive audio track that contains a lot of description of characters, locations, and action, while a visually-impaired viewer might listen to a track that just adds information about subtle facial expressions and action in very dark scenes. One of the greatest strengths of “smart” devices is the ability to customize the experience to each user’s specific needs, which is the very essence of accessibility.
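The per-viewer customization described above could be sketched as follows in Python. This is only an illustration under assumed names: the `Cue` structure, the category labels, and the sample descriptions are all hypothetical, and nothing here reflects the actual DVX data format.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float      # seconds into the program
    category: str     # e.g. "action", "location", "expression", "dark_scene"
    text: str         # the description to be spoken


def cues_for_viewer(cues, wanted_categories):
    """Keep only the cues whose category the viewer asked for, in time order."""
    selected = [c for c in cues if c.category in wanted_categories]
    return sorted(selected, key=lambda c: c.start)


cues = [
    Cue(12.0, "location", "A crowded train station."),
    Cue(15.5, "expression", "She hides a smile."),
    Cue(20.0, "action", "He runs for the train."),
    Cue(31.0, "dark_scene", "In the dark, a figure opens the door."),
]

# Two hypothetical viewer profiles, matching the example above.
full_description = {"location", "expression", "action", "dark_scene"}
low_vision = {"expression", "dark_scene"}

blind_viewer_cues = cues_for_viewer(cues, full_description)
low_vision_cues = cues_for_viewer(cues, low_vision)
```

Because the filtering happens on each viewer’s own device, two people watching the same screen can hear different descriptions at the same time.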

 Using mobile technologies with minimal infrastructure, COVA will open access to video content in a broad variety of settings, while allowing the consumption of description to be customized to each individual.
