TIWO: update on progress

November 2004

Reports summarising the work done in each of the three main workpackages are now available from this website.  Also available are the MPhil-PhD transfer reports by Elia Tomadaki, Andrew Vassiliou and Yan Xu, which detail some of the research findings mentioned below.

May 2004

In TIWO we have developed models and algorithms for generating machine-executable representations of semantic video content from different kinds of text that describe the moving image.  Previous systems dealing with semantic video content have treated it as an inventory of events and existents, organised in space and time, but have not dealt with the narrative aspects of moving images.  Video retrieval systems have tended to use visual features, or information extracted from one kind of text, typically subtitles or closed captions.  We focussed on films, where dealing with semantic content involves modelling and generating representations of a film’s narrative, i.e. a sequence of events connected by cause-effect relationships, where the agents of cause and effect are often characters with mental states, goals, beliefs and desires.  Our approach is to extract and integrate information from different kinds of text associated with films, including film scripts, plot summaries and audio description.  Results will be applied to help audio description professionals and film viewers retrieve material from, and navigate, digital film libraries.  Progress has been made on three main challenges: cross-document co-reference; extraction of information about characters’ emotions; and novel kinds of video browsing.

Cross Document Co-reference

A first step in integrating information from different texts is to identify cross-document co-reference, i.e. fragments of different texts that refer to the same entity or event.  Most previous work has concentrated on information about entities extracted from different texts of the same type, e.g. news stories.  We are working on information about events in two very different text types: plot summaries (typically about 200 words long, referring to about 10 major events in a film) and audio description (typically about 5000-8000 words long, describing the on-screen action for the visually impaired).  Our method is to select keywords for each event in the first text (the plot summary) and run an IR-like search over the second text (the audio description).  Selecting and matching verbs directly is not possible: for example, a ‘murder’ event mentioned in a plot summary is described as a sequence of smaller actions in the audio description.  Selecting and matching the participants of events, together with their grammatical roles, achieves about 50-60% precision and recall.  Ongoing work concerns ‘query expansion’ of verbs, and we are evaluating existing schemes for event decomposition and knowledge representation for this task.
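
The participant-matching idea can be illustrated with a minimal sketch.  It is not the TIWO implementation: the `Event` structure, the function names, the character names and the three passages are all invented for illustration, and grammatical roles are ignored, so it shows only the general shape of an IR-like search by event participants and of a precision/recall calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A plot-summary event reduced to its participants (hypothetical structure)."""
    label: str
    participants: frozenset  # lower-cased participant names used as search keywords


def rank_passages(event, passages):
    """Rank passage indices by how many of the event's participants they mention."""
    scored = []
    for index, text in enumerate(passages):
        tokens = {word.strip(".,!?;:").lower() for word in text.split()}
        overlap = len(event.participants & tokens)
        if overlap:
            scored.append((overlap, index))
    # Highest overlap first; ties keep the original passage order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall over passage indices."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the verb "murder" never appears in the passages,
# so only the participants are used as keywords.
event = Event("murder", frozenset({"anna", "marco"}))
passages = [
    "Anna enters the kitchen.",
    "Marco raises a knife behind Anna.",
    "A car drives past the house.",
]
ranking = rank_passages(event, passages)
print(ranking)                           # [1, 0]
print(precision_recall(ranking, [1]))    # (0.5, 1.0)
```

In this toy case the passage naming both participants ranks first, while the passage naming only one is retrieved as a false positive, which is the kind of error that adding grammatical roles would be expected to reduce.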

Affect in Text Describing Films

We have found that one way to access information about a film’s narrative is to concentrate on characters’ emotional states.  A character’s emotional state can be considered as their reaction to events unfolding around them, and that reaction is determined by how they think those events affect their goals.  Information about characters’ emotional states can therefore reveal much about a film’s narrative, modelled as a sequence of events connected by cause-effect relationships.  We have developed a method for extracting information about characters’ emotions from time-aligned texts such as audio description and film scripts.  This information appears to be useful for video retrieval by story similarity, and for reasoning about a film’s narrative.
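
As a rough illustration of what extracting emotion information from time-aligned text could look like, the sketch below uses a simple word-list lookup.  This is an assumption made for the example, not a description of the project's method: the `EMOTION_WORDS` lexicon, the function name, the character names and the timed lines are all invented.

```python
# Hypothetical mini-lexicon mapping description words to emotion labels.
EMOTION_WORDS = {
    "smiles": "joy",
    "laughs": "joy",
    "sobs": "sadness",
    "trembles": "fear",
    "glares": "anger",
}


def extract_emotions(timed_lines, characters):
    """Return (time, character, emotion) records from time-aligned text.

    Each emotion word is attributed to the most recently named character
    in the same line, which is a deliberately naive heuristic.
    """
    records = []
    for seconds, text in timed_lines:
        subject = None
        for raw in text.split():
            token = raw.strip(".,!?;:").lower()
            if token in characters:
                subject = token
            elif token in EMOTION_WORDS and subject is not None:
                records.append((seconds, subject, EMOTION_WORDS[token]))
    return records


# Invented time-aligned lines (seconds from the start of the film).
timed_lines = [
    (12.0, "Anna smiles at the letter."),
    (95.5, "Marco trembles, then Anna sobs."),
    (130.0, "The door closes."),
]
for record in extract_emotions(timed_lines, {"anna", "marco"}):
    print(record)
```

Because every record carries a time code, the output can be read as a per-character emotion timeline, which is the kind of structure that comparison by story similarity would need.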

Video Browsing via Characters’ Affect States and Goals

One motivation for generating machine-executable representations of a film’s semantic content is to facilitate novel kinds of video browsing.  We are developing a video browsing system based on representations of characters’ affect states and goals.  At any point in the film, the user is shown key-frames from other scenes that are related to the current scene.  The system will be evaluated in terms of how well it helps users find answers to questions they have about a film, particularly of the kind ‘Why did X do Y?’.
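
One possible way to decide which scenes count as related is sketched below.  It assumes, purely for illustration, that each scene is annotated with a set of (character, affect-or-goal) pairs and that relatedness is measured by Jaccard overlap between those sets; the scene identifiers, annotations and function name are invented, and the browsing system itself may work quite differently.

```python
def related_scenes(current, scenes, limit=3):
    """Return up to `limit` scene ids sharing annotations with `current`.

    `scenes` maps a scene id to a set of (character, affect_or_goal) pairs.
    Scenes are ranked by Jaccard overlap with the current scene.
    """
    here = scenes[current]
    scored = []
    for scene_id, annotations in scenes.items():
        if scene_id == current:
            continue
        union = here | annotations
        score = len(here & annotations) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, scene_id))
    # Best overlap first; ties broken by scene id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [scene_id for _, scene_id in scored[:limit]]


# Invented annotations for four scenes.
scenes = {
    "s1": {("anna", "goal:escape"), ("anna", "fear")},
    "s2": {("anna", "goal:escape"), ("marco", "anger")},
    "s3": {("marco", "anger")},
    "s4": {("anna", "fear"), ("anna", "goal:escape"), ("marco", "anger")},
}
print(related_scenes("s1", scenes))  # ['s4', 's2']
```

A browser built on this idea would show the key-frames of the returned scenes alongside the current one, so that a viewer asking why a character acted as they did can jump to scenes where the same goal or emotion appears.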