Television In Words (TIWO) Case for Support
The idea of collateral text is emerging as an important consideration for the development of multimedia information systems and it raises interesting questions for artificial intelligence about the link between modes such as vision and language. In particular, text which is collateral to moving images may be processed to give representations of semantic video content. Television in Words (TIWO) looks at a novel kind of collateral text, audio descriptions produced by trained professionals, which provide a spoken account of on-screen scenes, actions, gestures, body language, facial expressions and cinematic techniques to enhance the appreciation of television programmes for visually impaired people: the story told by the moving images is retold in words. The overall aim of the project is to develop a computational account of narrative in multimedia systems. An exploration of how moving images can be put into words will synthesise techniques for language engineering (information extraction and corpus-based text analysis), video data modelling and knowledge representation. A system will be specified and prototyped with three main functions: to assist in the preparation of audio description scripts; to customise audio description scripts for different audiences; and, to process audio description scripts into annotations for video retrieval.
Developments during the last decade mean that vast quantities of text, speech, image and video data can be delivered to users across the globe. The challenge of giving effective access to such heterogeneous data in digital libraries has been labelled intelligent multimedia information retrieval - a multidisciplinary area including artificial intelligence and multimedia systems (Maybury 1997). In this context the idea of collateral text is emerging as an important consideration for the development of multimedia information systems and it raises interesting questions for artificial intelligence about the link between modes such as vision and language. Any text accompanying a still or moving image, sound, or even another text, may be considered collateral: the important point is that it is possible to extract information from the collateral text which cannot be extracted from the original source. Consider for example, a caption next to a painting in a gallery or a meteorologist's explanation of a satellite weather sequence: in these cases the text elucidates the visual information for non-expert viewers (which could include machines). In the case of moving images collateral text may be processed into machine-executable representations of semantic video content.
TIWO is a prototype R&D project which will synthesise techniques for language engineering (specifically information extraction and corpus-based text analysis), video data modelling and knowledge representation to develop a computational account of narrative in multimedia systems. The project will look at a novel kind of collateral text, audio descriptions, which provide a spoken account of on-screen scenes, actions, gestures, body language, facial expressions and cinematic techniques to enhance the appreciation of television programmes for visually impaired people. Legislation means that audio descriptions are becoming increasingly available in the
Story telling and story understanding are studied by scholars of narrative who are interested in the processes by which a series of events is recounted or comprehended such that connections are established between the events and that preceding events set a context for the understanding of those that follow. Narrative is a broad term used in the discussion of scientific texts (Bazerman 1988), in literary criticism (Ricoeur 1991) and in the history and philosophy of science literature which deals with the development of science as the narrative of discovery (Kuhn 1996). For some, narrative abilities, whether considered as a mode of thought or of discourse, are fundamental to intelligence, such that "we organize our experience and our memory of human happenings mainly in the form of narrative"; so it might follow that "[it is not sufficient] to equate representations with images, with propositions, with lexical networks, or … [with] sentences" (Bruner 1991). In artificial intelligence Roger Schank and colleagues have given an account of story understanding in terms of dynamic representations which integrate sentences/events in larger structures such as scripts, plans, goals and themes (Schank and Abelson 1977, Schank and Riesbeck 1981). More recently these ideas have featured in discussions of human memory and learning which emphasised the importance of indexing and retrieving old stories in order to make sense of new situations (Schank 1990, Schank, Abelson and Wyer 1995). Much work on story understanding in AI focuses on short, contrived texts like newspaper stories and weather reports, and successful systems have been reported, e.g. at the series of Message Understanding Conferences. However, more elaborate story telling can occur in various modes and it is interesting to consider how the 'same' narrative may be understood in a moving image and in spoken or written language, for example.
As part of an investigation into narrative in multimedia systems a system will be specified and prototyped with three main functions: to assist in the preparation of audio description scripts; to customise audio description scripts for different audiences; and, to process audio description scripts into annotations for video retrieval. TIWO will review research in video data modelling and the use of knowledge representation schemes to deal with the semantic content of video data, as well as considering international standardisation efforts like MPEG-7. In order to process audio description scripts TIWO will review research in language engineering, paying particular attention to the methods and systems developed and made available by major R&D groups in the UK (Sheffield's GATE system), the EU (DFKI's Language Technology lab) and the USA (MITRE's Alembic system). Research will be grounded by interactions with an industry-based Round Table, which has already been formed, comprising representatives from the BBC, RNIB and SMEs in the television and software industries.
Audio descriptions enhance the enjoyment of most kinds of television programme for visually impaired viewers, including dramas, situation comedies, soap operas, documentaries and movies. In the gaps between existing speech they give key information about scenes, people's appearances, on-screen actions, etc. The 1996 Broadcasting Act requires digital terrestrial broadcasters to provide an increasing amount of audio-described programmes (currently up to 10% of their output). Furthermore, there is considerable potential for audio descriptions to be used by the whole television audience; e.g. to 'watch' television on audio cassettes or on WAP devices with little or no visual display. It may take 60 hours and many viewings to produce descriptions for a 2-hour film whereas a 30-minute soap opera which is almost full of dialogue and has familiar scenes may take only 90 minutes (for an overview of audio descriptions in television as well as other entertainment and leisure activities, see the RNIB's Talking Images project at www.rnib.org.uk/talkingimages; the ITC Guidelines on Standards for Audio Description can be found at www.itc.org.uk/divisions/eng_div/subtitle/Audio_Description). Current computer software for audio description presents video data on screen alongside a text box in which the describer can mark the intervals where there is no speech in the programme and type their descriptions. The software then allows them to record their descriptions which are synchronised with the video data.
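The marking of intervals where there is no speech can be illustrated with a short sketch. It is not taken from any existing audio description package: the Interval class, the find_description_gaps function and the two-second minimum gap are all assumptions made for the example, and the speech timings would in practice come from the describer or from a speech detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A time interval in seconds from the start of the programme (hypothetical type)."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_description_gaps(speech, programme_length, min_gap=2.0):
    """Return the speech-free intervals lasting at least `min_gap` seconds.

    `min_gap` is an assumed threshold, not a figure from any guideline.
    """
    gaps = []
    cursor = 0.0  # end time of the latest speech seen so far
    for segment in sorted(speech, key=lambda s: s.start):
        if segment.start - cursor >= min_gap:
            gaps.append(Interval(cursor, segment.start))
        cursor = max(cursor, segment.end)  # tolerate overlapping speech segments
    if programme_length - cursor >= min_gap:
        gaps.append(Interval(cursor, programme_length))
    return gaps


# Invented timings for a 45-second clip, for illustration only.
dialogue = [Interval(3.0, 10.0), Interval(11.0, 20.0), Interval(27.5, 40.0)]
for gap in find_description_gaps(dialogue, programme_length=45.0):
    print(f"{gap.start:.1f}-{gap.end:.1f} s ({gap.duration:.1f} s available)")
```

The one-second pause between the first two speech segments is skipped because it falls below the assumed threshold; a real tool would presumably let the describer adjust this.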
The preparation of audio descriptions requires not only an understanding of the programme being viewed but also an appreciation of the audience's expectations so that the story conveyed by the moving images is retold in words that interact with the existing dialogue and sound effects. If well managed the description scripts could be customised for individual viewers and could be used to index semantic video content. Software to assist in the preparation and management of audio descriptions may need to be grounded in a theoretical understanding of narrative, i.e. story telling, story understanding and the interactions between media and modalities. The task of processing audio descriptions is constrained to some extent by the fact that audio describers follow guidelines which restrict the language that they normally use. Preliminary observations suggest a predominance of declarative sentences in the present tense with few pronominal references; there is some resonance here with the ideas of 'local grammars' and 'controlled languages'.
Video Annotation: the role of collateral text
Throughout the history of moving images it has been commonplace for them to be accompanied by collateral texts (The term collateral text was introduced by Srihari in reference to captions accompanying newspaper photographs: she showed how information from a collateral text could contribute to image understanding (1995a)). Consider the credits and the screenplay of a movie; TV and film listings and reviews in newspapers; magazine features on the work of film-makers; and, the sleevenotes of video cassettes and DVDs. Between them, these texts may explicate information about the moving images that could not be obtained from an analysis of video data alone, e.g. a summary of the action, a genre-based classification, the appearance of the actors, the director's artistic intentions, etc. The latest kind of text to accompany moving images - audio description - appears to be particularly informative about the semantic content of video data, not only describing separate entities and actions, but also giving information that allows for the full interpretation (by a human at least) of a narrative, e.g. the connections between events.
For some kinds of video data, e.g. news, documentaries and sports programmes, it is possible to assume a degree of synchrony between the spoken words of presenters (or the scripts) and the moving images to which they may be referring. This synchrony has been exploited by systems that combine speech recognition and information retrieval technologies for video retrieval (Wactlar et al 1999, Jones et al 1997, Netter 1998); visual and textual features may be combined to classify video intervals (Satoh, Nakamura and Kanade 1999, Wachman and Picard 2001); others have segmented collateral text to segment video sequences (Mani et al 1997, Takeshita, Inoue and Tanaka 1997); text visible in video sequences has been processed for indexing-retrieval purposes using OCR (Lienhart and Effelsberg 2000). Systems have also been developed to exploit 'external' collateral text for indexing videos, e.g. to extract keywords from the HTML tags of links to video data files (Smith and Chang 1997) and to parse the production notes kept by the makers of wildlife documentary programmes (Kim and Shibata 1996). Generally these video annotation systems process text that has a relatively weak association with the moving image in order to label video intervals with keywords.
There is a need to develop systems for extracting features from external collateral text with a closer link to the moving image, like audio descriptions, so that richer representations of video content can be generated, e.g. to capture connections between events and between characters. Research at Surrey has shown the use of diverse types of collateral text to extend video annotation beyond attaching keywords to intervals; e.g. for dance sequences, the texts included spoken descriptions and interpretations of the sequences, newspaper reviews of dances, dance studies textbooks and journal articles (Salway and Ahmad 1998, Salway 1999).
Representing Semantic Video Content
Video data is spatially and temporally continuous. The stratification data model allows annotations to be attached at different temporal granularities (Davenport, Aguierre Smith and Pincever 1991) and has been extended so that a hierarchical organisation of intervals can be manipulated (Weiss, Duda and Gifford 1995). In the OVID system attributes can be assigned to any interval in the video data and inheritance of attributes by interval inclusion and other temporal operators is facilitated (Oomoto and Tanaka 1993). The MPEG-4 standard suggests the organisation of video data as audio-visual objects rather than as intervals so that explicit reference can be made to actors, objects and the background scene. Data models have been developed for specific kinds of video data, e.g. a data model for movies which structures video data in terms of shots and scenes (Corridoni et al 1996). Furthermore, for systems to process collateral text it is necessary to model links between video segments and text segments (Jiang, Montesi and Elmagarmid 1999; Salway 1999).
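The idea of attaching attributes to arbitrary intervals and inheriting them by interval inclusion can be sketched as follows. This is a simplified illustration and not the OVID implementation: the AnnotatedInterval class, the inherited_attributes function, the frame numbers and the rule that narrower intervals override wider ones are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedInterval:
    """A stretch of video, identified by frame numbers, with free-form attributes."""
    start: int  # first frame, inclusive
    end: int    # last frame, inclusive
    attributes: dict = field(default_factory=dict)

    def includes(self, other):
        """True if `other` lies entirely within this interval."""
        return self.start <= other.start and other.end <= self.end


def inherited_attributes(target, annotations):
    """Merge the attributes of every annotated interval that includes `target`.

    Assumed policy: wider intervals are applied first, so that values on
    narrower (more specific) intervals override them.
    """
    enclosing = [a for a in annotations if a.includes(target)]
    merged = {}
    for annotation in sorted(enclosing, key=lambda a: a.end - a.start, reverse=True):
        merged.update(annotation.attributes)
    return merged


# Invented annotations: a scene, a shot within it, and an unrelated later scene.
scene = AnnotatedInterval(0, 500, {"location": "kitchen", "mood": "calm"})
shot = AnnotatedInterval(100, 200, {"character": "Anna", "mood": "tense"})
later = AnnotatedInterval(600, 900, {"location": "garden"})

query = AnnotatedInterval(120, 150)
print(inherited_attributes(query, [scene, shot, later]))
```

Under these assumptions the queried interval picks up the location from the scene and the character and mood from the shot, which is the kind of behaviour that would let a description attached to a scene also answer queries about the shots inside it.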
Once video data has been appropriately structured, annotation may be completed by the labelling of intervals and regions with keywords. To support some kinds of queries however, it may be necessary to represent the relationships between objects, people and actions, be it in propositions or in more complex knowledge structures. A number of researchers have proposed the use of knowledge representation formalisms from AI to deal with the semantic content of video data, for example: Sowa's conceptual graphs and Schank's scripts (Parkes 1989, Nack and Parkes 1997, Hartley and Parkes 2000); Schank's conceptual dependency graphs and Rumelhart's story grammars (Tanaka, Ariki and Uehara 1999); semantic networks for causal relationships (Roth 1999); and, frame-based representations (Davis 1995). These formalisms, along with current video data models, are promising for dealing with video content but little has been said about how their instantiation could be automated, e.g. by processing collateral text.
Narrative in Moving Images and in Texts: a link between vision and language?
The integration of vision and language is of interest to researchers in artificial intelligence who address the 'correspondence problem' of 'how to correlate visual information with words' (Srihari 1995b): importantly it is not just single words that are to be correlated with still and moving images but phrases, sentences and entire texts. There is also interest in how different modalities can combine in computer-based communication (McKeown et al 1998). A computational account of how moving images are put into words, particularly how a narrative can be conveyed in different modes, may enjoy a symbiotic relationship with earlier studies in cognitive psychology and linguistics. The audio description task of putting images into words is reminiscent of investigations in cognitive psychology that used verbal reporting of images to understand cognitive processes (Ericsson and Simon 1993) and of studies in language production where subject groups spoke narratives of a film they had just seen (Chafe 1980). These studies suggest ways in which humans organise sequences of events and they help to explicate the linguistic realisation of humans' narratives.
Aims and Objectives
The overall aims of TIWO are to:
Research will comprise two strands investigating video representation and collateral text, specifically audio descriptions. Results from these strands, i.e. video data models, knowledge representation schemes and language engineering techniques, will contribute to the specification and prototyping of an Audio Description system (AuDesc) for the preparation and management of audio descriptions. The system will aim to maintain the quality of descriptions whilst making their preparation more efficient, e.g. with a style-checker based on ITC guidelines. The project will also explore how descriptions can be customised for different kinds of audiences (young/old, those with/without visual memory, those wanting more/less interpretation and foreign audiences) and evaluate the use of audio descriptions for video annotation. The research efforts will be grounded throughout the project by regular meetings with the industry-based Round Table.
Video Representation
Digital TV/film production (including the production of audio descriptions)
and the management of digital archives both require high-level representations
of material content to facilitate intuitive manipulation and retrieval of
material. Research in TIWO will attempt to synthesise
current video data models and knowledge representation formalisms to: (i) model the link between moving images and audio
descriptions; (ii) represent the semantics of moving images beyond objects and
actions, e.g. to include narrative structure; (iii) enable dynamic
representations that can be built incrementally. The approach will be to adapt,
apply and evaluate existing data models and knowledge representation formalisms
in the prototype AuDesc system alongside a series of
domain modelling exercises with the Round Table.
These exercises will include brainstorming sessions and interviews with
professional audio describers to model a range of TV and film genres. This work
will benefit from experience at
Collateral Text
Audio descriptions aim to convey the visual information in a moving image so that the human audience can reconstruct the narrative told by the moving image. For a machine to process this kind of collateral text requires an understanding of how humans put moving images into words and requires the adaptation and application of language engineering techniques to extract computationally tractable representations of video content from the collateral text. Research in TIWO will investigate how humans put moving images into words following two methods. Firstly, a corpus of extant audio descriptions will be collected and analysed following standard methods of corpus-based linguistics in order to characterise the language used by audio describers at lexical, morphological, syntactic and semantic levels (Ahmad and Rogers 2001). Secondly, descriptions will be elicited from Round Table members following the method of verbal reporting (Ericsson and Simon 1993); the focus of these descriptions will be controlled by the instructions given to the describers, e.g. which aspects of the moving image to focus on, or whether to give more or less interpretation of what they see. Results from these investigations will feed into the process of adapting and applying language engineering technology to process audio descriptions in the AuDesc prototype. This work will use software made available by international research groups and will build on experience at Surrey of verbal reporting, corpus building and text analysis gained in several EPSRC and EU projects.
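One common step in corpus-based lexical analysis is to compare how often a word occurs in a specialist corpus with how often it occurs in a general-language reference corpus. The sketch below shows that comparison in its simplest form; it is not a description of any particular system used at Surrey, the add-one smoothing is an assumption made to avoid division by zero, and the two miniature texts are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_ratios(specialist_text, reference_text, smoothing=1.0):
    """Rank specialist-corpus words by relative frequency against a reference corpus.

    ratio(w) = (f_spec(w) / N_spec) / ((f_ref(w) + smoothing) / (N_ref + smoothing))
    The smoothing term is an assumption, so unseen reference words stay finite.
    """
    spec_counts = Counter(tokenize(specialist_text))
    ref_counts = Counter(tokenize(reference_text))
    spec_total = sum(spec_counts.values())
    ref_total = sum(ref_counts.values())
    ratios = {}
    for word, count in spec_counts.items():
        spec_rel = count / spec_total
        ref_rel = (ref_counts[word] + smoothing) / (ref_total + smoothing)
        ratios[word] = spec_rel / ref_rel
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented miniature corpora, for illustration only.
descriptions = "she walks she turns she smiles"
general_language = "the cat sat on the mat and she left"
for word, ratio in frequency_ratios(descriptions, general_language):
    print(f"{word}: {ratio:.2f}")
```

Words with a high ratio are relatively over-represented in the specialist corpus; with a realistically sized corpus of audio description scripts, such a ranking could give a first indication of the vocabulary that is characteristic of describers' language.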
Detailed Research Plan
The project will be organised into five workpackages: each workpackage
has associated milestones against which performance can be measured to ensure
that the stated objectives are delivered on time. The work is to be carried out
by a postgraduate project student with five hours a week input from the PI who
will be responsible for management of the project. (A diagrammatic workplan follows).
WP1: Adapt and Apply Video Data Modelling and Knowledge Representation - 6 person months (months 1-10)
A review of video content representation schemes will consider standards such
as MPEG-4, MPEG-7 and the use of knowledge representation formalisms from
artificial intelligence, especially Schank's scripts,
plans and goals. Parallel to this will be a review of video data models that
structure video data in terms of intervals, objects and the spatial, temporal
and other relationships between them; including data models that link video
data with text data. Domain modelling exercises to
adapt the knowledge representation formalisms and video data models will take
place during the two Round Table meetings in the period, and during meetings
with individual Round Table members at their organisations.
Milestone 1 - month 10: Data model / knowledge representation for video annotation with audio descriptions
WP2: Adapt and Apply Language Engineering Techniques - 9 person months (months 4-24)
A corpus of video data and associated audio description scripts will be
gathered. The corpus will include extant descriptions and descriptions elicited
from Round Table members in controlled situations following a verbal reporting
method. The corpus will be analysed using existing
software systems to address the question of how the narrative of a moving image
is manifested in text and to adapt language engineering techniques to process
audio descriptions, e.g. for style checking; for customisation
/ personalisation; and, for generating video
annotations. In particular Surrey's System Quirk will be used to measure
linguistic variance in the corpus and
Milestone 2a - month 11: Corpus of digital video data and audio descriptions
Milestone 2b - month 18: Analysis of corpus
Milestone 2c - month 24: Suite of language engineering techniques to process audio description scripts
WP3: Audio Description System (AuDesc) - 12 person months (months 7-30)
An audio description system will be specified,
designed and prototyped following standard software engineering techniques. It
will incorporate the deliverables of WP1 and WP2: continuous prototyping will
facilitate quick evaluation of the results from these workpackages
with feedback from the Round Table. Requirements gathering, testing and
evaluation will be carried out in Round Table meetings and during visits to
members. The emphasis will be on the reuse of existing systems, including
systems developed at Surrey that integrate visual and textual data, and
commercial multimedia database systems like Informix. System development will
also take account of existing software products used for audio description,
like Softel's ADePT, and
off-the-shelf speech recognition and video analysis packages.
Milestone 3a - month 12: AuDesc system specification
Milestone 3b - month 27: Delivery of prototype AuDesc system to Round Table
Milestone 3c - month 30: Evaluation of prototype AuDesc
WP4: Audio Description Round Table - 1 person month (months 1-36)
The Round Table of professional audio describers and software developers will
meet at six-month intervals throughout the project; there will also be
individual sessions with members at their organisations.
Meetings will be used for domain modelling,
requirements specification and system evaluation. Members will be kept informed
of project developments through a TIWO WWW-page.
Milestone 4a: Round Table meetings at six-month intervals throughout the project, starting in the first month
Milestone 4b: TIWO project WWW-page from the start of the project
WP5: Write up of PhD Dissertation - 8 person months (months 11-12; months 31-36)
The project has been organised to fit with the major
events of a PhD, e.g. the system specification coincides with the transfer
report and all work is scheduled to finish with six months left for writing-up.
Milestone 5a - month 12: MPhil-PhD transfer report
Milestone 5b - month 36: PhD Dissertation
Timeliness and Novelty
The development of digital multimedia systems requires an understanding of the
relationships between different kinds of media, whether it is for generating
annotations from collateral text or for the more general task of information
conversion - for example, taking information intended for visual display and
conveying it through an audio channel. There is also an increasing focus in
multimedia systems research on high-level representations of media content, as
exemplified by international standardisation efforts
like MPEG-7. TIWO is timely in both these respects since it addresses the link
between moving images and textual descriptions and it is concerned with the
high-level representation of video content.
TIWO's originality stems from its focus on the audio description task and its synthesis of video data modelling, knowledge representation and language engineering. Audio descriptions have not been considered as collateral text before. The nature of audio descriptions means that it is reasonable to use them as the basis for exploring narrative features of the moving image (not possible with previously considered text types) and for doing this with a range of programme/film genres (most research to date has considered either news, sports or documentary programmes).
Dissemination and Exploitation
Research results will be disseminated through the proceedings of international conferences and journals in the fields of artificial intelligence and multimedia systems. The TIWO Round Table will enable first-hand technology transfer to the British television industry and the software industry which supports it: the involvement of the RNIB with its international connections will help to spread the project further afield. A WWW-site will report on project progress and will include a system demonstration. The social interest of the project and its inherent connections to the media should help with public awareness.
Ahmad and Rogers (2001). Khurshid Ahmad and Margaret Rogers, 'The Analysis of Text Corpora for the Creation of Advanced Terminology Databases.' In: Wright, S.E. and Budin, G., The Handbook of Terminology Management.
Ahmad and Salway (1996). Khurshid Ahmad and Andrew Salway, 'The Terminology of Safety.' In: Proc. of 4th International Congress on Terminology and Knowledge Engineering,
Ahmad and Salway (1997). Khurshid Ahmad and Andrew Salway, 'Safety and its Signification: A case for a language of safety.' Fachsprache - the International Journal of LSP 19 (3-4), pp. 94-110.
Ahmad, Salway and Adshead-Lansdale (1998). Khurshid Ahmad, Andrew Salway and Janet Adshead-Lansdale, '(An)notating Dance: multimedia storage and retrieval.' In: Henry Selvaraj and Brijesh Verma (eds.), ICCIMA '98 - Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, Victoria, Australia, 9-11 February 1998, pp. 788-793.
Bazerman (1988). Charles Bazerman, Shaping Written Knowledge: the genre and activity of the experimental article in science.
Bruner (1991). Jerome Bruner, 'The Narrative Construction of Reality.' Critical Inquiry 18, pp. 1-21.
Chafe (1980). Wallace Chafe (ed.), The Pear Stories: cognitive, cultural and linguistic aspects of narrative production. Ablex:
Corridoni et al. (1996). Jacopo M. Corridoni, Alberto Del Bimbo, Dario Lucarella and He Wenxue, 'Multi-perspective Navigation of Movies.' Journal of Visual Languages and Computing 7, pp. 445-466.
Ericsson and Simon (1993). K. A. Ericsson and H. A. Simon, Protocol Analysis: verbal reports as data. MIT Press:
Hartley and Parkes (2000). E. Hartley and A. P. Parkes, 'MPEG-7: relevance and application to broadcasting.' SMPTE Journal 109 (7), pp. 559-563.
Jiang, Montesi and Elmagarmid (1999). H. T. Jiang, D. Montesi and A. K. Elmagarmid, 'Integrated Video and Text for Content-Based Access to Video Databases.' Multimedia Tools and Applications 9 (3), pp. 227-249.
Jones et al. (1997). Gareth Jones, Jonathan Foote, Karen Sparck Jones and Steve J. Young, 'The Video Mail Retrieval Project: experiences in retrieving spoken documents.' In: Maybury, pp.191-214.
Kim and Shibata (1996). Yeun-Bae Kim and Masahiro Shibata, 'Content-Based Video Indexing and Retrieval - A Natural Language Approach.' IEICE Transactions on Information and Systems E79-D (6), pp. 695-705.
Kuhn (1996). Thomas S. Kuhn, The Structure of Scientific Revolutions. 3rd Edition,
Lienhart and Effelsberg (2000). R. Lienhart and
Mani et al. (1997). Inderjeet Mani, David House, Mark T. Maybury and Morgan Green, 'Towards Content-Based Browsing of Broadcast News Video.' In: Maybury, pp. 241-258.
Maybury (1997). Mark T. Maybury, Intelligent Multimedia Information Retrieval.
McKeown et al. (1998). Kathleen R. McKeown, Steven K. Feiner, Mukesh Dalal and Shih-Fu Chang, 'Generating multimedia briefings: coordinating language and illustration.' Artificial Intelligence 103, pp. 95-116.
Nack and Parkes (1997). Frank Nack and Alan Parkes, 'Toward the Automated Editing of Theme-Oriented Video Sequences.' Applied Artificial Intelligence 11, pp. 331-366.
Netter (1998). Klaus Netter, 'POP-EYE and OLIVE - Human Language as the Medium for Cross-lingual Multimedia Information Retrieval.' The ELRA Newsletter (European Language Resources Association) November 1998, pp. 5-6.
Oomoto and Tanaka (1993). Eitetsu Oomoto and Katsumi Tanaka, 'OVID: Design and Implementation of a Video-Object Database System.' IEEE Transactions on Knowledge and Data Engineering 5 (4), pp. 629-643.
Parkes (1989). Alan P. Parkes, 'The Prototype CLORIS System: Describing, Retrieving and Discussing Videodisc Stills and Sequences.' Information Processing and Management 25 (2), pp. 171-186.
Ricoeur (1991). Paul Ricoeur, A Ricoeur Reader.
Roth (1999). Volker Roth, 'Content-based retrieval from digital video.' Image and Vision Computing 17, pp. 531-540.
Salway (1999). Andrew Salway, 'Video Annotation: the role of specialist text', PhD thesis, Department of Computing,
Salway and Ahmad (1998). Andrew Salway and Khurshid Ahmad, 'Talking Pictures: Indexing and Representing Video with Collateral Texts'. In: Procs. 14th Twente Workshop on Language Technology - Language Technology for Multimedia Information Retrieval, pp. 85-94.
Salway and Ahmad (1999). Andrew Salway and Khurshid Ahmad, 'Multimedia Systems and Semiotics: Collateral Texts for Video Annotation'. In: IEE Colloquium Digest, Multimedia Databases and MPEG-7, 29 Jan. 1999,
Salway and Ahmad (2000). Andrew Salway and Khurshid Ahmad, 'Computational Semiotics: a framework for integrating multimedia information?' In: Holmqvist, Kuhnlein and Rieser (eds.), Integrating Information from Different Channels in Multi-Media Contexts. Workshop at European Summer School on Language, Logic and Information - ESSLII 2000.
Salway, Ahmad and Collingham (1996). Andrew Salway, Khurshid Ahmad and Steven Collingham, 'Safe-DIS: Computer Programs for the Safe Design and Analysis of Urban Drainage Networks.' In: Proc. 7th International Conference of Urban Storm Drainage,
Satoh, Nakamura and Kanade (1999). S. Satoh, Y. Nakamura and T. Kanade, 'Name-it: Naming and detecting faces in news videos.' IEEE Multimedia 6 (1), pp. 22-35.
Schank (1990). Roger Schank, Tell me a Story: narrative and intelligence. Northwestern University Press.
Schank and Abelson (1977). Roger C. Schank and Robert P. Abelson, Scripts, Plans, Goals and Understanding: an inquiry into human knowledge structures.
Schank, Abelson and Wyer (1995). Roger C. Schank, Robert P. Abelson and Robert S. Wyer (eds), Knowledge and Memory.
Schank and Riesbeck (1981). Roger C. Schank and Christopher K. Riesbeck, Inside Computer Understanding: five programs plus miniatures.
Smith and Chang (1997). John R. Smith and Shih-Fu Chang, 'Visually Searching the Web for Content.' IEEE Multimedia July-September 1997, pp. 12-20.
Srihari (1995a). Rohini K. Srihari, 'Use of Captions and Other Collateral Text in Understanding Photographs'. Artificial Intelligence Review 8 (5-6), pp. 409-430.
Srihari (1995b). Rohini K. Srihari, 'Computational Models for Integrating Linguistic and Visual Information: A Survey.' Artificial Intelligence Review 8 (5-6), pp. 349-369.
Takeshita, Inoue and Tanaka (1997). Atsushi Takeshita, Takafumi Inoue and Kazuo Tanaka, 'Topic-based Multimedia Structuring.' In: Maybury, pp. 259-277.
Tanaka, Ariki and Uehara (1999). Katsumi Tanaka, Yasuo Ariki and Kuniaki Uehara, 'Organization and Retrieval of Video Data.' IEICE Transactions on Information and Systems E82-D (1), pp. 34-44.
Wachman and Picard (2001). J. S. Wachman and R. W. Picard, 'Tools for Browsing a TV Situation Comedy Based on Content Specific Attributes'. To appear in Multimedia Tools and Applications.
Wactlar et al. (1999). Howard D. Wactlar, Michael G. Christel, Yihong Gong and Alexander G. Hauptmann, 'Lessons Learned from Building a Terabyte Digital Video Library.' Computer February 1999, pp. 66-73.
Weiss, Duda and Gifford (1995). Ron Weiss, Andrzej Duda and David K. Gifford, 'Composition and Search with a Video Algebra.' IEEE Multimedia Spring 1995, pp. 12-25.