Reasoning about the Form and Content for Multimedia Objects (Extended Abstract)

In Proceedings of AAAI 1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video and Audio, pages 89--94, Stanford University, California, 1997.


Introduction:
Due to the pervasive role of multimedia documents (MDs) in today's information systems, a vast amount of research has been carried out in the last few years on methods for effectively retrieving such documents from large repositories. This research is still in its infancy, due to the inherent difficulty of indexing documents pertaining to media other than text in a way that reflects their information content and, as a consequence, significantly impacts on retrieval. Nonetheless, a number of theoretical results concerning sub-problems (e.g.\ the image retrieval problem) have been obtained and experimented with, and on top of these a first generation of retrieval systems has been built~\cite{ieee} and, in some cases, even turned into commercial products~\cite{virage,qbic}. The distinguishing feature of these multimedia retrieval systems (MRSs), and of the related research models, is the lack of a proper representation and use of the {\em content} of non-textual documents: only features pertaining to their {\em form}, these being the most amenable to automatic extraction through digital signal processing (DSP) techniques, are used in retrieval.

But this is unsatisfactory, as documents, irrespective of the representation medium they employ, are to be regarded as {\em information carriers}, and as such are to be studied along two parallel dimensions, that of {\em form} (or {\em syntax}, or {\em symbol}) and that of {\em content} (or {\em semantics}, or {\em meaning}). Here, ``form'' is just a collective name for all those (medium-dependent) features of an information carrier that pertain to the representation and to the representation medium, while ``content'' is likewise a collective name for those (medium-independent) features that pertain to the slice of the real world being represented, which exists independently of any representation referring to it. The main thrust of this paper is that a data model for the retrieval of MDs (which we here take to consist of multiple sub-documents, each possibly pertaining to a different medium, rather than just non-textual ``atomic'' documents) not only needs to take both dimensions into account, but also requires that {\em each of them be tackled by means of the tools most appropriate to it}, and that these sets of tools be integrated in a principled way in order to ensure transparent user access. Concerning the issue of tool appropriateness, we think that, just as the techniques from DSP (used e.g.\ in image and audio retrieval) are inadequate for reasoning about content, those from the field of knowledge representation are inadequate for dealing with document form.

This study addresses the problem of injecting semantics into MD retrieval by presenting a data model for MDs whose sub-documents may be either texts or images. The way this model realises the interaction between these two media is illustrative of how other media might also be accounted for. Texts and images are represented at the content level as sets of properties of the real-world objects {\em being represented}; at this level, the representation is {\em medium-independent}, and a single language for content representation is thus adopted. This data model is logic-based, in the sense that the latter language is based on a {\em description logic} (DL -- see e.g.~\cite{Borgida95}).
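To give a flavour of the content level, consider a purely illustrative description (the concept and role names below are chosen for this example only, and are not necessarily those of the full paper): a sub-document, be it a text passage or an image region, might be asserted to be about an individual $o$ falling under the concept
\[
\mathit{Musician} \sqcap \exists \mathit{plays}.\mathit{Guitar},
\]
i.e.\ a musician playing a guitar. Since this description mentions only properties of the real-world objects being represented, and no feature of the representing text or image, the very same assertion may serve as the content representation of sub-documents of either medium.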
Texts and images are also represented at the form level, as sets of physical features of the objects {\em representing} a slice of the world; at this level, the representation is {\em medium-dependent}, so different document processing techniques are used to deal with sub-documents expressed in the different media. Although features pertaining to form are not represented explicitly in the DL, they impact on the DL-based reasoning through a mechanism of ``procedural attachments''. This implements the connection between (logical) reasoning about content and (non-logical) reasoning about form. From the point of view of the {\em semantics} of the query language, this connection is established by restricting the set of interpretations of the logical language to those that satisfy the constraints imposed at the form level by the results of the text processing and DSP analysis. This mechanism for giving semantics to procedural attachments, thereby allowing logical and non-logical reasoning to be effectively merged, is also known as the {\em method of concrete domains}~\cite{concretedomains}.

Although the main task of our DL is reasoning about content, our DL-based query language is also endowed with the referential machinery needed to address the form dimension of texts and images; linking the form and the content of the same document is made possible by their sharing of the same DL symbols. The DL-based query language thus allows the expression of retrieval requests addressing both structural (form) and conceptual (content) similarity (an illustrative example is sketched below), and its underlying logic makes it possible, among other things, to bring domain knowledge (whose representation DLs are notoriously good at) to bear on the retrieval process. The query language also includes facilities for fuzzy reasoning, in order to address the inherently quantitative nature of notions like ``similarity'' between texts/images or between their features (word morphology, image colour, image shape, and the like). The model is extensible, in that the set of symbols representing similarity can be enriched at will, to account for different notions of similarity and different methods for computing them. The resulting retrieval capability thus extends that of current MRSs with the use of semantic information processing and reasoning about text/image content.

So far, the only attempts in this direction have been based on textual annotations to non-textual documents (``captions'': see e.g.~\cite{smeaton96}), in some cases supported by the use of thesauri to semantically connect the terms occurring in the text~\cite{guglielmo}; this means that text is seen as a mere comment on the non-textual document, and not as an object of independent interest that is therefore subject to retrieval {\it per se}. In our model, texts and images are both first-class citizens, and this clearly indicates how the extension to other media could be accomplished.
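Before outlining the structure of the paper, we sketch a purely illustrative query (the concept, role, and predicate names are hypothetical, and are not those of the full paper). A request for images showing a guitar-playing musician and chromatically similar to a sample image $i_0$ might be rendered by a query concept along the lines of
\[
\mathit{Image} \;\sqcap\; \exists \mathit{about}.(\mathit{Musician} \sqcap \exists \mathit{plays}.\mathit{Guitar}) \;\sqcap\; \exists \mathit{colourHistogram}.\mathit{similar}_{h(i_0)},
\]
where $\mathit{similar}_{h(i_0)}$ is a concrete-domain predicate whose (fuzzy) truth value is computed, via the procedural attachment, by a DSP routine matching colour histograms against that of $i_0$. The first two conjuncts address content, the last one form; a retrieved image is ranked according to the degree to which it satisfies the query concept as a whole.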
The paper is organised as follows. Section~\ref{sec:l} deals with the ``form'' dimension of texts and images, defining the notions of {\em text layout} and {\em image layout}; these consist of symbolic representations of the form-related aspects of a text or image. Both notions are endowed with a {\em mereology}, i.e.\ a theory of parts, based on the notions of {\em text region} and {\em image region} as defined in digital geometry. In Section~\ref{sec:ric} we briefly introduce a fuzzy DL, discussing its use to represent document content and to ``anchor'' content representations to form representations. Document bases are defined in Section~\ref{sec:db}, while Section~\ref{sec:qil} introduces queries, categorising them with respect to the representation medium and to the dimension involved, and describing how the ``procedural attachment'' and ``concrete domains'' methods provide a smooth integration of form- and content-based retrieval. The full paper also discusses the computational complexity of the model and the realization of an MRS supporting it. Concerning this latter point, we only remark that such a realization is well within reach of current technology. In particular, we have developed a theorem prover for a significant extension of the DL we use here~\cite{carlo31}, based on a sound and complete Gentzen-style sequent calculus; this theorem prover is currently being prototyped for subsequent experimental evaluation.