Geometry in the balance

Luc VAN GOOL

IKT/BIWI, D-ELEK, ETH, Gloriastrasse 35, 8092 Zurich, Switzerland,

ESAT/PSI, Kath. Univ. Leuven, Kardinaal Mercierlaan 94, 3001 Leuven, Belgium

Abstract. Graphics and computer vision increasingly call on image-based techniques for the creation of models. Image-based means that the models are generated directly from images. But even if an image-based approach is taken, an interesting question is to what extent the models should be expressed in terms of explicit geometry. Depending on the application, a reproduction of appearance can suffice. Four developments are described, two on either side of this dividing line: mimicking 3D effects in finely structured surfaces based on texture mapping, and recognition of 3D objects from local surface patches, as examples where explicit geometry is avoided; and 3D shape reconstruction from plain video data, and 3D face animation from learned lip motions, as examples where explicit geometry is the goal.

1. Computer Vision: a To-do List

Computer vision systems still have important limitations. They need strong a priori assumptions about the scene. They show little self-diagnosis and self-learning. They usually yield quantitative output to another machine and cannot have a natural dialogue with a user. They mainly focus on static scenes and only rarely interpret events. They typically extract only a few cues, rather than exploiting the visual input to the full. They can recognise individual objects, but not object classes. They are often too slow for real-time operation when running sophisticated algorithms. Future systems will focus more on the dynamic and continuous nature of the world. Shape, surface characteristics, and motions of an object will be exploited together. Future systems will also be more context-sensitive and opportunistic, making them more flexible and robust. As cameras become very cheap, tens or hundreds of them can be combined into systems that have a more holistic view of a situation than we have. Computer vision will increasingly be part of multi-modal and multi-purpose systems. An interesting question is how much explicit knowledge of object and scene geometry is actually needed in the process. This question is discussed in the next section. After that, some attempts to advance in the aforementioned directions are described, along with the different degrees to which they use geometry.

2. Appearance vs. Geometry

At first sight, object recognition and visualisation seem to require knowledge of the object’s 3D shape, i.e. its geometry. Yet recent developments in computer vision and graphics show that this need not be the case. Several tasks can be solved with models that are more directly based on images of the objects. Such appearance-based approaches may even surpass the geometry-based ones when it comes to speed and the complexity of scenes they can deal with. At the same time, recent research has simplified the extraction of 3D geometry by working directly from images. There is an important difference between real and virtual worlds. In the former, all interactions are produced by nature. In the latter, they have to be produced by the system. It is doubtful whether one can do without geometry when the complexity of such interactions increases: sampling the space of variations and interactions with images quickly becomes prohibitive. A good compromise may lie in a combination of appearance-based and geometry-based strategies, e.g. geometry for overall shape and appearance for fine surface structure. Next, examples from this spectrum between geometry and appearance are discussed: an appearance-based method to describe viewpoint-dependent surface structure, the recognition of 3D objects combining some geometry with appearance, the extraction of 3D scene geometry from plain video sequences, and the animation of faces from example 3D lip motions. All four cases work directly from images and represent steps in the directions set out earlier: combining shape with surface features, increasing robustness against changes in viewpoint or illumination, capitalising on the availability of several cameras, and characterising observed dynamics, respectively.

3. Example 1: Texture Maps with 3D Effects

Objects may have very fine surface geometry that is difficult to model explicitly. One can generate a realistic impression of such structure by covering a smooth geometric model with a fine texture that looks similar. Starting from images of the surface, a compact model can be made that captures the relevant perceptual information and can produce arbitrary amounts of the same texture. Such models also yield features to characterise the surface. Usually, if a surface has a texture, it is because it is not really flat. As a result, oblique views of the surface entail more than a simple foreshortening in the slant direction. Effects like self-occlusions or moving shadows and highlights are clearly important. An attempt to include these effects will be illustrated.

4. Example 2: 3D Object Recognition Independent of Viewpoint

Humans easily recognise objects from different viewpoints, although a change of viewpoint may drastically alter their appearance. Initial attempts to endow computers with a similar flexibility used 3D models of the objects. Hypotheses about object identity and pose were made, and the model would be rendered to compare the result against the actual image. This hypothesise-and-verify paradigm is computationally expensive and hard to apply to objects that are difficult to model in 3D. Recently, so-called appearance-based approaches have been introduced as an alternative. Here, models are based on plain images that form a representative sample of relevant viewing conditions. Recognition amounts to a clever type of template matching. Such a direct approach requires many images and has difficulty handling occlusions and changing backgrounds. Instead of using explicit geometry or only raw images as a model, there is a third option. It uses only a limited number of views. From each view, a series of local surface patches is extracted in a way that is invariant under changes in viewpoint and illumination. This invariance ensures that a few views suffice. The locality of the features lets the system withstand occlusions and changing backgrounds. The direct use of images eliminates the need for 3D modeling.
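The role of invariance and locality in this third option can be illustrated with a minimal sketch. It is not the method of the work summarised here: it assumes patches are already extracted as flat lists of grey values, it covers only invariance to a gain and offset change in intensity (not the viewpoint invariance mentioned above), and the names `normalise`, `similarity`, `recognise`, and the threshold value are hypothetical.

```python
import math

def normalise(patch):
    """Zero-mean, unit-norm copy of a patch (flat list of intensities).

    This cancels intensity changes of the form I -> a*I + b with a > 0,
    a simple stand-in for illumination invariance (assumption)."""
    mean = sum(patch) / len(patch)
    centred = [v - mean for v in patch]
    norm = math.sqrt(sum(v * v for v in centred))
    if norm == 0.0:
        return [0.0] * len(patch)  # flat patch carries no structure
    return [v / norm for v in centred]

def similarity(p, q):
    """Normalised cross-correlation of two equally sized patches, in [-1, 1]."""
    return sum(a * b for a, b in zip(normalise(p), normalise(q)))

def recognise(query_patches, model_db, threshold=0.9):
    """Let each query patch vote for the object owning its best model patch.

    model_db maps an object name to the patches taken from a few model views.
    Patches that match nothing well (clutter, occluders) simply cast no vote."""
    votes = {}
    for q in query_patches:
        best_name, best_score = None, threshold
        for name, patches in model_db.items():
            for m in patches:
                s = similarity(q, m)
                if s > best_score:
                    best_name, best_score = name, s
        if best_name is not None:
            votes[best_name] = votes.get(best_name, 0) + 1
    return votes

# Toy data: the first query patch is a brightened copy of a "mug" patch,
# the second is unrelated clutter.
model_db = {"mug": [[1, 2, 3, 4], [4, 1, 4, 1]], "box": [[1, 1, 5, 5]]}
query = [[12, 14, 16, 18], [5, 0, 0, 5]]
votes = recognise(query, model_db)
```

Because every patch votes independently, hiding part of the object only removes some votes rather than invalidating the whole match, which is the benefit of locality described above.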

5. Example 3: 3D Reconstruction from Video Sequences

In the previous examples, working directly from images was presented as an alternative to the use of 3D object models. But recent research in computer vision has shown that 3D modeling with images as the only input is possible. Neither the camera parameters nor the camera motion need to be known. This means that the input used for appearance-based systems could equally well be used to build explicit 3D models. As a matter of fact, adding geometry to appearance-oriented techniques such as lightfields improves the quality of the corresponding visualisations. The available geometry also simplifies augmented reality applications, where objects are added to a scene with which they have to interact.
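To make the step from images to explicit geometry concrete, the sketch below shows only the final stage of such a pipeline: recovering one 3D point from a feature matched in two frames. It assumes that the camera centres and viewing directions have already been recovered by an earlier self-calibration and motion estimation stage, which is not shown, and it uses the simple midpoint method rather than whatever estimator the reconstruction system itself employs. The function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two 3-vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point midway between the closest points of two rays.

    Each ray is c + s * d, where c is a camera centre and d the viewing
    direction through the matched image point. Both are assumed to come
    from a previous camera recovery step (assumption, not shown here)."""
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is undefined")
    # Ray parameters that minimise the distance between the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]

# Two cameras two units apart, both observing the same scene point.
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [2, 0, 0], [-1, 2, 5])
```

With noisy matches the two rays no longer intersect, and the midpoint serves as a reasonable first estimate that a full system would typically refine.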

Movie of 3D reconstruction of Arenberg Castle (AVI file, 11.8 MB)

6. Example 4: Facial Animation

Humans are very sensitive to even the smallest details in face dynamics. Hence, if facial animations are to give a realistic impression, the synthetic motions must be extremely close to the real ones. Moreover, it is often important that faces can be animated for arbitrary head orientations. The availability of detailed 3D measurements and characterisations of face dynamics would be of great help. This seems to be a case where explicit knowledge of 3D geometry ultimately yields higher compactness and increased flexibility. Convincing demonstrations have been given where face animation amounts to composing fronto-parallel keyframe images of faces, but generalising these to arbitrary viewpoints would increase the number of necessary keyframes enormously. In the proposed approach, 3D face dynamics are learned and subsequently ‘transplanted’ onto a virtual face.

Movie of facial animation (AVI file, 4 MB)