3 Tracking a pointing hand

3.1 Background

A large number of systems have been proposed for visual tracking and interpretation of hand and finger movements. These systems can broadly be divided into:
  • those concerned with gesture identification (e.g. for sign language), which compare the image sequence with a set of standard gestures using correlation and warping of the templates [9], or classify them with neural networks [10];
  • those which try to reconstruct the pose and shape of the hand (e.g. for teleoperation) by fitting a deformable, articulated model of the palm and finger surfaces to the incoming image sequence [11].

    Common to many of these systems is the requirement to calibrate the templates or hand model to suit each individual user. They also tend to have high computational requirements, taking several seconds per frame on a conventional workstation or requiring expensive multiprocessor hardware for real-time implementation.

    Our approach differs from these general systems in an important respect: we wish only to recover the line along which the hand is pointing, to be able to specify points on a ground plane. This considerably reduces the degrees of freedom which we need to track. Furthermore, because the hand must be free to move about as it points to distant objects, it will occupy only a relatively small fraction of the pixel area in each image, reducing the number of features that can be distinguished.

    In this case it is not unreasonable to insist that the user adopt a rigid gesture. For simplicity, the familiar `pistol' pointing gesture was chosen. The pointing direction can now be recovered from the image of the index finger, although the thumb is also prominent and can be usefully tracked. The rest of the hand, which has a complicated and rather variable shape, is ignored. This does away with the need to calibrate the system to each user's hand.

3.2 Tracking mechanism

    We use a type of active contour model [12, 13, 14] to track the image of a hand in the familiar `pointing' gesture, in real time. Our tracking mechanism was chosen mainly for its speed, simplicity, and modest demand for computer resources. A pair of trackers operate independently on two stereo views.


    Each tracker is based on a template, representing the shape of the occluding contours of an extended index finger and thumb (see figure 3). At about 50 sites around the contour (represented in the figure by dots) are local edgefinders which continuously measure the normal offsets between the predicted contour and actual features in the image; these offsets are used to update the image position and orientation of the tracker.

    The tracker's motion is restricted to 2D affine transformations in the image plane, which ensures that it keeps its shape whilst tracking the fingers in a variety of poses [15]. This approach is suitable for tracking planar objects under weak perspective; however it also works well with fingers, which are approximately cylindrical.

    The positions of these sampling points are expressed in affine coordinates, and their image positions depend on the tracker's local origin and two basis vectors. These are described by six parameters, which change over time as the hand is tracked.
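The affine parametrisation above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and array layout are assumptions, but the idea is exactly as described, a canonical contour point (u, v) is mapped into the image by a local origin and two basis vectors, six parameters in total.

```python
import numpy as np

def affine_map(points_uv, origin, b1, b2):
    """Map canonical contour coordinates (u, v) into the image plane.

    points_uv : (N, 2) canonical coordinates of the sampling points
    origin    : (2,) image position of the tracker's local origin
    b1, b2    : (2,) basis vectors spanning the tracked shape
    The six numbers in (origin, b1, b2) are the tracker's state.
    """
    u = points_uv[:, 0:1]
    v = points_uv[:, 1:2]
    return origin + u * b1 + v * b2

# Example: with unit basis vectors the canonical shape is simply
# translated to the origin; rotations, scalings and shears of the
# contour correspond to other choices of b1 and b2.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mapped = affine_map(pts, np.array([10.0, 20.0]),
                    np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```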

    Figure 3: The finger-tracking active contour: (a) in its canonical frame; (b) after an affine transformation in the image plane (to track a rigid motion of the hand in 3D). It is the index finger that defines the direction of pointing; the thumb is observed to facilitate the tracking of longitudinal translations, which would otherwise be difficult to detect.


    At each time-step, the tracker searches for the maximum image gradient along each sampling interval, which is a short line segment, normal to and centred about a point on the active contour. This yields the normal offsets between points on the active contour and their corresponding edge segments in the image.
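A hypothetical sketch of this edge search, under the simplifying assumption of nearest-pixel sampling (the paper does not specify the interpolation scheme): intensities are sampled at integer steps along the normal segment, and the signed offset of the strongest intensity step is returned.

```python
import numpy as np

def normal_offset(image, point, normal, half_len=5):
    """Search along a short segment normal to the contour for the
    strongest intensity gradient; return its signed offset in pixels.

    image  : 2D grey-level array, indexed [y, x]
    point  : (x, y) position of the sampling point on the contour
    normal : (nx, ny) unit normal to the contour at that point
    """
    steps = np.arange(-half_len, half_len + 1)
    samples = np.array([
        image[int(round(point[1] + t * normal[1])),
              int(round(point[0] + t * normal[0]))]
        for t in steps
    ], dtype=float)
    grads = np.abs(np.diff(samples))       # finite-difference gradient
    k = int(np.argmax(grads))
    return steps[k] + 0.5                  # midpoint of the strongest step

# Example: a synthetic vertical step edge at x = 13, with the contour
# point at x = 10; the measured normal offset should be 2.5 pixels.
img = np.zeros((20, 20))
img[:, 13:] = 1.0
d = normal_offset(img, (10.0, 10.0), (1.0, 0.0))
```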

    The offsets are used to estimate the affine transformation (translation, rotation, scale and shear) of the active contour model which minimises the errors in a least-squares sense. A first-order temporal filter predicts the future position of the contour, improving real-time tracking performance. The filter is biased to favour rigid motions in the image, and limits the rate at which the tracker can change scale; these constraints represent prior knowledge of how the hand's image is likely to change, and increase the reliability with which it can be tracked.
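The least-squares step can be written down explicitly. Each edge measurement constrains the contour's motion only along its normal, so every sampling point contributes one scalar equation in the six affine parameters; stacking these gives an overdetermined linear system. The following is a sketch under assumed conventions (parameter ordering and function name are not from the paper), without the rigidity bias of the authors' filter:

```python
import numpy as np

def fit_affine_update(points, normals, offsets):
    """Least-squares affine update from normal offsets.

    points  : (N, 2) current image positions of the sampling points
    normals : (N, 2) unit contour normals at those points
    offsets : (N,)  measured normal distances to the image edges
    Returns (d, A): a translation (2,) and a 2x2 correction such that
    each point moves by d + A @ p, best explaining the offsets.
    """
    J = np.zeros((len(points), 6))
    for i, (p, n) in enumerate(zip(points, normals)):
        # displacement under (dx, dy, a11, a12, a21, a22) projected onto n:
        # n . (d + A p) = n0*dx + n1*dy + n0*(a11*p0 + a12*p1) + n1*(a21*p0 + a22*p1)
        J[i] = [n[0], n[1], n[0]*p[0], n[0]*p[1], n[1]*p[0], n[1]*p[1]]
    theta, *_ = np.linalg.lstsq(J, offsets, rcond=None)
    return theta[:2], theta[2:].reshape(2, 2)

# Example: synthesise offsets from a known affine motion and refit it.
theta_true = np.array([0.5, -0.2, 0.01, 0.02, -0.01, 0.03])
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)
pts = np.stack([2 * np.cos(angles) + 3, 2 * np.sin(angles) + 1], axis=1)
nrm = np.stack([np.cos(angles + 0.3), np.sin(angles + 0.3)], axis=1)
disp_true = theta_true[:2] + pts @ theta_true[2:].reshape(2, 2).T
obs = np.sum(nrm * disp_true, axis=1)
d, A = fit_affine_update(pts, nrm, obs)
```

The recovered (d, A) reproduce the observed normal offsets exactly, since the synthetic measurements are noise-free.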

    To extract the hand's direction of pointing, we estimate the orientation of the index finger; the base of the thumb is tracked merely to resolve an aperture problem [17] induced by the finger's long thin shape.
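The aperture problem can be illustrated numerically (a hypothetical sketch, not the authors' code): for sampling points along a long straight edge, every normal points across the finger, so translations along the finger produce zero normal offsets and are invisible to the edge measurements alone.

```python
import numpy as np

# Sampling sites along a vertical "finger" edge; all normals point
# sideways, across the finger.
points = np.array([[0.0, y] for y in np.linspace(0.0, 10.0, 6)])
normals = np.tile([1.0, 0.0], (len(points), 1))

def observed_offsets(translation):
    # each site measures only the component of motion along its normal
    return normals @ translation

across = observed_offsets(np.array([1.0, 0.0]))  # motion across the finger
along = observed_offsets(np.array([0.0, 1.0]))   # motion along the finger

# across is all ones (fully observable); along is all zeros
# (unobservable without extra features such as the base of the thumb,
# whose normals point in a different direction)
```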
