Human Robot Interface by Pointing 2

2 Geometrical framework

2.1 Viewing the plane

Consider a pinhole camera viewing a plane. The viewing transformation is a plane collineation between some world coordinate system (X,Y), and image plane coordinates (u,v), thus:

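$$ s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = T \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} \qquad (1) $$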
where s is a scale factor that varies for each point, and T is a 3×3 transformation matrix with elements tij. The system is homogeneous, so we can fix t33 = 1 without loss of generality, leaving 8 degrees of freedom. Each observed point provides two linear equations in these unknowns, so to solve for T we must observe at least four points, no three of which are collinear. By assigning arbitrary world coordinates to these points (e.g. (0,0), (0,1), (1,1), (1,0)), we define a new coordinate system on the plane, which we call working plane coordinates.
Now, given the image coordinates of a point anywhere on the plane, along with the image coordinates of the four reference points, it is possible to invert the relation and recover the point's working plane coordinates, which are invariant to the choice of camera location [7]. We use the same set of reference points for a stereo pair of views, and compute two transformations T and T', one for each camera.
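As an illustration of the two steps above, the following is a minimal sketch (not the authors' implementation) of solving for T with t33 = 1 from four point correspondences, and of inverting the relation to recover working plane coordinates. The function names `estimate_homography` and `image_to_plane`, and the image coordinates of the reference points, are assumptions made for the example.

```python
import numpy as np

def estimate_homography(world_pts, image_pts):
    """Solve s*(u, v, 1)^T = T*(X, Y, 1)^T for T, with t33 fixed to 1.

    Each correspondence gives two linear equations in the eight unknowns
    (t11, t12, t13, t21, t22, t23, t31, t32).
    """
    A, b = [], []
    for (X, Y), (u, v) in zip(world_pts, image_pts):
        # u * (t31*X + t32*Y + 1) = t11*X + t12*Y + t13
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y])
        b.append(u)
        # v * (t31*X + t32*Y + 1) = t21*X + t22*Y + t23
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y])
        b.append(v)
    # Least squares, so that more than four points can also be used.
    t, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)
    return np.append(t, 1.0).reshape(3, 3)

def image_to_plane(T, uv):
    """Invert the relation: image point (u, v) -> working plane (X, Y)."""
    p = np.linalg.solve(T, np.array([uv[0], uv[1], 1.0]))
    return p[:2] / p[2]

# Working plane coordinates assigned to the four reference points.
ref_world = [(0, 0), (0, 1), (1, 1), (1, 0)]
# Made-up image positions of those points in one camera.
ref_image = [(100, 100), (120, 300), (320, 280), (300, 90)]

T = estimate_homography(ref_world, ref_image)
centre = image_to_plane(T, (210, 190))  # some other image point on the plane
```

The same function would be called once per camera, with the same `ref_world` list, to obtain T and T'.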

2.2 Recovering the indicated point in stereo

With natural human pointing behaviour, the hand is used to define a line in space, passing through the base and tip of the index finger. This line will not generally lie in the target plane, but intersects the plane at some point. It is this point (the 'piercing point' or 'indicated point') that we aim to recover. Let the pointing finger lie along the line lw in space (see figure 1). Viewed by a camera, it appears on line li in the image, which is also the projection of a plane, P, passing through the image line and the optical centre of the camera. This plane intersects the ground plane G along line lgp. We know that lw lies in P, and that the indicated point lies on lgp, but from one view we cannot tell where along lgp it lies.

Figure 1: Relation between lines in the world, image and ground planes. Projection of the finger's image line li onto the ground plane yields a constraint line lgp on which the indicated point must lie.
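Note that the line li is an image of line lgp; that is, li = Tlgp, in the sense that every point on lgp is mapped by the projective transformation T of equation (1) to a point on li. (In homogeneous line coordinates, where a point x lies on a line l when l·x = 0, this corresponds to lgp being proportional to the transpose of T applied to li.) If the four reference points are visible, this transformation can be inverted to find lgp in terms of the working plane coordinates. The indicated point is constrained to lie upon this line on the target surface.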
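The mapping of the image line onto the plane can be sketched as follows, assuming lines are held as homogeneous 3-vectors l with l·x = 0 for points x on the line. Since points map as x_image ~ T x_plane, the constraint line is obtained as lgp ~ Tᵀ li. The matrix `T`, the finger base and tip coordinates, and all function names here are hypothetical values chosen for illustration.

```python
import numpy as np

def to_h(p):
    """2-D point -> homogeneous 3-vector."""
    return np.array([p[0], p[1], 1.0])

def project(T, p):
    """Map a working plane point into the image with T."""
    x = T @ to_h(p)
    return x[:2] / x[2]

def line_through(p, q):
    """Homogeneous line through two 2-D points (cross product)."""
    return np.cross(to_h(p), to_h(q))

def image_line_to_plane(T, l_img):
    """If x_img ~ T x_plane, then l_img . (T x_plane) = 0, so l_plane ~ T^T l_img."""
    return T.T @ np.asarray(l_img, dtype=float)

# Hypothetical plane-to-image transformation for one camera.
T = np.array([[200.0, 20.0, 100.0],
              [-10.0, 200.0, 100.0],
              [0.1, 0.2, 1.0]])

# Made-up image positions of the base and tip of the index finger.
base_img, tip_img = (150.0, 220.0), (180.0, 190.0)
l_i = line_through(base_img, tip_img)   # line of pointing in the image
l_gp = image_line_to_plane(T, l_i)      # constraint line in working plane coordinates
```

Note that no explicit matrix inverse is needed for lines: inverting the point mapping corresponds to multiplying line coordinates by the transpose of T.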
Repeating the above procedure with the second camera C' gives us another view li' of the finger, and another line of constraint lgp'. The two constraint lines will intersect at a point on the target plane, which is the indicated point. Its position can now be found relative to the four reference points. Figure 2 shows the lines of pointing in a pair of images, and the intersecting constraint lines in a 'canonical' view of the working plane (in which the reference point quadrilateral is transformed to a square). This is a variation of a method employed by Quan and Mohr [8], who present an analysis based on cross-ratios.
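Intersecting the two constraint lines is a one-line operation in homogeneous coordinates. The sketch below assumes each line is a 3-vector (a, b, c) representing aX + bY + c = 0 in working plane coordinates; the function name `intersect_lines` and the tolerance `eps` are assumptions for the example. It returns `None` when the lines are parallel or coincident, a case the text does not discuss but which a practical system would need to handle.

```python
import numpy as np

def intersect_lines(l1, l2, eps=1e-12):
    """Intersection of two homogeneous 2-D lines, or None if (near-)parallel."""
    l1 = np.asarray(l1, dtype=float)
    l2 = np.asarray(l2, dtype=float)
    p = np.cross(l1, l2)  # homogeneous intersection point
    scale = np.linalg.norm(l1) * np.linalg.norm(l2)
    if abs(p[2]) <= eps * scale:
        return None       # intersection at infinity, or identical lines
    return p[:2] / p[2]

# Two made-up constraint lines in the canonical frame: X = 0.5 and Y = 0.25.
l_gp_left = (1.0, 0.0, -0.5)
l_gp_right = (0.0, 1.0, -0.25)
indicated = intersect_lines(l_gp_left, l_gp_right)
```

When the two constraint lines are nearly parallel, the intersection is poorly conditioned, so the recovered point will be sensitive to small errors in the image lines.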
By transforming this point with projections T and T', the indicated point can be projected back into image coordinates. Although the working plane coordinates of the indicated point depend on the configuration of the reference points, its back-projections into the images do not. Because all calculations are restricted to the image and ground planes, explicit 3-D reconstruction is avoided and no camera calibration is necessary. By tracking at least four points on the target plane, the system can be made insensitive to camera motions.
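The claim that the back-projections do not depend on the coordinates assigned to the reference points can be checked numerically. The following sketch uses entirely synthetic data: the ground-truth transformations `H_l` and `H_r`, the plane point `P`, and the second image points `q_l` and `q_r` defining each line of pointing are made-up values, and the helper names are hypothetical. It runs the whole procedure twice, with two different working plane coordinate assignments.

```python
import numpy as np

def estimate_homography(src, dst):
    """Plane-to-image transformation from exactly four points, with t33 = 1."""
    A, b = [], []
    for (X, Y), (u, v) in zip(src, dst):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y]); b.append(u)
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y]); b.append(v)
    t = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(t, 1.0).reshape(3, 3)

def to_h(p):
    return np.array([p[0], p[1], 1.0])

def apply(T, p):
    x = T @ to_h(p)
    return x[:2] / x[2]

def indicated_point_in_images(refs_w, refs_l, refs_r, line_l, line_r):
    """Working plane refs + image refs + image lines -> indicated point in each image."""
    T_l = estimate_homography(refs_w, refs_l)
    T_r = estimate_homography(refs_w, refs_r)
    p = np.cross(T_l.T @ line_l, T_r.T @ line_r)  # intersect the constraint lines
    p_w = p[:2] / p[2]                            # working plane coordinates
    return apply(T_l, p_w), apply(T_r, p_w)       # back-project with T and T'

# Synthetic ground truth (assumed values, unknown to the method itself).
H_l = np.array([[300.0, 40.0, 100.0], [-20.0, 320.0, 80.0], [0.1, 0.2, 1.0]])
H_r = np.array([[280.0, -30.0, 150.0], [25.0, 310.0, 60.0], [-0.15, 0.1, 1.0]])
corners = [(0.0, 0.0), (0.0, 0.3), (0.4, 0.3), (0.4, 0.0)]  # true positions on the plane
P = (0.25, 0.1)                                             # true indicated point

refs_l = [apply(H_l, c) for c in corners]
refs_r = [apply(H_r, c) for c in corners]
q_l, q_r = (50.0, 400.0), (400.0, 420.0)  # second image point on each line of pointing
line_l = np.cross(to_h(apply(H_l, P)), to_h(q_l))
line_r = np.cross(to_h(apply(H_r, P)), to_h(q_r))

unit_square = [(0, 0), (0, 1), (1, 1), (1, 0)]
other_frame = [(0, 0), (0, 2), (3, 2), (3, 0)]
res_a = indicated_point_in_images(unit_square, refs_l, refs_r, line_l, line_r)
res_b = indicated_point_in_images(other_frame, refs_l, refs_r, line_l, line_r)
```

In this setup both `res_a` and `res_b` should agree with the true image positions `apply(H_l, P)` and `apply(H_r, P)`, even though the intermediate working plane coordinates differ between the two runs.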

Figure 2: Pointing at the plane. By taking the lines of pointing in left and right views (a, c), transforming them into the canonical frame defined by the four corners of the grey rectangle (b), and finding the intersection of the lines, the indicated point can be determined; this is then projected back into the images.
