La-Fabrik/docs/technical/hand-tracking.md
2026-05-02 00:14:56 +02:00

Hand Tracking Technical Notes

This document describes the hand tracking system that exists in the current codebase.

Purpose

Hand tracking is a debug-stage interaction system used to test direct 3D object manipulation with a webcam. It allows a user to close their fist to grab a nearby object and move it in 3D space without relying on the center crosshair.

The feature is scoped to the debug physics scene rather than production gameplay input.

Runtime Flow

  1. The browser captures webcam frames in src/hooks/handTracking/useRemoteHandTracking.ts.
  2. Frames are sent to the local Python backend over WebSocket.
  3. The backend runs MediaPipe hand landmark detection.
  4. The backend returns hand data including landmarks, handedness, score, center point, and isFist.
  5. React stores the latest snapshot in the hand tracking provider.
  6. GrabbableObject reads that snapshot each frame and uses fist state plus raycasting to grab objects.
  7. HandTrackingLeftGlove reads the same snapshot and places the rigged gant_l model on the detected left hand in the debug physics scene.
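
Steps 4 and 5 can be sketched as a small validation step between the WebSocket and the provider. This is a minimal sketch, not the code in useRemoteHandTracking.ts: the message envelope ({ hands: [...] }), the HandTrackingSnapshot type, the { x, y, z } landmark shape, and the parseHandMessage name are all assumptions made for illustration.

```typescript
// Assumed landmark shape; the real HandTrackingLandmark type may differ.
interface HandTrackingLandmark { x: number; y: number; z: number }

interface HandTrackingHand {
  x: number;
  y: number;
  z: number;
  landmarks: HandTrackingLandmark[];
  handedness: string;
  isFist: boolean;
  score: number;
}

// Hypothetical snapshot type held by the hand tracking provider.
interface HandTrackingSnapshot { hands: HandTrackingHand[]; receivedAt: number }

// Runtime check that an unknown value has the fields of HandTrackingHand.
function isHand(value: unknown): value is HandTrackingHand {
  if (typeof value !== "object" || value === null) return false;
  const h = value as Record<string, unknown>;
  return (
    typeof h.x === "number" &&
    typeof h.y === "number" &&
    typeof h.z === "number" &&
    Array.isArray(h.landmarks) &&
    typeof h.handedness === "string" &&
    typeof h.isFist === "boolean" &&
    typeof h.score === "number"
  );
}

// Parses one backend message (assumed to be JSON with a "hands" array).
// Returns null when the message cannot be used.
function parseHandMessage(raw: string, receivedAt: number): HandTrackingSnapshot | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const hands = (data as Record<string, unknown>).hands;
  if (!Array.isArray(hands)) return null;
  return { hands: hands.filter(isHand), receivedAt };
}
```

Returning null for a malformed message lets the caller keep the previous snapshot rather than clearing it.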

Activation Rules

Hand tracking is intentionally gated so the webcam and backend are not used all the time.

The current activation conditions are:

  • debug mode is active (the URL contains the ?debug query parameter)
  • scene mode is physics
  • the player is near an interaction, is holding an object, or is hand-holding an object

This keeps hand tracking active while the player is inside an interaction zone, even if the camera is not aimed directly at the object.
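
The three conditions above combine into a single boolean gate. The sketch below shows that combination under assumed names: ActivationState, its fields, and shouldEnableHandTracking are hypothetical and are not taken from the codebase.

```typescript
// Hypothetical view of the state that gates hand tracking.
interface ActivationState {
  debugEnabled: boolean;       // true when ?debug is present
  sceneMode: string;           // "physics" is the only value named in this document
  isNearInteraction: boolean;  // player is inside an interaction zone
  isHolding: boolean;          // player is holding an object
  isHandHolding: boolean;      // player is holding an object via hand tracking
}

// All of: debug mode, physics scene, and at least one interaction condition.
function shouldEnableHandTracking(state: ActivationState): boolean {
  const interacting =
    state.isNearInteraction || state.isHolding || state.isHandHolding;
  return state.debugEnabled && state.sceneMode === "physics" && interacting;
}
```

Keeping the gate as a pure function makes it easy to check each condition in isolation before wiring it to the webcam and WebSocket lifecycle.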

Backend

The backend lives in backend/ and exposes:

  • GET /health for health checks
  • WS /ws for frame input and hand tracking output

The Python process uses MediaPipe and the local model file:

backend/hand_landmarker.task

The backend sends normalized hand coordinates and landmarks. The frontend treats the values as screen-space inputs, then maps them into world space with the active Three.js camera.
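
As an illustration of the first half of that mapping, the sketch below converts a normalized point into normalized device coordinates (NDC), the form Three.js expects when building a ray from a camera. It assumes the normalized values lie in [0, 1] with the origin at the top-left of the image, which is MediaPipe's convention; whether the frontend mirrors x for a selfie view is not stated here, so mirrorX is a hypothetical option, as is the normalizedToNdc name.

```typescript
interface Ndc { x: number; y: number }

// Maps a normalized image-space point (origin top-left, range [0, 1])
// to NDC (origin center, range [-1, 1], y pointing up).
function normalizedToNdc(x: number, y: number, mirrorX = false): Ndc {
  const nx = mirrorX ? 1 - x : x; // optional horizontal flip (assumption)
  return {
    x: nx * 2 - 1,
    y: 1 - y * 2, // image y grows downward, NDC y grows upward
  };
}
```

In a Three.js scene, the resulting pair could be passed to Raycaster.setFromCamera together with the active camera to obtain a world-space ray.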

Frontend Data Shape

The shared types live in src/types/handTracking/handTracking.ts.

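interface HandTrackingHand {
  x: number; // normalized camera x coordinate
  y: number; // normalized camera y coordinate
  z: number; // relative depth from MediaPipe
  landmarks: HandTrackingLandmark[]; // hand landmarks from the backend
  handedness: string; // which hand was detected
  isFist: boolean; // true when the backend reports a closed fist
  score: number; // score reported by the backend
}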

x and y are normalized camera coordinates. z is a relative depth value from MediaPipe, not an absolute world-space distance.

Grab Targeting

The hand grab logic lives in src/components/three/interaction/GrabbableObject.tsx.

The object is moved toward the visual center of the hand. That center is computed from the bounding box of all landmarks:

centerX = (minX + maxX) / 2
centerY = (minY + maxY) / 2
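
The formula above can be written as a small helper. This is a sketch with assumed names (Point2, boundingBoxCenter) and an assumed landmark shape that exposes x and y; it is not the code in GrabbableObject.tsx.

```typescript
interface Point2 { x: number; y: number }

// Returns the center of the axis-aligned bounding box of the landmarks,
// or null when there are no landmarks to measure.
function boundingBoxCenter(landmarks: Point2[]): Point2 | null {
  if (landmarks.length === 0) return null;
  let minX = Infinity;
  let maxX = -Infinity;
  let minY = Infinity;
  let maxY = -Infinity;
  for (const p of landmarks) {
    minX = Math.min(minX, p.x);
    maxX = Math.max(maxX, p.x);
    minY = Math.min(minY, p.y);
    maxY = Math.max(maxY, p.y);
  }
  return { x: (minX + maxX) / 2, y: (minY + maxY) / 2 };
}
```

Unlike an average of all landmarks, the bounding-box center does not shift toward the fingers, where landmarks are denser.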

Starting a grab uses a slightly wider virtual hit zone. Instead of raycasting only from one point, the code casts several rays around the hand center:

  • center
  • left
  • right
  • up
  • down

If any ray hits the object while the object is within INTERACTION_RADIUS, the object enters hand-holding mode.
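
A minimal sketch of the five-ray hit zone follows. It assumes the hand center is already in NDC (so up is +y), and it abstracts the actual raycast behind a hitTest callback. The spread value, the function names, and the callback are hypothetical; only the five sample directions and the INTERACTION_RADIUS check come from the description above.

```typescript
interface Point2 { x: number; y: number }

// Builds the five sample points: center, left, right, up, down.
function handRaySamples(center: Point2, spread: number): Point2[] {
  return [
    { x: center.x, y: center.y },          // center
    { x: center.x - spread, y: center.y }, // left
    { x: center.x + spread, y: center.y }, // right
    { x: center.x, y: center.y + spread }, // up
    { x: center.x, y: center.y - spread }, // down
  ];
}

// True when the object is within the interaction radius and at least one
// sample ray hits it. hitTest stands in for a real raycast against the object.
function canStartHandGrab(
  samples: Point2[],
  hitTest: (sample: Point2) => boolean,
  distanceToObject: number,
  interactionRadius: number,
): boolean {
  if (distanceToObject > interactionRadius) return false;
  return samples.some(hitTest);
}
```

Checking the radius first avoids running any raycasts for objects that are out of reach.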

Depth Handling

Because MediaPipe z is relative, the frontend records both the hand depth and the hit distance at the moment the grab begins:

initialHandZ = hand.z
initialHoldDistance = hit.distance

While holding, the object distance from the camera is adjusted by the change in hand depth:

holdDistance = initialHoldDistance + (hand.z - initialHandZ) * sensitivity

The final hold distance is clamped between the configured grab minimum and maximum distances to avoid unstable movement.
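
The depth update and clamp can be sketched as one pure function. The GrabStart type, the parameter names, and the function name are assumptions for illustration; the arithmetic follows the formula above.

```typescript
// Values captured when the grab begins.
interface GrabStart {
  initialHandZ: number;
  initialHoldDistance: number;
}

// Restricts value to the inclusive range [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Offsets the initial hold distance by the change in relative hand depth,
// then clamps the result to the configured grab range.
function computeHoldDistance(
  start: GrabStart,
  handZ: number,
  sensitivity: number,
  minDistance: number,
  maxDistance: number,
): number {
  const raw =
    start.initialHoldDistance + (handZ - start.initialHandZ) * sensitivity;
  return clamp(raw, minDistance, maxDistance);
}
```

For example, with an initial hold distance of 2, a depth change of 0.5, and a sensitivity of 2, the unclamped distance is 2 + 0.5 * 2 = 3.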

UI And Debug

The current debug UI includes:

  • HandTrackingDebugPanel inside DebugOverlayLayout for status, usage, loaded glove model, server state, hand count, and fist state
  • HandTrackingVisualizer for the SVG landmark wireframe
  • HandTrackingLeftGlove for the left-hand gant_l model in the R3F scene
  • r3f-perf for render performance
  • lil-gui for scene, camera, lighting, interaction, and grab controls

The hand tracking debug panel is a compact HTML grid outside the canvas. Its Model loaded field displays gant_l when a left hand is detected and none otherwise. The hand wireframe is also HTML/SVG, not a 3D hand model.

Left Glove Model

The current left glove MVP uses public/models/gant_l/model.gltf, which contains a GLTF skin and armature. For now the model is positioned, oriented, and scaled as a whole from palm landmarks instead of driving individual finger bones.
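
One way to derive a whole-hand pose from palm landmarks is sketched below. It is an assumption-heavy illustration, not the logic in HandTrackingLeftGlove: it uses MediaPipe's landmark indices (0 wrist, 5 index MCP, 9 middle MCP, 17 pinky MCP), assumes landmarks expose x, y, and z, and chooses the wrist-to-middle-knuckle length as the scale reference. The sign of the normal may need flipping to match the glove model's orientation.

```typescript
interface Vec3 { x: number; y: number; z: number }

// MediaPipe hand landmark indices used here.
const WRIST = 0;
const INDEX_MCP = 5;
const MIDDLE_MCP = 9;
const PINKY_MCP = 17;

const sub = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const cross = (a: Vec3, b: Vec3): Vec3 => ({
  x: a.y * b.z - a.z * b.y,
  y: a.z * b.x - a.x * b.z,
  z: a.x * b.y - a.y * b.x,
});
const length = (v: Vec3): number => Math.hypot(v.x, v.y, v.z);
const scale = (v: Vec3, s: number): Vec3 => ({ x: v.x * s, y: v.y * s, z: v.z * s });

interface PalmPose {
  center: Vec3;  // average of wrist, index MCP, and pinky MCP
  forward: Vec3; // unit vector from wrist toward the middle knuckle
  normal: Vec3;  // unit vector perpendicular to the palm plane
  size: number;  // wrist-to-middle-knuckle length, usable as a scale factor
}

// Returns null when landmarks are missing or the palm is degenerate.
function palmPose(landmarks: Vec3[]): PalmPose | null {
  if (landmarks.length < 21) return null;
  const wrist = landmarks[WRIST];
  const indexMcp = landmarks[INDEX_MCP];
  const middleMcp = landmarks[MIDDLE_MCP];
  const pinkyMcp = landmarks[PINKY_MCP];

  const forward = sub(middleMcp, wrist);
  const size = length(forward);
  const normal = cross(forward, sub(pinkyMcp, indexMcp));
  const normalLength = length(normal);
  if (size === 0 || normalLength === 0) return null;

  const center: Vec3 = {
    x: (wrist.x + indexMcp.x + pinkyMcp.x) / 3,
    y: (wrist.y + indexMcp.y + pinkyMcp.y) / 3,
    z: (wrist.z + indexMcp.z + pinkyMcp.z) / 3,
  };
  return {
    center,
    forward: scale(forward, 1 / size),
    normal: scale(normal, 1 / normalLength),
    size,
  };
}
```

The forward and normal vectors together define an orientation basis, and size gives a single uniform scale, which matches the whole-model approach described above without touching finger bones.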

The right hand is intentionally ignored in this MVP. The available right-hand models are static in the current assets and are not mapped to MediaPipe bones yet.

Known Limitations

  • The feature is debug-only and focused on the physics test scene.
  • MediaPipe depth is relative and can be noisy.
  • The virtual hit zone is an approximation based on multiple raycasts, not a real 3D collider.
  • There is no smoothing layer for hand position or depth yet.
  • The hand visualization is an SVG landmark wireframe.
  • The left glove follows the palm as a whole; finger-by-finger bone animation still requires a verified landmark-to-bone mapping and smoothing.