
Time and Timestamps

This section gives a human-friendly explanation of how time works in MIND, with a focus on real-world use cases:

  • multi-device capture (mocap + video + XR),
  • biosignals (EEG/EMG),
  • training and deploying virtual and robotic agents.

Time in MIND — Big Picture

MIND is designed so that:

  • You can plug in any capture device (mocap, camera, HMD, EEG, robot sensors),
  • Record everything into one Container,
  • And later align all streams on a single, clean timeline for:
    • offline analysis,
    • model training,
    • real-time agent control.

To do this, every Sample and Event in MIND carries two key timestamps plus some optional helpers.
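
For orientation, the timestamp object can be pictured as the following Python dict. The field names t_monotonic, t_system, frame, and sample_index all come from this section; the concrete values are made up and the exact schema is defined elsewhere in the spec:

timestamp = {
    "t_monotonic": 10003333,            # the device's steady clock, in µs
    "t_system": 1710000000000000,       # wall-clock time, in µs
    # optional helpers, depending on the source:
    "frame": 42,                        # video frame number
    "sample_index": 4200,               # index into a high-rate stream such as EEG
}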


Two Clocks: t_monotonic and t_system

Think of it like this:

  • t_monotonic → the device’s steady internal stopwatch
    (never goes backwards, not affected by system clock changes)
  • t_system → the real-world clock on the wall
    (can jump if the user changes the time, or NTP corrects it)

MIND requires both, because:

  • t_monotonic is perfect for ordering and interpolation,
  • t_system is perfect for:
    • aligning different devices,
    • matching logs, databases, or external events.

ASCII diagram:

Device A (HMD)                 Device B (EEG)
 t_monotonic_A                 t_monotonic_B
       |                             |
       v                             v
    [steady]                      [steady]
          \                         /
           \                       /
             +---- t_system ----+   (shared wall-clock timeline)
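
As a minimal sketch, this is how a recorder running in Python could read both clocks in microseconds, assuming the host's monotonic and wall clocks are acceptable stand-ins for the device clocks:

import time

def capture_timestamps_us():
    """Read both clocks on the host, in microseconds."""
    return {
        "t_monotonic": time.monotonic_ns() // 1000,  # steady clock, never goes backward
        "t_system": time.time_ns() // 1000,          # wall clock, may jump (NTP, user changes)
    }

print(capture_timestamps_us())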

Why Microseconds?

MIND expects microsecond resolution or better for timestamps.

Why?

  • Motion capture, XR, and EEG can operate at hundreds or thousands of Hz.
  • Microseconds:
    • are precise enough for these use cases,
    • are still convenient integers,
    • map cleanly to most OS clocks and hardware timers.

If your device only gives milliseconds:

  • You still store microseconds; you just multiply by 1000:
    • 123 ms → 123000 µs.

If your device gives nanoseconds:

  • You can either:
    • divide by 1000 to get microseconds, or
    • store extra precision in additional_clocks.
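
A minimal sketch of both conversions (plain Python, no MIND-specific API assumed):

def ms_to_us(t_ms: int) -> int:
    # milliseconds -> microseconds: multiply by 1000
    return t_ms * 1000

def ns_to_us(t_ns: int) -> int:
    # nanoseconds -> microseconds: integer division drops the sub-µs part
    return t_ns // 1000

assert ms_to_us(123) == 123000            # 123 ms -> 123000 µs
assert ns_to_us(1003333874) == 1003333    # the extra 874 ns are dropped here

If the dropped nanosecond precision matters, keep the raw value in additional_clocks as described above.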

Within a Stream: non-decreasing time

Within a single Stream:

  • timestamps never go backward,
  • they can stay the same (e.g., batch outputs or two sensors fused into one sample).

Example timeline:

Sample 0: t_monotonic = 1000000
Sample 1: t_monotonic = 1001666
Sample 2: t_monotonic = 1003333
Sample 3: t_monotonic = 1003333 (same time, different content)

This makes interpolation, resampling, and model training much simpler.
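
A quick way to enforce this rule when writing or validating a Stream is a single pass that only rejects decreasing timestamps; this sketch assumes Samples shaped like the timestamp examples in this section:

def check_non_decreasing(samples):
    previous = None
    for i, sample in enumerate(samples):
        t = sample["timestamp"]["t_monotonic"]
        if previous is not None and t < previous:
            raise ValueError(f"sample {i} goes backward in time: {t} < {previous}")
        previous = t  # equal timestamps are allowed (e.g. fused sensors)

check_non_decreasing([
    {"timestamp": {"t_monotonic": 1000000}},
    {"timestamp": {"t_monotonic": 1001666}},
    {"timestamp": {"t_monotonic": 1003333}},
    {"timestamp": {"t_monotonic": 1003333}},  # same time, different content
])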


Across Streams: aligning everything

You might have:

  • hand_pose from an XR device,
  • full_body_pose from a mocap system,
  • video_frames from a camera,
  • eeg_signals from a biosensor.

MIND’s rule is:

All of these streams must be alignable onto a shared timeline.

hand_pose and full_body_pose might have slightly different t_monotonic behaviors, but their t_system values (and any sync metadata) let you bring them onto the same time axis.

ASCII sketch:

Global time (t_system, µs)

├─ mocap stream timestamps
├─ xr stream timestamps
├─ video stream timestamps
└─ eeg stream timestamps

MIND requires the Container to carry enough metadata (such as offsets or a sync description) so you can do this alignment robustly.
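
One simple alignment strategy, sketched below, is to estimate a per-stream offset between t_monotonic and t_system (the median is robust to a few jittery samples) and project every stream onto the shared wall-clock axis. Real sync metadata in a Container may describe more precise offsets or drift models; the sample values here are made up:

from statistics import median

def monotonic_to_system_offset(samples):
    # Per-stream offset that maps the steady device clock onto the shared wall clock.
    return median(s["timestamp"]["t_system"] - s["timestamp"]["t_monotonic"]
                  for s in samples)

def to_shared_time_us(sample, offset):
    return sample["timestamp"]["t_monotonic"] + offset

mocap = [{"timestamp": {"t_monotonic": 1000000, "t_system": 1710000000000000}},
         {"timestamp": {"t_monotonic": 1008333, "t_system": 1710000000008333}}]
eeg   = [{"timestamp": {"t_monotonic": 50000,   "t_system": 1710000000001200}},
         {"timestamp": {"t_monotonic": 54000,   "t_system": 1710000000005200}}]

mocap_offset = monotonic_to_system_offset(mocap)
eeg_offset = monotonic_to_system_offset(eeg)

# Both streams now live on the same µs timeline and can be compared directly.
print(to_shared_time_us(mocap[0], mocap_offset), to_shared_time_us(eeg[0], eeg_offset))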


Derived timestamps

Sometimes, devices don’t give you perfect time:

  • cameras without reliable clocks,
  • legacy sensors,
  • data imported from old logs.

MIND allows recorders or tools to compute or fix timestamps, but:

  • they must track provenance:
    • which timestamps are original,
    • which are derived,
    • who derived them.

This is important when training models or debugging a system: you want to know whether you’re looking at raw sensor timing or something that’s been “cleaned up.”
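
As an illustration only (the provenance field names below are hypothetical, not part of the MIND schema), a recorder could keep the original values alongside a note about how any derived ones were produced:

sample = {
    "timestamp": {
        "t_monotonic": 10003333,          # derived: reconstructed from the frame index
        "t_system": 1710000000000000,     # original host clock value
    },
    "timestamp_provenance": {             # hypothetical field, for illustration
        "t_monotonic": {
            "derived": True,
            "method": "frame * nominal_frame_interval_us",
            "derived_by": "import_tool v1.2",
        },
        "t_system": {"derived": False},
    },
}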


Video, frames, EEG, and other high-rate signals

For sources like video or EEG:

  • It’s natural to think in frames or sample indices, not just time.

MIND supports this via:

  • frame → for video frames,
  • sample_index → for high-rate streams (like EEG channels).

These work in addition to the timestamps, not instead of them.

Example:

{
  "timestamp": {
    "t_monotonic": 10003333,
    "t_system": 1710000000000000,
    "frame": 42
  },
  "image_ref": "frame_0042.png"
}

You can then:

  • Seek by frame number in a video editor,
  • Align EEG sample_index with time windows in analysis tools,
  • Or cross-align pose and video frames for training vision-language-action models.
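
For example, given a nominal sampling rate, an EEG sample_index can be mapped to a nominal time for windowing. This is a sketch; the rate and start time are made up:

def sample_index_to_t_us(stream_start_us: int, sample_index: int, rate_hz: float) -> int:
    # Nominal time of the n-th sample in a fixed-rate stream.
    return stream_start_us + round(sample_index * 1000000 / rate_hz)

# A hypothetical 1000 Hz EEG stream starting at t_monotonic = 1000000 µs:
assert sample_index_to_t_us(1000000, 0, 1000.0) == 1000000
assert sample_index_to_t_us(1000000, 250, 1000.0) == 1250000  # 250 ms later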

Events follow the same rules

Events (like grasp start/end, manipulation, button presses) use the same timestamp object as Samples.

This makes it easy to:

  • determine which pose and contact Samples were “active” at the moment of an Event,
  • correlate user behavior with sensor readings,
  • train models that predict events from raw streams.
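
Because Samples within a Stream are non-decreasing in t_monotonic, finding the Sample that was active at an Event's time is just a binary search. A minimal sketch, assuming dict-shaped Samples and Events as in the examples above:

from bisect import bisect_right

def sample_active_at(samples, event_t_monotonic):
    # Most recent Sample at or before the Event's time,
    # or None if the Event precedes the whole stream.
    times = [s["timestamp"]["t_monotonic"] for s in samples]
    i = bisect_right(times, event_t_monotonic) - 1
    return samples[i] if i >= 0 else None

poses = [{"timestamp": {"t_monotonic": t}, "pose": f"pose_{t}"}
         for t in (1000000, 1001666, 1003333)]
event = {"timestamp": {"t_monotonic": 1002000}, "type": "grasp_start"}

assert sample_active_at(poses, event["timestamp"]["t_monotonic"])["pose"] == "pose_1001666"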

What this gives you

By following these rules, you get:

  • precise temporal alignment across:
    • XR devices,
    • mocap systems,
    • cameras,
    • biosensors,
    • agent outputs,
  • data that works both:
    • offline (training, analysis),
    • online (real-time agents and control loops),
  • a clean foundation for retargeting and cross-device fusion.

Time is now something you can trust across your entire MIND ecosystem.