
Time and Timestamps

This section gives a human-friendly explanation of how time works in MIND, with a focus on real-world use cases:

  • multi-device capture (mocap + video + XR),
  • biosignals (EEG/EMG),
  • training and deploying virtual and robotic agents.

Time in MIND — Big Picture

MIND is designed so that:

  • You can plug in any capture device (mocap, camera, HMD, EEG, robot sensors),
  • Record everything into one Container,
  • And later align all streams on a single, clean timeline for:
    • offline analysis,
    • model training,
    • real-time agent control.

To do this, every Sample and Event in MIND carries two key timestamps plus some optional helpers.
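
For orientation, the timestamp object can be pictured as the following Python dict. The field names t_monotonic, t_system, frame, and sample_index all come from this section; the concrete values are made up and the exact schema is defined elsewhere in the spec:

timestamp = {
    "t_monotonic": 10003333,            # the device's steady clock, in µs
    "t_system": 1710000000000000,       # wall-clock time, in µs
    # optional helpers, depending on the source:
    "frame": 42,                        # video frame number
    "sample_index": 4200,               # index into a high-rate stream such as EEG
}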


Two Clocks: t_monotonic and t_system

Think of it like this:

  • t_monotonic → the device’s steady internal stopwatch
    (never goes backwards, not affected by system clock changes)
  • t_system → the real-world clock on the wall
    (can jump if the user changes the time, or NTP corrects it)

MIND requires both, because:

  • t_monotonic is perfect for ordering and interpolation,
  • t_system is perfect for:
    • aligning different devices,
    • matching logs, databases, or external events.

ASCII diagram:

Device A (HMD)                 Device B (EEG)
 t_monotonic_A                 t_monotonic_B
       |                             |
       v                             v
    [steady]                      [steady]
          \                         /
           \                       /
             +---- t_system ----+   (shared wall-clock timeline)
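
As a minimal sketch, this is how a recorder running in Python could read both clocks in microseconds, assuming the host's monotonic and wall clocks are acceptable stand-ins for the device clocks:

import time

def capture_timestamps_us():
    """Read both clocks on the host, in microseconds."""
    return {
        "t_monotonic": time.monotonic_ns() // 1000,  # steady clock, never goes backward
        "t_system": time.time_ns() // 1000,          # wall clock, may jump (NTP, user changes)
    }

print(capture_timestamps_us())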

Why Microseconds?

MIND expects microsecond resolution or better for timestamps.

Why?

  • Motion capture, XR, and EEG can operate at hundreds or thousands of Hz.
  • Microseconds:
    • are precise enough for these use cases,
    • are still convenient integers,
    • map cleanly to most OS clocks and hardware timers.

If your device only gives milliseconds:

  • You still store microseconds; you just multiply by 1000:
    • 123 ms → 123000 µs.

If your device gives nanoseconds:

  • You can either:
    • divide by 1000 to get microseconds, or
    • store extra precision in additional_clocks.
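
A minimal sketch of both conversions (plain Python, no MIND-specific API assumed):

def ms_to_us(t_ms: int) -> int:
    # milliseconds -> microseconds: multiply by 1000
    return t_ms * 1000

def ns_to_us(t_ns: int) -> int:
    # nanoseconds -> microseconds: integer division drops the sub-µs part
    return t_ns // 1000

assert ms_to_us(123) == 123000            # 123 ms -> 123000 µs
assert ns_to_us(1003333874) == 1003333    # the extra 874 ns are dropped here

If the dropped nanosecond precision matters, keep the raw value in additional_clocks as described above.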

Within a Stream: non-decreasing time

Within a single Stream:

  • timestamps never go backward,
  • they can stay the same (e.g., batch outputs or two sensors fused into one sample).

Example timeline:

Sample 0: t_monotonic = 1000000
Sample 1: t_monotonic = 1001666
Sample 2: t_monotonic = 1003333
Sample 3: t_monotonic = 1003333 (same time, different content)

This makes interpolation, resampling, and model training much simpler.
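
A quick way to enforce this rule when writing or validating a Stream is a single pass that only rejects decreasing timestamps; this sketch assumes Samples shaped like the timestamp examples in this section:

def check_non_decreasing(samples):
    previous = None
    for i, sample in enumerate(samples):
        t = sample["timestamp"]["t_monotonic"]
        if previous is not None and t < previous:
            raise ValueError(f"sample {i} goes backward in time: {t} < {previous}")
        previous = t  # equal timestamps are allowed (e.g. fused sensors)

check_non_decreasing([
    {"timestamp": {"t_monotonic": 1000000}},
    {"timestamp": {"t_monotonic": 1001666}},
    {"timestamp": {"t_monotonic": 1003333}},
    {"timestamp": {"t_monotonic": 1003333}},  # same time, different content
])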


Across Streams: aligning everything

You might have:

  • hand_pose from an XR device,
  • full_body_pose from a mocap system,
  • video_frames from a camera,
  • eeg_signals from a biosensor.

MIND’s rule is:

All of these streams must be alignable onto a shared timeline.

hand_pose and full_body_pose might have slightly different t_monotonic behaviors, but their t_system values (and any sync metadata) let you bring them onto the same time axis.

ASCII sketch:

Global time (t_system, µs)

├─ mocap stream timestamps
├─ xr stream timestamps
├─ video stream timestamps
└─ eeg stream timestamps

MIND requires the Container to carry enough metadata (such as offsets or a sync description) so you can do this alignment robustly.
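
One simple alignment strategy, sketched below, is to estimate a per-stream offset between t_monotonic and t_system (the median is robust to a few jittery samples) and project every stream onto the shared wall-clock axis. Real sync metadata in a Container may describe more precise offsets or drift models; the sample values here are made up:

from statistics import median

def monotonic_to_system_offset(samples):
    # Per-stream offset that maps the steady device clock onto the shared wall clock.
    return median(s["timestamp"]["t_system"] - s["timestamp"]["t_monotonic"]
                  for s in samples)

def to_shared_time_us(sample, offset):
    return sample["timestamp"]["t_monotonic"] + offset

mocap = [{"timestamp": {"t_monotonic": 1000000, "t_system": 1710000000000000}},
         {"timestamp": {"t_monotonic": 1008333, "t_system": 1710000000008333}}]
eeg   = [{"timestamp": {"t_monotonic": 50000,   "t_system": 1710000000001200}},
         {"timestamp": {"t_monotonic": 54000,   "t_system": 1710000000005200}}]

mocap_offset = monotonic_to_system_offset(mocap)
eeg_offset = monotonic_to_system_offset(eeg)

# Both streams now live on the same µs timeline and can be compared directly.
print(to_shared_time_us(mocap[0], mocap_offset), to_shared_time_us(eeg[0], eeg_offset))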


Derived timestamps

Sometimes, devices don’t give you perfect time:

  • cameras without reliable clocks,
  • legacy sensors,
  • data imported from old logs.

MIND allows recorders or tools to compute or fix timestamps, but:

  • they must track provenance:
    • which timestamps are original,
    • which are derived,
    • who derived them.

This is important when training models or debugging a system: you want to know whether you’re looking at raw sensor timing or something that’s been “cleaned up.”
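
As an illustration only (the provenance field names below are hypothetical, not part of the MIND schema), a recorder could keep the original values alongside a note about how any derived ones were produced:

sample = {
    "timestamp": {
        "t_monotonic": 10003333,          # derived: reconstructed from the frame index
        "t_system": 1710000000000000,     # original host clock value
    },
    "timestamp_provenance": {             # hypothetical field, for illustration
        "t_monotonic": {
            "derived": True,
            "method": "frame * nominal_frame_interval_us",
            "derived_by": "import_tool v1.2",
        },
        "t_system": {"derived": False},
    },
}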


Video, frames, EEG, and other high-rate signals

For sources like video or EEG:

  • It’s natural to think in frames or sample indices, not just time.

MIND supports this via:

  • frame → for video frames,
  • sample_index → for high-rate streams (like EEG channels).

These work in addition to the timestamps, not instead of them.

Example:

{
  "timestamp": {
    "t_monotonic": 10003333,
    "t_system": 1710000000000000,
    "frame": 42
  },
  "image_ref": "frame_0042.png"
}

You can then:

  • Seek by frame number in a video editor,
  • Align EEG sample_index with time windows in analysis tools,
  • Or cross-align pose and video frames for training vision-language-action models.
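
For example, given a nominal sampling rate, an EEG sample_index can be mapped to a nominal time for windowing. This is a sketch; the rate and start time are made up:

def sample_index_to_t_us(stream_start_us: int, sample_index: int, rate_hz: float) -> int:
    # Nominal time of the n-th sample in a fixed-rate stream.
    return stream_start_us + round(sample_index * 1000000 / rate_hz)

# A hypothetical 1000 Hz EEG stream starting at t_monotonic = 1000000 µs:
assert sample_index_to_t_us(1000000, 0, 1000.0) == 1000000
assert sample_index_to_t_us(1000000, 250, 1000.0) == 1250000  # 250 ms later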

Events follow the same rules

Events (like grasp start/end, manipulation, button presses) use the same timestamp object as Samples.

This makes it easy to:

  • determine which pose and contact Samples were “active” at the moment of an Event,
  • correlate user behavior with sensor readings,
  • train models that predict events from raw streams.
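
Because Samples within a Stream are non-decreasing in t_monotonic, finding the Sample that was active at an Event's time is just a binary search. A minimal sketch, assuming dict-shaped Samples and Events as in the examples above:

from bisect import bisect_right

def sample_active_at(samples, event_t_monotonic):
    # Most recent Sample at or before the Event's time,
    # or None if the Event precedes the whole stream.
    times = [s["timestamp"]["t_monotonic"] for s in samples]
    i = bisect_right(times, event_t_monotonic) - 1
    return samples[i] if i >= 0 else None

poses = [{"timestamp": {"t_monotonic": t}, "pose": f"pose_{t}"}
         for t in (1000000, 1001666, 1003333)]
event = {"timestamp": {"t_monotonic": 1002000}, "type": "grasp_start"}

assert sample_active_at(poses, event["timestamp"]["t_monotonic"])["pose"] == "pose_1001666"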

What this gives you

By following these rules, you get:

  • precise temporal alignment across:
    • XR devices,
    • mocap systems,
    • cameras,
    • biosensors,
    • agent outputs,
  • data that works both:
    • offline (training, analysis),
    • online (real-time agents and control loops),
  • a clean foundation for retargeting and cross-device fusion.

Time is now something you can trust across your entire MIND ecosystem.