# L3ROcc
**Repository Path**: xapples/L3ROcc
## Basic Information
- **Project Name**: L3ROcc
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-18
- **Last Updated**: 2026-04-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
π L3ROcc: Local 3D Reconstruction with Occupancy
`L3ROcc` is a high-performance visual geometry framework designed to transform standard RGB video sequences into high-precision **3D Point Clouds**, **3D Occupancy Grids**, and **4D Temporal Observation Data**.
This project employs **$\pi^3$ (Permutation-Equivariant Visual Geometry Learning)** as its foundational reconstruction engine, and implements a fully automated pipeline for data labeling and alignment that is customized for navigation learning tasks. All processed data adheres to the **LeRobotDataset v2.1** specification. In practical testing, processing a 16-second video segment using this pipeline requires roughly 15 seconds to produce occupancy (occ) and mask data.
## β¨ Key Features
* **End-to-End Reconstruction**: Directly predicts affine-invariant camera poses and scale-invariant globally point clouds from RGB video streams.
* **Automated Voxelization**: Converts unstructured point clouds into structured Occupancy Grids.
* **Visibility Analysis**: Performs real-time ray casting based on intrinsic and extrinsic parameters of camera to compute visible regions (Visible Masks) and occlusion relationships.
* **4D Data Serialization**:
* **Sparse OCC**: Utilizes Sparse CSR matrices to store temporal occupancy, significantly reducing disk usage.
* **Packed Mask**: Implements bit-packing (via `np.packbits`) for visibility masks to optimize storage efficiency.
* **Multi-Dataset Adaptation**: Built-in generators for both `SimpleVideo` (single video) and [`InternData-N1`](https://huggingface.co/datasets/InternRobotics/InternData-N1) (large-scale datasets).
* **Professional Visualization**: Mayavi-based 3D rendering tools for generating side-by-side comparison videos of point clouds, trajectories, and occupancy.
## π‘ Future Work
- [ ] **Semantic Point Cloud**: Integrate semantic segmentation, and instance segmentation to enhance reconstruction quality.
- [ ] **Multi-modal Fusion**: Integrate depth maps to enhance reconstruction quality.
## π Quick Start
### 1. Clone & Install Dependencies
#### (1). Clone the Repository
```bash
git clone --recursive https://github.com/SCEIRobotics/L3ROcc.git
cd L3ROcc
```
#### (2). Install Python Dependencies
##### i. For Production (Generating OCC data for InternData-N1/LeRobot):
Python 3.10+ is recommended. Install the following dependencies:
```bash
conda create -n python=3.10 -y
conda activate
pip install -e .
pip install -e third_party/pi3
```
##### ii. For Visualization (Rendering dynamic videos & 3D inspection):
3D rendering requires a GUI environment. Please set up this environment on your local computer (Windows/macOS/Linux), not on the remote serverοΌ
```bash
# Run on your local machine
conda create -n python=3.8 -y
conda activate
pip install -r requirements_visual.txt
conda install -c conda-forge mayavi
```
(Note: Ensure you have a working OpenGL environment for Mayavi rendering.)
### 2.Model Checkpoints
Place the $\pi^3$ model weights (model.safetensors) and configuration files in the ckpt/ directory at the project root. If the automatic download from Hugging Face is slow, you can download the model checkpoint manually from [here](https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors).
### 3. Run Example οΌThe pipeline supports three primary modesοΌ:
#### Mode A: Generate Visualized Dynamic Video Use this to create side-by-side comparison videos from your own footage with history frames.
```bash
python tools/run_normal_data_occ.py --video_path data/examples/office.mp4 --save_dir data/examples/outputs/ --pcd_save True --mode visual --mesh False
```
#### Mode B: Generate LeRobot-compatible Data Use this to generate the standard dataset structure for model training.
```bash
python tools/run_normal_data_occ.py --video_path data/examples/office.mp4 --save_dir data/examples/outputs/ --pcd_save True --mode run --mesh False
```
#### Mode C: Batch Process InternData-N1 Dataset To process the full InternData-N1 directory with scale alignment enabled.
This mode supports **breakpoint resumption** with the following logic:
- `mask_sequence.npz` (final result file) exists: Skip processing and continue with the next trajectory.
- `overwrite=True`: Force reprocessing and overwrite existing files.
- Otherwise: Proceed with processing.
```bash
python tools/run_intern_nav_occ.py --dataset_root data/examples/small_vln_n1/traj_data --output_root data/examples/small_vln_n1_4/traj_data --pcd_save True --overwrite False --mesh False
```
## π οΈ Pipeline Details
### 1. Data Generators
Located in `L3ROcc/generater/`, the project includes two core generators:
* **SimpleVideoDataGenerator**: Best for individual videos; automatically builds standard directory structures including `meta/`, `videos/`, and `data/`.
* **InternData-N1DataGenerator**: Designed for large-scale InternData-N1 data enhancement; supports **Scale Alignment** using Sim3 to ensure reconstruction coordinates match ground truth.
### 2. Core Configuration
Parameters can be tuned in `L3ROcc/configs/config.yaml`:
* **`pc_range`**: Spatial clipping and perception range for point cloud `[x_min, y_min, z_min, x_max, y_max, z_max]`. Where x is left-right direction, y is height direction (-y for upward), and z is front-back direction.
* **`voxel_size`**: Base size for occupancy voxels (e.g., 0.02m), which is directly related to the sparsity of the occupancy voxel map.
* **`occ_size`**: Number of voxel grids in each spatial dimension, derived from `(pc_range_max - pc_range_min) / voxel_size` with no independent configuration.
* **`interval`**: Frame sampling interval for video processing.
* **`history_len`**: Number of past frames to include in history (default: 10).
* **`history_step`**: Step size for history frame sampling (default: 2).
### 3.Dataset Structure & Contents
#### (1). InternData-N1 Format
The following structure is generated under each trajectory directory (e.g., trajectory_1) to ensure compatibility with robotics learning frameworks:
```
trajectory_1/
βββ data/
β βββ chunk-000/ # Core Geometric Assets
β βββ all_occ.npz # Global scene occupancy grid
β βββ origin_pcd.ply # Downsampled global point cloud
β βββ episode_000000.parquet # Per-frame poses and intrinsics
βββ meta/ # Metadata & Statistics
β βββ info.json # Dataset schema and feature definitions
β βββ episodes.jsonl # Episode metadata and Sim3 scale factors
β βββ episodes_stats.jsonl # Feature statistics (min/max/mean/std)
β βββ tasks.jsonl # Task descriptions
βββ videos/
βββ chunk-000/ # Temporal Sequences
βββ observation.occ.mask/
β βββ mask_sequence.npz # Temporal visibility bitmask
βββ observation.occ.view/
β βββ occ_sequence.npz # Temporal egocentric occupancy
βββ observation.video.trajectory/
βββ reference.mp4 # Original RGB source video
```
##### i. data/chunk-000/ (Core Geometric Assets)
- **all_occ.npz**: Stores the global occupancy grid of the entire scene in world coordinates.
- **origin_pcd.ply**: The initial global point cloud reconstructed from the video, optimized via voxel downsampling for efficient processing.
- **episode_000000.parquet**: A structured data table containing per-frame high-level features:
- **Camera Intrinsics_occ**: 3x3 matrices re-estimated via Least Squares/DLT based on local geometry.
- **Camera Extrinsics_occ**: 4x4 extrinsic matrices predicted by the ΟΒ³ model and aligned to world coordinates.
##### ii. meta/ (Metadata & Statistics)
- **info.json**: Defines the dataset schema, including the data types and shapes for observation.camera_extrinsic_occ and observation.camera_intrinsic_occ.
- **episodes.jsonl**: Contains episode-level constants, most notably the Sim3 Scale Factor used to align the model's relative units to real-world metric scales.
- **episodes_stats.jsonl**: Automatically calculates the statistical distribution (min, max, mean, std) for all observation vectors.
- **tasks.jsonl**: Provides task descriptions and objectives for the dataset.
##### iii. videos/chunk-000/ (Temporal Sequences)
- **observation.occ.mask/mask_sequence.npz**: A time-series of visibility masks. It uses an optimized Bit-packing format to store which voxels are currently visible within the camera's frustum.
- **observation.occ.view/occ_sequence.npz**: A time-series of egocentric occupancy data. Each frame represents the occupied voxels in the current camera coordinate system, stored as a Sparse CSR Matrix to minimize storage overhead.
- **observation.video.trajectory/reference.mp4**: The original RGB video sequence used as input for reconstruction.
#### (2). Visual Format
Outputs generated by the `visual_pipeline` are tailored for rendering and manual inspection:
| Directory/File | Description |
|---------------|-------------|
| `merge_npy_sequence_cam.npy` | Files per frame in Camera Coordinates, merging initial PCD and visible OCC. |
| `merge_npy_sequence_world.npy` | Files per frame in World Coordinates, used for rendering dynamic fused videos. |
| `merge_ply_sequence_cam.ply` | Files per frame in Camera Coordinates for 3D inspection (e.g., MeshLab). |
| `merge_ply_sequence_world.ply` | Files per frame in World Coordinates for 3D inspection. |
| `occ_only_cam_npy.npy` | Files per frame containing only visible OCC in Camera Coordinates for rendering. |
| `occ_only_cam_ply.ply` | Files per frame containing only visible OCC for 3D inspection. |
## πΊ Visualization & Toolbox
A variety of scripts are provided in `tools/visual/` for visualization and analysis:
| Script | Description |
|--------|-------------|
| `visual_simple_frame_npy.py` | Interactive single-frame debugger. Loads individual voxel .npy files, supports interactive view rotation in Mayavi, and prints real-time camera pose parameters (Position/Focal/ViewUp) to determine the optimal fixed view for video rendering. |
| `visual_simple_frame_npz.py` | Fast sparse matrix viewer. Directly reads compressed .npz or .npy files to quickly verify the integrity of generated occupancy data without decompressing the entire sequence. |
| `npy_to_world_video.py` | God's eye (World-View) fusion rendering. Generates third-person global reconstruction videos containing three key elements: true-color background point clouds, global camera trajectories , and accumulated occupancy grids . |
| `npy_to_occ_video.py` | Egocentric (First-Person) stylized rendering. Generates first-person videos with only local occupancy, using Morandi color palette for depth-gradient shading to showcase pure spatial geometric structures. |
| `video_composer_to_3.py` | 3-Panel panoramic composer. Horizontally stitches three video streams to generate the final demo video, typically including: original RGB input video, world-view fusion video, and local occupancy video. |
## π Acknowledgements
This project is built upon the following excellent works:
* [ΟΒ³](https://github.com/yyfz/Pi3)
* [Occ3D](https://arxiv.org/pdf/2304.14365)
* [SurroundOcc](https://github.com/weiyithu/SurroundOcc)
* [InternData-N1](https://huggingface.co/datasets/InternRobotics/InternData-N1)
## πΌ Core Contributors
**Nianjin Ye**1*([GitHub](https://github.com/CallMeFrozenBanana)), **Binling Huang**12*([GitHub](https://github.com/hbl-0624)),**Xi Yang**1([GitHub](https://github.com/kingkids)),**Hao Xu**3([GitHub](https://hxwork.github.io/))
1Sichuan Embodied Intelligence Robot Training Base Β Β Β 2UESTC Β Β Β 3CUHK Β Β Β * (Equal Contribution)
## π Citation
If you find this project useful for your research, please consider citing the foundational works mentioned in the **Acknowledgements**:
```bibtex
@misc{wang2025pi3,
title={$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning},
author={Yifan Wang and Jianjun Zhou and Haoyi Zhu and Wenzheng Chang and Yang Zhou and Zizun Li and Junyi Chen and Jiangmiao Pang and Chunhua Shen and Tong He},
year={2025},
eprint={2507.13347},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13347},
}
@article{tian2023occ3d,
title={Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving},
author={Tian, Xiaoyu and Jiang, Tao and Yun, Longfei and Wang, Yue and Wang, Yilun and Zhao, Hang},
journal={arXiv preprint arXiv:2304.14365},
year={2023}
}
@article{wei2023surroundocc,
title={SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving},
author={Yi Wei and Linqing Zhao and Wenzhao Zheng and Zheng Zhu and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2303.09551},
year={2023}
}
@misc{interndata_n1,
title={InternData-N1 Dataset},
author={InternData-N1 Dataset contributors},
howpublished={\url{https://huggingface.co/datasets/InternRobotics/InternData-N1}},
year={2025}
}
```
## π License
For academic use, this project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for details. For commercial use, please contact the authors.