Dongseok Shim

Research Scientist, Creative AI Lab — Sony Group Corporation

I am a Research Scientist at Creative AI Lab, Sony Group Corporation in Tokyo, Japan. I received my Ph.D. from the Lab for Autonomous Robotics Research (LARR) at Seoul National University, advised by Prof. H. Jin Kim. Previously, I was a Research Intern with Seed Vision at ByteDance USA.

My research focuses on multimodal generative AI and robotics, including video-to-audio generation, text-to-3D generation, multimodal human motion generation, depth estimation, and reinforcement learning.

Experience

Sony Group Corporation — Tokyo, Japan
Research Scientist, Creative AI Lab · Apr. 2025 – Present

Working on multimodal generative models including video-to-audio generation and multimodal human motion generation.

ByteDance (TikTok) — San Jose, CA, USA
Research Intern, Seed Vision · Apr. 2024 – Sep. 2024

Mentors: Yichun Shi, Peng Wang. Worked on relightable text-to-3D generative models.

Education

Ph.D. in Artificial Intelligence
Seoul National University · Advised by Prof. H. Jin Kim
2020 – 2025 · Seoul, Korea
B.S. in Mechanical Engineering
Seoul National University
2016 – 2020 · Seoul, Korea

Publications

A 1 after an author name denotes equal contribution.

Odoriko
Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion
ECCV 2026 project page / arXiv

A shape-aware human motion framework that supports generation and estimation from multimodal inputs, including text, music, video, and 2D poses, while improving biomorphological plausibility.

Echoes Over Time
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
CVPR 2026 project page / arXiv

A multimodal hierarchical network (MMHNet) that enables scalable video-to-audio generation, achieving long-form synthesis by training on short instances and generalizing to extended durations.

PTC-Depth
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Leezy Han, Seunggyu Kim, Dongseok Shim, Hyeonbeom Lee
CVPR 2026 project page / arXiv

A consistency-aware monocular depth estimation framework that leverages wheel odometry and optical flow to produce temporally stable and accurate depth predictions.

EUGens
EUGens: Efficient, Unified and General Dense Layers
NeurIPS 2025 arXiv

A unified and efficient dense layer that approximates fully connected feedforward layers with linear-time complexity, enabling scalable and resource-efficient neural networks.

Periodic Skill Discovery
Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang, H. Jin Kim
NeurIPS 2025 project page / arXiv / code

An unsupervised reinforcement learning framework that encodes states in a circular latent space to discover diverse periodic behaviors for locomotion.

SINCRO
Single-view 3D-aware Representations for Reinforcement Learning by Cross-view Neural Radiance Fields
Daesol Cho1, Seungyeon Yoo1, Dongseok Shim, H. Jin Kim
IEEE RA-L 2025 project page / paper / code

A reinforcement learning framework that learns 3D-aware representations from single-view RGB via masked ViT and NeRF, enabling viewpoint-robust robot manipulation without multi-view supervision.

MVLight
MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion
Dongseok Shim, Yichun Shi, Kejie Li, H. Jin Kim, Peng Wang
arXiv 2024 arXiv

A light-conditioned multi-view diffusion model that improves text-to-3D generation by disentangling lighting-dependent and invariant components for enhanced relighting and geometry.

SEDiff
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
Dongseok Shim, H. Jin Kim
ECCV 2024 pdf

A latent diffusion-based framework that removes domain-specific components and preserves structural consistency for domain-adaptive monocular depth estimation.

DIVIDE
DIVIDE: Learning a Domain-Invariant Geometric Space for Depth Estimation
Dongseok Shim, H. Jin Kim
IEEE RA-L 2024 paper

A depth estimation framework that learns domain-invariant geometric representations by disentangling domain-specific components with Gram matrix representations.

Mono-camera target chasing
Mono-camera-only Target Chasing for a Drone in a Dense Environment by Cross-modal Learning
Seungyeon Yoo1, Seungwoo Jung1, Yunwoo Lee, Dongseok Shim, H. Jin Kim
IEEE RA-L 2024 project page / paper / video

A cross-modal learning framework that learns unified representations from RGB, depth, and semantic inputs for enhanced drone-based target tracking.

DiffuPose
DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model
Jeongjun Choi1, Dongseok Shim1, H. Jin Kim
IROS 2023 arXiv / code

The first diffusion-based framework for monocular 3D human pose estimation, generating diverse 3D pose hypotheses from a single 2D keypoint input using a graph convolutional network (GCN).

SNeRL
SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning
Dongseok Shim1, Seungjae Lee1, H. Jin Kim
ICML 2023 arXiv / code

A semantic-aware NeRF-based framework that jointly learns 3D-aware and object-centric representations for improved reinforcement learning performance.

SwinDepth
SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network
Dongseok Shim, H. Jin Kim
ICRA 2023 arXiv / code

A Swin Transformer-based encoder and densely cascaded decoder architecture for unsupervised depth estimation from monocular sequences.

S2P
S2P: State-conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning
Daesol Cho1, Dongseok Shim1, H. Jin Kim
NeurIPS 2022 arXiv / code

A generative state-to-image framework that bridges state and image domains to improve generalization in image-based offline reinforcement learning.

FS-NCSR
FS-NCSR: Increasing Diversity of the Super-Resolution Space via Frequency Separation and Noise-Conditioned Normalizing Flow
Ki-Ung Song1, Dongseok Shim1, Kang-wook Kim1, Jae-young Lee, Younggeun Kim
CVPR NTIRE Workshop 2022 arXiv / code

A frequency-separated, normalizing flow (NF)-based super-resolution framework that generates diverse and high-quality outputs by modeling high-frequency details.

Geometric Representation
Learning a Geometric Representation for Data-Efficient Depth Estimation via Gradient Field and Contrastive Loss
Dongseok Shim, H. Jin Kim
ICRA 2021 arXiv / code

A self-supervised learning approach for monocular depth estimation that leverages gradient-based representations and momentum contrastive loss to capture geometric information.

Honors & Awards

2025
Best Ph.D. Dissertation Award (Honorable Mention), Graduate School of Artificial Intelligence, Seoul National University
2022
Runner-up at NTIRE 2022 Challenge on Learning the Super-Resolution Space

Academic Service

Reviewer: CVPR, ICCV, ECCV, ICLR, NeurIPS, AAAI, ICML, ICRA, IROS, 3DV, etc.