
Minwu Kim

DeepSeek-V3: MLA, DeepSeekMoE

March 4

Last updated: March 8

MLA key points:

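Instead of caching K and V per head, compress them into a head-agnostic latent vector c, then map it back up to the original dimension at compute time.
No quality loss compared with vanilla MHA (why??)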
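
A minimal sketch of the compress-then-expand idea above, in plain Python with toy sizes. The names (`W_down`, `W_up_k`, `W_up_v`, `compress`, `expand`) and all dimensions are made up for illustration and are not DeepSeek-V3's actual ones. The sketch also leaves out details of the real design, such as how positional encoding is handled. It only shows that the cache holds c while per-head K and V are rebuilt from it.

```python
import random

random.seed(0)

# Toy sizes chosen for illustration only (not DeepSeek-V3's real dimensions).
D_MODEL, N_HEADS, D_HEAD, D_LATENT = 16, 4, 4, 6

def rand_matrix(rows, cols):
    """Random weight matrix stored as a list of rows."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_down = rand_matrix(D_LATENT, D_MODEL)           # h -> c, shared by all heads
W_up_k = rand_matrix(N_HEADS * D_HEAD, D_LATENT)  # c -> keys for every head
W_up_v = rand_matrix(N_HEADS * D_HEAD, D_LATENT)  # c -> values for every head

def compress(h):
    """Project a token's hidden state to the latent c (this is what gets cached)."""
    return matvec(W_down, h)

def expand(c):
    """Rebuild per-head keys and values from the cached latent."""
    k = matvec(W_up_k, c)
    v = matvec(W_up_v, c)
    split = lambda x: [x[i * D_HEAD:(i + 1) * D_HEAD] for i in range(N_HEADS)]
    return split(k), split(v)

h = [random.gauss(0, 1) for _ in range(D_MODEL)]
c = compress(h)
keys, values = expand(c)

# Cache entries stored per token under each scheme.
mha_cache_per_token = 2 * N_HEADS * D_HEAD  # full K and V for every head
mla_cache_per_token = len(c)                # only the latent vector
print(mha_cache_per_token, mla_cache_per_token)
```

With these toy sizes the per-token cache drops from 32 numbers to 6. The cost is two extra up-projections whenever K and V are needed.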

MoE:

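Unlike earlier MoE models, there are 256 experts, so a far higher share of experts sits idle for any given token.
Drops the auxiliary loss for load balancing.
Only 37B of the 671B parameters are active per token.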
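
A rough sketch of top-k routing without an auxiliary loss, using a per-expert bias to even out load instead. This is my reading of the idea, not the actual implementation. `TOP_K = 8`, `GAMMA`, the sigmoid scores, and the function names `route` and `update_bias` are assumptions for illustration. Only the 256 experts and the 37B/671B figures come from the notes above.

```python
import math
import random

random.seed(0)

N_EXPERTS = 256   # from the notes
TOP_K = 8         # assumption: experts chosen per token
GAMMA = 0.001     # assumption: hypothetical bias update step

bias = [0.0] * N_EXPERTS

def route(scores, bias, k=TOP_K):
    """Pick top-k experts by score + bias; gate weights use the raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, gamma=GAMMA):
    """Nudge overloaded experts down and underloaded experts up (no loss term)."""
    mean_load = sum(load) / len(load)
    for i, n in enumerate(load):
        if n > mean_load:
            bias[i] -= gamma
        elif n < mean_load:
            bias[i] += gamma

# Route one toy batch of 64 tokens with random sigmoid affinity scores.
load = [0] * N_EXPERTS
for _ in range(64):
    scores = [1 / (1 + math.exp(-random.gauss(0, 1))) for _ in range(N_EXPERTS)]
    gates = route(scores, bias)
    for i in gates:
        load[i] += 1
update_bias(bias, load)

idle_fraction = 1 - TOP_K / N_EXPERTS  # share of experts unused by one token
active_param_fraction = 37 / 671       # 37B active out of 671B total
print(idle_fraction, round(active_param_fraction, 3))
```

Under the assumed top-8 routing, about 97% of experts are idle for any single token. The 37B/671B figure works out to roughly 5.5% of parameters active.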
