구글에서 발표, MHA와 MQA의 단점을 극복하기 위해 제시
Multi-head attention(MHA): QKV head가 H개 있음, 복잡함
Multi-query attention(MQA): key와 value의 head를 하나로 공유, query만 여러 개. 간결하고 메모리 대역폭 줄지만, 학습 불안정
grouped-query attention(GQA): KV head를 g개로 줄이는 중간 단계, LLAMA2에서 사용

1 Introduction

그 동안 NLP에서 높은 성능을 내기 위해, 모델 사이즈를 키우는 레이스를 펼침 -> 고비용과 추론속도가 계속 증가
모델의 높은 성능과 효율성의 균형을 이루기 위한 연구가 이뤄졌고, Mistral 7B도 그 일환임
- Mistral 7B는 LLaMa 34B를 수학, 코드 생성 등의 분야에서 앞섬.
- 또 none-code 벤치마크의 희생 없이 Code-Llama 7B에 근접한 성능 보임
다음 두 가지 attention mechanism을 사용해 Mistral 7B의 성능을 높임
- GQA: 추론 속도를 높이고, 디코딩에서 메모리 필요치를 줄임
- SWA: 더 긴 시퀀스를 연산량을 줄이면서 다룰 수 있게 함

2 Architectural details

Mistral 7B는 Transformer 아키텍처를 기반으로 함. LLamMa와 비교해서, 크게 세 가지 차이

Table1

Sliding Window Attention

transformer의 stacked layer를 exploit함.
window size W 안에서의 정보만 봄
즉 layer가 깊어지더라도, 최대 k(# layer) * W (# window) 만큼의 정보만 반영됨
W=4096이면, 마지막 layer에 약 131K의 토큰의 span이 반영됨
실제로, 16K 길이의 sequence와 W=4096이면, 기존 vanilla attention 보다 2배의 속도 상승

Rolling Buffer Cache

Cache size에 limit을 두어, 그 이상으로 input이 들어오면, 앞 부분부터 덮어씀
32k token length에서 모델의 퀄리티를 유지한 채, 메모리 사용량을 8배 줄일 수 있음

Pre-fill and Chunking

프롬프트가 미리 주어져있기 때문에, pre-fill 할 수 있음
W=4라고 하면, 총 12 프롬프트 토큰을 그림처럼 세 chunk로 나눌 수 있음
이 때, attention에서 “the dog go to”는 이전 chunk는 참고하지만(cache에 올라감), 첫 번 째 chunk는 참고하지 않음
이로 인해 메모리 사용량 및 연산 속도 개선

3 Results

본 논문에선 Mistral과 Llama의 성능을 주로 비교함, 사용한 태스크는 아래와 같음
- Commonsense Reasoning (0-shot)
- World Knowledge (5-shot)
- Reading Comprehension (0-shot)
- Math
- Code
- Popular aggregated results: MMLU 같은

Table2

Mistral 7B는 Llama2 7B, Llama1 13B를 전반적으로 앞지름

Size and Efficiency

reasoning, comprehension, STEM reasoning은 Llama2와 비교했을 때 x3 정도 파라미터 사이즈와 비슷한 성능

4 Instruction Finetuning

Table3

Mistral 7B의 일반화성능을 평가하기 위해, instruction dataset으로 fine-tuning. Mistral 7B Instruct
13B chat model과 비슷한 성능

5 Adding guardrails for front-facing applications

AI generation에서 front-facing applications(Chatbot 등 User Interface)의 가드레일을 강화하는게 중요한 이슈
유저를 생성된 답변으로부터 보호하고, 윤리적 이슈 지키는게 중요

5.1 System prompt to enforce guardrails

Table4

Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.

MT bench를 통해 확인한 결과, Mistral System Prompt가 우수한 성능 보임

5.2 Content moderation with self-reflection

Table5

Kill과 같은 단어도 맥락 잘 파악해서 사용해야
Mistral은 Kill Linux에 적절한 답변 생성했지만, Llama의 system prompt는 차단해버림.

6 Conclusion

Mistral은 model capabilities, training cost, inference cost를 해결하고 작은 사이즈의 모델에서 좋은 성능 보임

[PAPER] Mistral 7B

[PAPER] Mistral 7B

TLDR

Abstract

grouped-query attention(GQA)

1 Introduction

2 Architectural details

Sliding Window Attention

Rolling Buffer Cache

Pre-fill and Chunking

3 Results

Size and Efficiency

4 Instruction Finetuning

5 Adding guardrails for front-facing applications

5.1 System prompt to enforce guardrails

5.2 Content moderation with self-reflection

6 Conclusion

Join Newsletter

Written by JAE-HYEONG LEE