Use a small model to generate a 'draft' output, then use a larger, smarter model to score the draft, then use a rejection sampling scheme to accept the tokens on which the small and large models agree.
In tests, they find that a draft model gives them speedups ranging from 1.92X on a summarization benchmark (XSum) to 2.46X on a code generation task (HumanEval).
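A minimal sketch of the scheme, to make the mechanics concrete. The toy stand-in "models", vocabulary size, and draft length `k` are placeholders I've assumed for illustration, not DeepMind's implementation; only the accept/resample rule follows the general speculative sampling recipe:

```python
import numpy as np

VOCAB = 16                       # toy vocabulary size (assumption)
rng = np.random.default_rng(0)   # sampler for all stochastic choices

def _probs(context, salt):
    """Deterministic toy next-token distribution for a given context."""
    seed = (hash(tuple(context)) + salt) % (2**32)
    g = np.random.default_rng(seed)
    logits = g.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(context):   # stand-in for the small, fast draft model
    return _probs(context, 0)

def target_probs(context):  # stand-in for the large, accurate model
    return _probs(context, 1)

def speculative_step(context, k=4):
    """Draft k tokens with the small model, verify with the large model,
    and keep the accepted prefix plus one token sampled from the target."""
    drafted, q, ctx = [], [], list(context)
    for _ in range(k):                       # 1. cheap autoregressive draft
        dist = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=dist))
        drafted.append(tok); q.append(dist); ctx.append(tok)
    # 2. score all k+1 positions with the large model
    #    (a single parallel forward pass in a real implementation)
    p = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]
    out = []
    for i, tok in enumerate(drafted):        # 3. rejection sampling
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)                  # small and large models agree
        else:
            # resample from the residual max(0, p - q); this keeps the
            # overall output distribution identical to the large model's
            residual = np.maximum(p[i] - q[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # all k drafts accepted: one bonus token from the target, for free
    out.append(int(rng.choice(VOCAB, p=p[k])))
    return out

# usage: 3 speculative rounds yield up to 15 tokens while calling the
# large model only 3 times, instead of once per token
context = [1, 2, 3]
for _ in range(3):
    context += speculative_step(context)
print(context)
```

The speedup comes from step 2: the large model scores all drafted positions in one pass, so each accepted draft token costs a fraction of a full large-model decoding step.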
From: Import AI 317: DeepMind Speeds Up Language Model Sampling; Voice Cloning Tech Gets Abused; More Scaling Laws for RL