LLMs: The Important Ingredients
An overview of the pieces coming together to make LLMs work.
Xander Dunn, 27 Jan 2023
This is a list of the ingredients I think are important right now for the improvement of LLMs: an overview of the literature on improving the various aspects of Large Language Models. Reading a sampling of these papers will give a good understanding of current LLM capabilities and where engineering and research efforts are focused.
It's well known that papers often leave out information that's important for reproducing their results. This is sometimes intentional, sometimes unintentional, and sometimes the information in the paper is simply wrong. So, while this list gives a good overview of the field, there is still a considerable gap between comprehending these papers and implementing their methods well enough to achieve similar results.
- Scaling training. This is primarily a FLOPS-constrained distributed systems problem. See Megatron-LM (its tensor parallelism is sketched after this list) and DeepSpeed.
- Scaling inference. Maximizing throughput, reducing costs, and making full use of GPU memory are important for making LLMs into viable products. See Speculative Sampling (sketched below), Large Transformer Model Inference Optimization, and FlexGen.
- Human preference alignment. RLHF / RLAIF. See InstructGPT (its reward-model loss is sketched below), Hindsight Finetuning, and Pretraining with Human Preferences. See also John Schulman's talk on RLHF, and this for motivation.
- Increasing context window size. This is a very important, very active area of research. We're currently limited to roughly 4,000 tokens on OpenAI's text-davinci-003, but future models will have much larger context windows. See FlashAttention (its online-softmax core is sketched below) and Transformer-XL.
- Red Teaming: Probing models for weaknesses and failure modes before deployment.
- Proficient Tool Use. See Toolformer (sketched below), Cascades, and Augmented Language Models.
- Storage and retrieval across massive datasets using embeddings. This is one approach to dealing with two facts at once: re-training models on new data is slow and expensive, and context windows are very small. It may be the case that retrieval will always be important for LLMs because the context window will never be large enough to fit millions of documents. See RETRO; a minimal retrieval sketch appears below.
- Surfacing Uncertainty: Knowing how confident the model is in what it has done and conveying that to the user. See Language Models (Mostly) Know What They Know; a token-level confidence sketch appears below.
- Deployment and data flywheels. Improvement from human interaction. I'm increasingly of the conviction that we can't achieve human-level capabilities purely by training models in labs behind closed doors. I think direct human interaction is a vital part of learning what's useful to humans.
- System 2 Thinking: Variable computation time. It's a strange thing that our current LLMs expend exactly the same amount of computation per token on all questions, whether the question is trivially simple or extremely complex. It would make sense for models to expend greater computation on more complex queries, exactly as humans do. See Universal Transformers and Adaptive Computation Time (ACT, 2016); the ACT halting mechanism is sketched below.
- Bootstrapping: Using the currently trained model to generate enough new high-quality training data via chain-of-thought + critic model to train the next iteration of the model. If this works, we could achieve escape velocity. See Large Language Models Can Self-Improve and STaR; a skeleton of the loop is sketched below. Similarly, bootstrapping with code: some way of exploring code, running code, modifying code, and learning from its execution. See LEVER.
- Planning / Monte Carlo Tree Search (MCTS). See Adaptive Agents, Go-Explore, and Decision Transformer; a toy MCTS sketch appears below.
- Multimodal: Gain a grounded understanding from experience with types of data other than language. Yann LeCun is loudly beating his drum that LLMs need grounding through experience in media other than text. He would say that most knowledge is not contained within text. This is an active area of research. See Language is Not All You Need.
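
The sketches referenced above follow, in list order. First, scaling training: a toy of Megatron-style tensor parallelism, with NumPy arrays standing in for per-GPU shards. The device count, the shapes, and the final concatenation (a stand-in for an all-gather) are illustrative assumptions, not Megatron-LM's actual API.

```python
import numpy as np

n_devices = 4
x = np.random.randn(8, 512)        # activations, replicated on every "device"
W = np.random.randn(512, 2048)     # weight matrix to shard column-wise

# Each "device" holds one column shard of W and computes a partial output.
shards = np.split(W, n_devices, axis=1)
partials = [x @ shard for shard in shards]

# An all-gather along the feature dimension reassembles the full output.
y = np.concatenate(partials, axis=1)
assert np.allclose(y, x @ W)
```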
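For scaling inference, a minimal sketch of speculative sampling: a cheap draft model proposes a few tokens, the expensive target model verifies them all at once, and a rejected proposal is resampled from the residual distribution, which leaves the output distribution identical to sampling from the target model alone. The toy random "models" and tiny vocabulary are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def draft_dist(ctx):    # stand-in for a small, cheap draft model
    logits = rng.standard_normal(VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def target_dist(ctx):   # stand-in for the large, expensive target model
    logits = rng.standard_normal(VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def speculative_step(ctx, k=4):
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed, q_probs = [], []
    for _ in range(k):
        q = draft_dist(ctx + proposed)
        proposed.append(int(rng.choice(VOCAB, p=q)))
        q_probs.append(q)
    # 2. The target model scores all k positions in one batched pass
    #    (expensive); each proposal is accepted with probability min(1, p/q).
    accepted = []
    for tok, q in zip(proposed, q_probs):
        p = target_dist(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q); this
            # keeps the overall output distribution identical to the target's.
            resid = np.maximum(p - q, 0)
            accepted.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            break
    return ctx + accepted   # (the bonus target-model token is omitted here)

print(speculative_step([3, 1, 4]))
```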
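For human preference alignment, a sketch of the pairwise reward-model loss used in the InstructGPT recipe: push the scalar reward of the human-preferred completion above the rejected one by minimizing -log σ(r_chosen - r_rejected). The tiny linear "reward model" over feature vectors is an illustrative assumption, and the subsequent RL step against this reward is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(32) * 0.01               # reward-model parameters

def reward(features):
    # Scalar reward for one completion, represented here as a feature vector.
    return features @ w

def pairwise_loss_grad(chosen, rejected):
    # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    margin = reward(chosen) - reward(rejected)
    sig = 1.0 / (1.0 + np.exp(-margin))
    loss = -np.log(sig)
    grad = -(1.0 - sig) * (chosen - rejected)    # d(loss)/d(w)
    return loss, grad

# One SGD step on a synthetic preference pair.
chosen, rejected = rng.standard_normal(32), rng.standard_normal(32)
loss, grad = pairwise_loss_grad(chosen, rejected)
w -= 0.1 * grad
print("loss:", loss)
```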
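For context windows, a sketch of the online-softmax idea at the core of FlashAttention: walk over keys and values chunk by chunk, maintaining a running max and normalizer per query row, so the full n×n score matrix is never materialized. Single head, no causal mask, arbitrary chunk size.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))  # materializes all of S
    return (P / P.sum(axis=1, keepdims=True)) @ V

def chunked_attention(Q, K, V, chunk=128):
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                  # running row-wise max of scores
    z = np.zeros(n)                          # running softmax normalizer
    for s in range(0, n, chunk):
        S = Q @ K[s:s+chunk].T / np.sqrt(d)  # scores for this chunk only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)            # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        z = z * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[s:s+chunk]
        m = m_new
    return out / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), chunked_attention(Q, K, V))
```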
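For tool use, a sketch of the Toolformer idea at runtime: the model emits inline calls such as [Calculator(17*23)], and a harness executes them and splices the results back into the text. The bracket syntax and the single Calculator tool are illustrative assumptions.

```python
import re

def calculator(expr: str) -> str:
    # Restrict eval to digits and basic operators as a minimal safety check.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "ERR"
    return str(eval(expr))

TOOLS = {"Calculator": calculator}
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_tools(text: str) -> str:
    # Replace every [Tool(args)] span with that tool's output.
    return CALL.sub(lambda m: TOOLS[m.group(1)](m.group(2)), text)

print(run_tools("The total is [Calculator(17*23)] dollars."))
# -> The total is 391 dollars.
```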
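For retrieval, a minimal embedding-based lookup: documents and queries share an embedding space, and the nearest neighbors by cosine similarity become the model's context. The random "embeddings" stand in for a real encoder; at RETRO scale you'd use an approximate-nearest-neighbor index rather than a dense matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend corpus: 10,000 documents already embedded and L2-normalized.
doc_embs = rng.standard_normal((10_000, 768))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def retrieve(query_emb, k=5):
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embs @ q                  # cosine similarity to every doc
    top = np.argsort(scores)[-k:][::-1]    # indices of the k best matches
    return top, scores[top]

ids, sims = retrieve(rng.standard_normal(768))
print(ids, sims)
```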
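For surfacing uncertainty, the most basic available signal: the per-token probabilities the model assigns to its own output. These are imperfectly calibrated (hence the paper's "mostly"), but a low average log-probability is a usable first-pass confidence score. The random logits stand in for a real model's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

logits = rng.standard_normal((12, 50_000))   # one row of logits per token
tokens = rng.integers(0, 50_000, size=12)    # the tokens actually emitted

token_lp = log_softmax(logits)[np.arange(len(tokens)), tokens]
print("mean token log-prob:", token_lp.mean())
print("perplexity:", np.exp(-token_lp.mean()))  # higher = less confident
```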
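For System 2 thinking, a sketch of the ACT halting mechanism: a recurrent step runs a variable number of times per input, a learned sigmoid decides when to halt, and the output is the halting-weighted mix of intermediate states. The random weights stand in for learned parameters, and the accumulation rule is slightly simplified from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.5   # stand-in for a learned layer
w_halt = rng.standard_normal(16)          # stand-in for the halting head

def act_forward(state, max_steps=10, eps=0.01):
    total_p, output, steps = 0.0, np.zeros_like(state), 0
    while total_p < 1 - eps and steps < max_steps:
        state = np.tanh(W @ state)                  # one "ponder" step
        p = 1 / (1 + np.exp(-(w_halt @ state)))     # halting probability
        p = min(p, 1 - total_p)                     # last step takes remainder
        output += p * state
        total_p += p
        steps += 1
    return output, steps

# Different inputs halt after different numbers of steps.
for _ in range(3):
    _, n = act_forward(rng.standard_normal(16))
    print("steps used:", n)
```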
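For bootstrapping, a skeleton of a STaR-style loop: sample chain-of-thought answers, keep only those that a correctness check verifies (exact match against a known answer here, playing the critic's role), and fine-tune the model on the survivors. `sample_cot`, `extract_answer`, and `finetune` are hypothetical stand-ins, not any paper's API.

```python
def sample_cot(model, question):
    # Hypothetical: sample a reasoning chain ending in "Answer: ..." from the model.
    return model(question)

def extract_answer(cot: str) -> str:
    return cot.rsplit("Answer:", 1)[-1].strip()

def finetune(model, examples):
    # Hypothetical: return a model fine-tuned on (question, rationale) pairs.
    return model

def bootstrap(model, dataset, rounds=3, samples=4):
    for _ in range(rounds):
        keep = []
        for question, gold in dataset:
            for _ in range(samples):
                cot = sample_cot(model, question)
                if extract_answer(cot) == gold:   # critic / correctness filter
                    keep.append((question, cot))
                    break
        model = finetune(model, keep)             # train the next iteration
    return model
```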
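Finally, for planning, a compact MCTS sketch on a toy one-player game (reach a target sum by adding 1 or 2), showing the select/expand/simulate/backpropagate loop that the planning papers build on. Everything here is an illustrative toy, not any of the cited systems.

```python
import math, random

ACTIONS = (1, 2)
TARGET = 10

class Node:
    def __init__(self, total, parent=None):
        self.total, self.parent = total, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_pick(node, c=1.4):
    # Pick the child maximizing mean value + exploration bonus (UCT).
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def rollout(total):
    # Random playout; reaching the target exactly scores 1, overshooting 0.
    while total < TARGET:
        total += random.choice(ACTIONS)
    return 1.0 if total == TARGET else 0.0

def mcts(root, iters=500):
    for _ in range(iters):
        node = root
        while node.children and node.total < TARGET:     # 1. select
            node = uct_pick(node)
        if node.total < TARGET:                          # 2. expand
            for a in ACTIONS:
                node.children.setdefault(a, Node(node.total + a, node))
            node = random.choice(list(node.children.values()))
        reward = rollout(node.total)                     # 3. simulate
        while node is not None:                          # 4. backpropagate
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)

print("best first move:", mcts(Node(0)))
```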