Recent Advancements in Small Language Models (SLMs)
What are Small Language Models (SLMs)?
- SLMs are AI models that process and generate text, similar to their larger counterparts (LLMs), but with significantly fewer parameters (typically under 15 billion).
- A “small language model” generally refers to a language model that is smaller, in parameter count and overall complexity, than larger models of the same type.
- This makes them more compact, efficient, and easier to deploy on devices with limited computational resources.
- While they may not have the same breadth of knowledge or capabilities as LLMs, they can still perform a wide range of language-related tasks effectively.
Recent Advancements
Mixtral-8x7B: The Sparse Mixture of Experts (SMoE)
A striking trend in recent large language models (LLMs) such as PaLM and Llama 2 has been the convergence towards similar neural architectures: the common combination of transformer blocks built from self-attention and MLP layers.
In contrast, Mistral AI launched Mixtral-8x7B, an LLM with a notably distinct architecture, characterized by a sparse composition of eight expert networks.
Mixtral efficiently handles a large parameter set by selectively activating expert networks. It achieves this by:
- Utilizing a decoder-only architecture with a unique feedforward block.
- Employing a router network to dynamically choose two expert groups (out of eight) to process each token at every layer.
- Combining the outputs of these experts additively.
This approach results in:
- Increased parameter count without sacrificing speed or cost: Mixtral boasts 46.7B total parameters but only activates 12.9B per token, maintaining the processing efficiency of a smaller model.
- Faster and more cost-effective inference compared to models with similar parameter counts.
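The routing mechanism described above can be illustrated with a minimal NumPy sketch. This is a toy stand-in, not Mixtral's actual implementation: the experts here are simple linear maps and the router is a single weight matrix, but the core idea matches the description — score all eight experts per token, keep the top two, and combine their outputs additively with renormalized router weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, experts, router_w, k=2):
    """Route one token through the top-k of the experts and combine
    their outputs additively, weighted by the router's scores."""
    logits = router_w @ token            # one routing score per expert
    top = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    weights = softmax(logits[top])       # renormalize over the selected experts only
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# toy setup: 8 "experts", each a small random linear map (illustrative only)
rng = np.random.default_rng(0)
d = 4
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(8)]
router_w = rng.normal(size=(8, d))
out = moe_forward(rng.normal(size=d), experts, router_w)
```

Because only two of the eight expert feedforward blocks run per token, the compute per token scales with the active parameters (12.9B) rather than the total (46.7B).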
Mixtral is trained on a massive dataset of open web content, with the experts and the router learning simultaneously. Mixtral outperforms Llama 2 70B and GPT-3.5 on most benchmarks.
ORCA-2
A game-changer in language model research, Orca 2 from Microsoft demonstrates remarkable reasoning abilities, surpassing models of similar size and even rivaling giants 5–10 times larger. Its zero-shot performance on complex tasks opens new doors for future advancements.
Key Takeaways:
- Multi-Step Reasoning: Different tasks require different “thinking styles.” A small model tackling a complex question might do better breaking it down step by step, while an LLM might jump straight to the answer.
- Learning from the best: Studying how LLMs approach tasks helps us tailor strategies for smaller models, allowing them to leverage their “leaner” structure while still tackling tougher challenges.
- Orca 2 thrives on a carefully crafted set of synthetic training examples. This synthetic data teaches the model diverse reasoning techniques, from step-by-step problem-solving to “recall-then-generate” strategies. Orca 2 also learns to choose the right thinking tool for the job, matching each task with its most effective approach. This makes it a nimble reasoner, even though it is smaller than other data-hungry models, and it points to a new paradigm of using AI-generated data and feedback (in the spirit of RLAIF) for fine-tuning language models.
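The idea of matching each task with a “thinking style” can be sketched as a set of prompt templates. These templates and the strategy names are hypothetical illustrations, not Orca 2's actual training prompts, but they show the shape of the technique: the same question can be framed for step-by-step reasoning, a direct answer, or recall-then-generate.

```python
# Hypothetical prompt templates illustrating the reasoning strategies
# described above (not Orca 2's real system prompts).
STRATEGIES = {
    "step_by_step": "Solve the problem step by step, then state the final answer.\n\nQ: {q}\nA:",
    "direct": "Answer concisely.\n\nQ: {q}\nA:",
    "recall_then_generate": "First recall the relevant facts, then answer.\n\nQ: {q}\nFacts:",
}

def build_prompt(question, strategy="step_by_step"):
    """Frame the question according to the chosen reasoning strategy."""
    return STRATEGIES[strategy].format(q=question)

print(build_prompt("What is 17 * 24?", "direct"))
```

In Orca 2's training, the choice of strategy is baked into the synthetic data itself, so the model learns which framing suits which kind of task.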
This new approach promises to democratize language models, making powerful language capabilities accessible to a wider range of applications and devices.
Phi-2
Among current SLM releases, Phi-2 from Microsoft Research exhibits remarkable promise. While it has a relatively smaller size compared to other models, its reasoning and language understanding capabilities are impressive.
Packing just 2.7 billion parameters, Phi-2 punches above its weight in reasoning and language understanding, even rivaling models 25 times its size. The Phi series, pioneered by Phi-1 with 1.3 billion parameters, champions top performance in lean packages.
Fueled by a diverse blend of synthetic and real-world web data, Phi-2 is a Transformer-based model trained on 1.4 trillion tokens. Phi-2 shines in predicting the next word in a sequence, making it a valuable tool for both NLP tasks like text generation and coding tasks like code comprehension.
- Size vs. Performance: Phi-2 (2.7B) beats bigger models (7B and 13B Mistral/Llama 2) and even far larger ones (the 25x larger Llama-2-70B) on tasks like coding and math.
- Efficiency Champion: Phi-2 outperforms similar-sized Google Gemini Nano 2, proving effectiveness despite a smaller footprint.
- Despite its small size, Phi-2 dominates larger models on key benchmarks like reasoning, language, math, and coding.
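The next-word prediction loop mentioned above can be sketched with a toy stand-in for the real network. Here `bigram_model` is a hypothetical hand-crafted lookup table, not Phi-2 itself; the point is the generation loop: repeatedly ask the model for a distribution over the vocabulary, greedily pick the most likely token, and stop at end-of-sequence.

```python
# Toy sketch of autoregressive next-word prediction; `bigram_model` is a
# hand-crafted stand-in for a real model like Phi-2.
VOCAB = ["<eos>", "phi-2", "predicts", "the", "next", "word"]

def bigram_model(last_token):
    # Deterministic "probabilities" purely for illustration.
    table = {"phi-2": "predicts", "predicts": "the", "the": "next",
             "next": "word", "word": "<eos>"}
    probs = {t: 0.0 for t in VOCAB}
    probs[table.get(last_token, "<eos>")] = 1.0
    return probs

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = bigram_model(tokens[-1])
        nxt = max(probs, key=probs.get)   # greedy decoding: take the argmax token
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("phi-2"))  # phi-2 predicts the next word
```

A real model replaces the lookup table with a learned distribution, and sampling strategies (temperature, top-p) replace the greedy argmax, but the loop is the same.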
Now it’s time to shift gears and walk through a sample Phi-2 RAG demo.
Phi-2 2.7B RAG + FAISS + LlamaIndex demo
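Before the full demo, the core retrieval step of RAG can be sketched without any heavy dependencies. In the actual demo, FAISS (e.g. a flat L2/inner-product index) would hold real embeddings and LlamaIndex would orchestrate the pipeline; in this hypothetical sketch, a bag-of-words embedding and brute-force cosine search stand in for both, and the document texts are made up for illustration.

```python
import numpy as np

# Toy corpus and embedding; a real RAG pipeline would use learned embeddings
# indexed with FAISS instead of this bag-of-words + brute-force search.
DOCS = [
    "Phi-2 has 2.7 billion parameters and was trained on 1.4 trillion tokens.",
    "Mixtral-8x7B routes each token to two of eight expert networks.",
    "Orca 2 is trained on synthetic data that teaches reasoning strategies.",
]

vocab = {w: i for i, w in enumerate(sorted({w for d in DOCS for w in d.lower().split()}))}

def embed(text):
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

doc_vecs = np.stack([embed(d) for d in DOCS])

def retrieve(query, k=1):
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) + 1e-9))
    return [DOCS[i] for i in np.argsort(sims)[::-1][:k]]

# Retrieved context would then be prepended to the question and sent to Phi-2.
context = retrieve("how many parameters does phi-2 have?")[0]
prompt = f"Context: {context}\nQuestion: How many parameters does Phi-2 have?\nAnswer:"
```

The generation half of the demo simply feeds `prompt` to Phi-2; the retrieval half is what FAISS accelerates when the corpus grows beyond brute-force reach.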