Recent Advancements in Small Language Models (SLMs)
What are Small Language Models (SLMs)?
- SLMs are AI models that process and generate text, similar to their larger counterparts (LLMs), but with significantly fewer parameters (typically under 15 billion).
- A “small language model” generally refers to a language model that is smaller, in parameter count and overall complexity, than larger models of the same type.
- This makes them more compact, efficient, and easier to deploy on devices with limited computational resources.
- While they may not have the same breadth of knowledge or capabilities as LLMs, they can still perform a wide range of language-related tasks effectively.
Recent Advancements
Mixtral-8x7B: The Sparse Mixture of Experts (SMoE)
A striking trend in recent large language models (LLMs) such as PaLM and Llama 2 has been the convergence towards similar neural architectures: the common combination of transformer blocks built from self-attention and MLP layers.
In contrast, Mistral AI launched Mixtral-8x7B, an LLM with a notably distinct architecture, characterized by a sparse composition of eight expert networks.
Mixtral efficiently handles a large parameter set by selectively activating expert networks. It achieves this by:
- Utilizing a decoder-only architecture with a unique feedforward block.
- Employing a router network to dynamically choose two expert groups (out of eight) to process each token at every layer.
- Combining the outputs of these experts additively.
This approach results in:
- Increased parameter count without sacrificing speed or cost: Mixtral boasts 46.7B total parameters but only activates 12.9B per token, maintaining the processing efficiency of a smaller model.
- Faster and more cost-effective inference compared to models with similar parameter counts.
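The routing mechanism described above can be illustrated with a minimal NumPy sketch. This is a toy stand-in, not Mixtral's actual implementation: the experts here are simple linear maps and the router is a single weight matrix, but the core idea matches the description — score all eight experts per token, keep the top two, and combine their outputs additively with renormalized router weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, experts, router_w, k=2):
    """Route one token through the top-k of the experts and combine
    their outputs additively, weighted by the router's scores."""
    logits = router_w @ token            # one routing score per expert
    top = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    weights = softmax(logits[top])       # renormalize over the selected experts only
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# toy setup: 8 "experts", each a small random linear map (illustrative only)
rng = np.random.default_rng(0)
d = 4
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(8)]
router_w = rng.normal(size=(8, d))
out = moe_forward(rng.normal(size=d), experts, router_w)
```

Because only two of the eight expert feedforward blocks run per token, the compute per token scales with the active parameters (12.9B) rather than the total (46.7B).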
Mixtral is trained on a massive dataset of open web content, with the experts and the router learning simultaneously. Mixtral outperforms Llama 2 70B and GPT-3.5 on most benchmarks.
ORCA-2
A game-changer in language model research, Orca 2 from Microsoft demonstrates remarkable reasoning abilities, surpassing models of similar size and even rivaling giants 5–10 times larger. Its zero-shot performance on complex tasks opens new doors for future advancements.
Key Takeaways:
- Multi-Step Reasoning: Different tasks require different “thinking styles.” A small model tackling a complex question might do better breaking it down step by step, while an LLM might jump straight to the answer.
- Learning from the best: Studying how LLMs approach tasks helps us tailor strategies for smaller models, allowing them to leverage their “leaner” structure while still tackling tougher challenges.
- Orca 2 thrives on a carefully crafted set of synthetic training examples. This synthetic data teaches the model diverse reasoning techniques, from step-by-step problem-solving to “recall-then-generate” strategies. Orca 2 also learns to choose the right thinking tool for the job, matching each task with its most effective approach. This makes it a nimble reasoner, even though it is smaller than other data-hungry models, and it points to a new paradigm of using AI-generated data and feedback (in the spirit of RLAIF) for fine-tuning language models.
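The idea of matching each task with a “thinking style” can be sketched as a set of prompt templates. These templates and the strategy names are hypothetical illustrations, not Orca 2's actual training prompts, but they show the shape of the technique: the same question can be framed for step-by-step reasoning, a direct answer, or recall-then-generate.

```python
# Hypothetical prompt templates illustrating the reasoning strategies
# described above (not Orca 2's real system prompts).
STRATEGIES = {
    "step_by_step": "Solve the problem step by step, then state the final answer.\n\nQ: {q}\nA:",
    "direct": "Answer concisely.\n\nQ: {q}\nA:",
    "recall_then_generate": "First recall the relevant facts, then answer.\n\nQ: {q}\nFacts:",
}

def build_prompt(question, strategy="step_by_step"):
    """Frame the question according to the chosen reasoning strategy."""
    return STRATEGIES[strategy].format(q=question)

print(build_prompt("What is 17 * 24?", "direct"))
```

In Orca 2's training, the choice of strategy is baked into the synthetic data itself, so the model learns which framing suits which kind of task.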
This new approach promises to democratize language models, making powerful language capabilities accessible to a wider range of applications and devices.
Phi-2
Among current SLM releases, Phi-2 from Microsoft Research exhibits remarkable promise. While it has a relatively smaller size compared to other models, its reasoning and language understanding capabilities are impressive.
Packing just 2.7 billion parameters, Phi-2 punches above its weight in reasoning and language understanding, even rivaling models 25 times its size. The Phi series, pioneered by Phi-1 with 1.3 billion parameters, champions top performance in lean packages.
Fueled by a diverse blend of synthetic and real-world web data, Phi-2 is a Transformer-based model trained on 1.4 trillion tokens. Phi-2 shines in predicting the next word in a sequence, making it a valuable tool for both NLP tasks like text generation and coding tasks like code comprehension.
- Size vs. Performance: Phi-2 (2.7B) beats bigger models (7B and 13B Mistral/Llama 2) and even far larger ones (the 25x larger Llama-2-70B) on tasks like coding and math.
- Efficiency Champion: Phi-2 outperforms similar-sized Google Gemini Nano 2, proving effectiveness despite a smaller footprint.
- Despite its small size, Phi-2 dominates larger models on key benchmarks like reasoning, language, math, and coding.
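The next-word prediction loop mentioned above can be sketched with a toy stand-in for the real network. Here `bigram_model` is a hypothetical hand-crafted lookup table, not Phi-2 itself; the point is the generation loop: repeatedly ask the model for a distribution over the vocabulary, greedily pick the most likely token, and stop at end-of-sequence.

```python
# Toy sketch of autoregressive next-word prediction; `bigram_model` is a
# hand-crafted stand-in for a real model like Phi-2.
VOCAB = ["<eos>", "phi-2", "predicts", "the", "next", "word"]

def bigram_model(last_token):
    # Deterministic "probabilities" purely for illustration.
    table = {"phi-2": "predicts", "predicts": "the", "the": "next",
             "next": "word", "word": "<eos>"}
    probs = {t: 0.0 for t in VOCAB}
    probs[table.get(last_token, "<eos>")] = 1.0
    return probs

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = bigram_model(tokens[-1])
        nxt = max(probs, key=probs.get)   # greedy decoding: take the argmax token
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("phi-2"))  # phi-2 predicts the next word
```

A real model replaces the lookup table with a learned distribution, and sampling strategies (temperature, top-p) replace the greedy argmax, but the loop is the same.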
Now it’s time to shift gears and walk through a sample Phi-2 RAG demo.
Phi-2 2.7B RAG + FAISS + LlamaIndex demo
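Before the full demo, the core retrieval step of RAG can be sketched without any heavy dependencies. In the actual demo, FAISS (e.g. a flat L2/inner-product index) would hold real embeddings and LlamaIndex would orchestrate the pipeline; in this hypothetical sketch, a bag-of-words embedding and brute-force cosine search stand in for both, and the document texts are made up for illustration.

```python
import numpy as np

# Toy corpus and embedding; a real RAG pipeline would use learned embeddings
# indexed with FAISS instead of this bag-of-words + brute-force search.
DOCS = [
    "Phi-2 has 2.7 billion parameters and was trained on 1.4 trillion tokens.",
    "Mixtral-8x7B routes each token to two of eight expert networks.",
    "Orca 2 is trained on synthetic data that teaches reasoning strategies.",
]

vocab = {w: i for i, w in enumerate(sorted({w for d in DOCS for w in d.lower().split()}))}

def embed(text):
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

doc_vecs = np.stack([embed(d) for d in DOCS])

def retrieve(query, k=1):
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) + 1e-9))
    return [DOCS[i] for i in np.argsort(sims)[::-1][:k]]

# Retrieved context would then be prepended to the question and sent to Phi-2.
context = retrieve("how many parameters does phi-2 have?")[0]
prompt = f"Context: {context}\nQuestion: How many parameters does Phi-2 have?\nAnswer:"
```

The generation half of the demo simply feeds `prompt` to Phi-2; the retrieval half is what FAISS accelerates when the corpus grows beyond brute-force reach.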