Recent Advancements in Small Language Models (SLMs)

Rajesh K
Published in GoPenAI · 4 min read · Dec 24, 2023


What are Small Language Models (SLMs)?

  • SLMs are AI models that process and generate text, much like their larger counterparts (LLMs), but with significantly fewer parameters (typically under 15 billion).
  • A “small language model” generally refers to a model that is smaller in parameter count and overall complexity than larger models of the same type.
  • This makes them more compact, efficient, and easier to deploy on devices with limited computational resources.
  • While they may not have the same breadth of knowledge or capabilities as LLMs, they can still perform a wide range of language-related tasks effectively.

Recent Advancements

Mixtral-8x7B: The Sparse Mixture of Experts (SMoE)

A striking trend in recent large language models (LLMs) like PaLM and Llama 2 has been the convergence towards similar neural architectures: the familiar combination of transformer blocks built from self-attention and MLP layers.

In contrast, Mistral AI launched Mixtral-8x7B, a model with a notably different architecture: a sparse composition of 8 expert networks.

Mixtral efficiently handles a large parameter set by selectively activating expert networks. It achieves this by:

  • Utilizing a decoder-only architecture with a unique feedforward block.
  • Employing a router network to dynamically choose two expert groups (out of eight) to process each token at every layer.
  • Combining the outputs of these experts additively.
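
To make the routing concrete, here is a minimal PyTorch sketch of a top-2 sparse mixture-of-experts feedforward block. The layer sizes, activation, and naive per-expert loop are illustrative assumptions, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Toy top-2 mixture-of-experts feedforward block (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is an ordinary feedforward (MLP) sub-network.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        gate_vals, expert_idx = logits.topk(self.top_k, dim=-1)  # choose 2 experts per token
        gate_vals = F.softmax(gate_vals, dim=-1)         # normalize the two gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    # Combine expert outputs additively, weighted by the router.
                    out[mask] += gate_vals[mask, slot:slot + 1] * expert(x[mask])
        return out

# Only 2 of the 8 expert MLPs run for each token, so the active parameters per
# token stay far below the total parameter count.
block = SparseMoEBlock()
print(block(torch.randn(4, 512)).shape)                  # torch.Size([4, 512])
```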

This approach results in:

  • Increased parameter count without sacrificing speed or cost: Mixtral boasts 46.7B total parameters but only activates 12.9B per token, maintaining the processing efficiency of a smaller model.
  • Faster and more cost-effective inference compared to models with similar parameter counts.

Mixtral is trained on a massive dataset of open web content, with the experts and routers learning simultaneously. Mixtral outperforms Llama 2 70B and GPT-3.5 on most benchmarks.

Source: https://mistral.ai/

ORCA-2

A game-changer in language model research, Orca 2 from Microsoft demonstrates remarkable reasoning abilities, surpassing models of similar size and even rivaling giants 5–10 times larger. Its zero-shot performance on complex tasks opens new doors for future advancements.

Key Takeaways:

  1. Multi-Step Reasoning: Different tasks require different “thinking styles.” A small model tackling a complex question might do better breaking it down step by step, while an LLM might jump straight to the answer.
  2. Learning from the best: Studying how LLMs approach tasks helps us tailor strategies for smaller models. This allows them to leverage their “leaner” structure while still tackling tougher challenges.
  3. Orca 2 thrives on a carefully crafted universe of synthetic training examples. This synthetic data teaches the model diverse reasoning tricks, from step-by-step problem-solving to clever “recall-then-generate” strategies. Orca 2 also learns to choose the right thinking tool for the job, matching each task with its most effective approach. This makes it a nimble thinker, even though it is smaller than other data-hungry models. It also adds a new paradigm of using RLAIF for LLM fine-tuning.
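
As a quick illustration of zero-shot, step-by-step prompting with a small model, here is a hedged sketch using Hugging Face transformers. It assumes the public microsoft/Orca-2-7b checkpoint; the prompt wording and generation settings are illustrative, not the paper's exact recipe.

```python
# Hedged sketch: zero-shot step-by-step reasoning with a small model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-7b"   # assumed public checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Ask the small model to reason step by step before giving its final answer.
prompt = (
    "You are a careful assistant. Think step by step, then state the final answer.\n"
    "Question: A train travels 60 km in 40 minutes. What is its speed in km/h?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Print only the newly generated continuation, not the prompt itself.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```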

This new approach promises to democratize language models, making powerful language capabilities accessible to a wider range of applications and devices.

Phi-2

Among current SLM releases, Phi-2 from Microsoft Research shows remarkable promise. While it is smaller than many other models, its reasoning and language understanding capabilities are impressive.

With just 2.7 billion parameters, Phi-2 packs a punch in reasoning and language understanding, even rivalling models 25 times its size. The Phi series, pioneered by Phi-1 with 1.3 billion parameters, champions top performance in lean packages.

Fueled by a diverse blend of synthetic and real-world web data, Phi-2 is a Transformer-based model trained on 1.4 trillion tokens. Phi-2 shines at predicting the next word in a sequence, making it a valuable tool for both NLP tasks like text generation and coding tasks like code comprehension.
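
As a small illustration of that next-token strength on code, here is a hedged sketch that asks Phi-2 to complete a function body via Hugging Face transformers; the prompt and generation settings are illustrative choices, not an official example.

```python
# Minimal sketch: code completion with the public microsoft/phi-2 checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

# Give the model a function signature and docstring; it predicts the body token by token.
prompt = 'def is_prime(n: int) -> bool:\n    """Return True if n is a prime number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```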

  1. Size vs. Performance: Phi-2 (2.7B) beats bigger models (the 7B and 13B Mistral/Llama 2 variants) and even the 25x-larger Llama-2-70B on tasks like coding and math.
  2. Efficiency Champion: Phi-2 outperforms the similarly sized Google Gemini Nano 2, proving its effectiveness despite a smaller footprint.
  3. Despite its small size, Phi-2 dominates larger models on key benchmarks covering reasoning, language understanding, math, and coding.

Source: https://www.microsoft.com/

Now it’s time to shift gears and look at a sample Phi-2 RAG demo.

Phi-2 2.7B RAG + FAISS + LlamaIndex demo
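
Below is a minimal sketch of how such a pipeline could be wired together. It assumes the llama-index packages (core plus the FAISS and Hugging Face integrations) and faiss-cpu are installed; the ./data folder, embedding model, and generation settings are illustrative choices rather than the article's exact demo, and class names follow recent llama-index releases.

```python
# Hedged sketch: Phi-2 as the generator, FAISS as the vector store, LlamaIndex as the glue.
import faiss
from llama_index.core import (Settings, SimpleDirectoryReader, StorageContext,
                              VectorStoreIndex)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.vector_stores.faiss import FaissVectorStore

# Phi-2 handles generation; a small sentence embedder handles retrieval.
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/phi-2",
    tokenizer_name="microsoft/phi-2",
    context_window=2048,
    max_new_tokens=256,
    device_map="auto",
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# FAISS flat index sized to the embedding dimension (384 for bge-small).
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(384))
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index local documents, then answer a question grounded in the retrieved chunks.
documents = SimpleDirectoryReader("./data").load_data()   # hypothetical folder of files
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What do these documents say about small language models?"))
```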
