Llama-1 and Llama-2 have different vocabularies

Michael Humor · Published in GoPenAI · Sep 11, 2023

Although Llama-1 and Llama-2 both use a vocabulary of 32K tokens (n_vocab = 32000), the contents of their vocabularies differ when dumped:

Llama-2:

Token  Id
-----------
<unk> 0
<s> 1
</s> 2
<0x00> 3
<0x01> 4
<0x02> 5
<0x03> 6
<0x04> 7
<0x05> 8
<0x06> 9
<0x07> 10
<0x08> 11
<0x09> 12
...
房 31975
명 31976
两 31977
ფ 31978
才 31979
합 31980
止 31981
番 31982
ɯ 31983
奇 31984
怪 31985
联 31986
역 31987
泰 31988
백 31989
ὀ 31990
げ 31991
べ 31992
边 31993
还 31994
黃 31995
왕 31996
收 31997
弘 31998
给 31999
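
One pattern is visible in the Llama-2 table above: ids 3 through 258 are byte-fallback tokens `<0x00>` … `<0xFF>`, which let the tokenizer encode any byte sequence its merges do not cover. The id of a byte token is simply the byte value plus 3 (`<0x00>` → 3, `<0x09>` → 12). A quick sketch of that mapping (the helper names here are mine, not part of any library):

```python
def byte_token_id(byte_value: int) -> int:
    """Id of the <0xNN> byte-fallback token; ids 3..258 in the Llama-2 vocab."""
    assert 0 <= byte_value <= 0xFF
    return byte_value + 3

def byte_token_piece(byte_value: int) -> str:
    """Piece string as it appears in the vocab dump, e.g. <0x0A>."""
    return f"<0x{byte_value:02X}>"

# Matches the table rows above.
print(byte_token_piece(0x00), byte_token_id(0x00))  # <0x00> 3
print(byte_token_piece(0x09), byte_token_id(0x09))  # <0x09> 12
```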

Llama-1:

Token  Id
-----------
â<81><87> 0
1
2
3
^A 4
^B 5
^C 6
^D 7
^E 8
^F 9
^G 10
^H 11
...
á½<80> 31990
ã<81><92> 31991
ã<81>¹ 31992
è¾¹ 31993
è¿<98> 31994
é»<83> 31995
ì<99><95> 31996
æ<94>¶ 31997
å¼<98> 31998
ç»<99> 31999
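
A note on the Llama-1 listing: it appears to print each piece's raw UTF-8 bytes one character per byte (i.e., as if the bytes were decoded with Latin-1), which is why multi-byte characters come out garbled. For example, the UTF-8 bytes of 边 are E8 BE B9, which render byte-by-byte as "è¾¹" — exactly the string shown at id 31993 here, the same id where the Llama-2 dump shows 边. This can be reproduced directly (a minimal sketch; `as_byte_rendering` is my name for it):

```python
def as_byte_rendering(s: str) -> str:
    """Show a string's UTF-8 bytes the way the Llama-1 dump displays them:
    each byte interpreted as a single Latin-1 character."""
    return s.encode("utf-8").decode("latin-1")

print(as_byte_rendering("边"))  # è¾¹  (id 31993 in both tables)
```

Non-printable bytes (such as 0x98 in 还's encoding, E8 BF 98) are what the dump shows in `<98>`-style bracket notation.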
