Running LLaMA on an M2 MacBook

Meta released LLaMA, a state-of-the-art large language model, about a month ago. These models normally need beefy hardware to run, but thanks to the llama.cpp project it is possible to run them on personal machines. The steps below walk through running LLaMA on my M2 MacBook (96 GB RAM, 12-core) with Python 3.11.2.

Download the model(s)

You can request the model weights officially from Meta via this form, or use one of the methods listed in this PR open in the repository. There are four model sizes: LLaMA 7B, LLaMA 13B, LLaMA 30B, and LLaMA 65B.

Set up the environment

Install the dependencies for compiling main.cpp.

brew install pkgconfig cmake

Next, create a requirements.txt with the following packages. torch for Python 3.11 is installed from the nightly builds (see below).

numpy
sentencepiece

Next, set up a Python virtual environment and install the dependencies for converting the PyTorch checkpoints (.pth) to fp16 ggml models (ggml-model-f16.bin).

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
python3 -m pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
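
As a quick sanity check (not part of the original instructions), you can confirm that the nightly torch wheel imports under Python 3.11:

python3 -c 'import torch; print(torch.__version__)'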

Set up the model(s)

Clone the repo.

git clone https://github.com/ggerganov/llama.cpp
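
The conversion step below expects the downloaded weights and tokenizer under the repo's models/ directory. A minimal sketch of copying them into place, assuming the LLaMA release was downloaded to ~/Downloads/LLaMA (the source path is a placeholder; copy only the flavors you actually have):

cd llama.cpp
mkdir -p models/7B models/13B models/30B models/65B
cp ~/Downloads/LLaMA/tokenizer.model models/
cp ~/Downloads/LLaMA/7B/* models/7B/
cp ~/Downloads/LLaMA/13B/* models/13B/
cp ~/Downloads/LLaMA/30B/* models/30B/
cp ~/Downloads/LLaMA/65B/* models/65B/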

Compile the binaries (main and quantize).

make

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main  -framework Accelerate
./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -i, --interactive     run in interactive mode
  --interactive-start   run in interactive mode and poll user input at startup
  -r PROMPT, --reverse-prompt PROMPT
                        in interactive mode, poll user input upon seeing PROMPT
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -f FNAME, --file FNAME
                        prompt file to start generation.
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --repeat_last_n N     last n tokens to consider for penalize (default: 64)
  --repeat_penalty N    penalize repeat sequence of tokens (default: 1.3)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize  -framework Accelerate

Convert the models to ggml fp16 format. The trailing 1 selects fp16 output, and models that ship in multiple parts produce ggml-model-f16.bin, ggml-model-f16.bin.1, and so on.

python3 convert-pth-to-ggml.py models/7B 1
python3 convert-pth-to-ggml.py models/13B 1
python3 convert-pth-to-ggml.py models/30B 1
python3 convert-pth-to-ggml.py models/65B 1
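
To confirm that conversion produced the expected parts before quantizing (an optional check; the 65B model, shown here, has eight parts, matching n_parts = 8 in the load log further down):

ls -lh ./models/65B/ggml-model-f16.bin*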

Quantize the models to 4 bits; the final 2 argument selects the q4_0 format, and each part is quantized separately (a loop version is sketched after the commands below).

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2

./quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2

./quantize ./models/65B/ggml-model-f16.bin ./models/65B/ggml-model-q4_0.bin 2
./quantize ./models/65B/ggml-model-f16.bin.1 ./models/65B/ggml-model-q4_0.bin.1 2
./quantize ./models/65B/ggml-model-f16.bin.2 ./models/65B/ggml-model-q4_0.bin.2 2
./quantize ./models/65B/ggml-model-f16.bin.3 ./models/65B/ggml-model-q4_0.bin.3 2
./quantize ./models/65B/ggml-model-f16.bin.4 ./models/65B/ggml-model-q4_0.bin.4 2
./quantize ./models/65B/ggml-model-f16.bin.5 ./models/65B/ggml-model-q4_0.bin.5 2
./quantize ./models/65B/ggml-model-f16.bin.6 ./models/65B/ggml-model-q4_0.bin.6 2
./quantize ./models/65B/ggml-model-f16.bin.7 ./models/65B/ggml-model-q4_0.bin.7 2
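
Equivalently, the per-part commands above can be wrapped in a small shell loop; a minimal sketch, assuming bash/zsh and the file names produced by the conversion step:

for f in ./models/*/ggml-model-f16.bin*; do
    ./quantize "$f" "${f/f16/q4_0}" 2
done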

Run the model(s)

LLaMA works best with prompts where the expected answer is a natural continuation of the prompt; it doesn't work well for direct question answering. With 96 GB of RAM and 8 threads, the 4-bit 65B model generates at roughly 500 ms per token (about 2 tokens per second).

./main -m ./models/65B/ggml-model-q4_0.bin \
    -t 8 -n 256 --top_p 2 --top_k 40 \
    --repeat_penalty 1.176 \
    --temp 0.7 \
    -p 'Simply put, the relation between science and philosophy is '


main: seed = 1678903829
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

main: prompt: 'Simply put, the relation between science and philosophy is '
main: number of tokens in prompt = 13
     1 -> ''
  8942 -> 'Sim'
 17632 -> 'ply'
  1925 -> ' put'
 29892 -> ','
   278 -> ' the'
  8220 -> ' relation'
  1546 -> ' between'
 10466 -> ' science'
   322 -> ' and'
 22237 -> ' philosophy'
   338 -> ' is'
 29871 -> ' '

sampling parameters: temp = 0.700000, top_k = 40, top_p = 2.000000, repeat_last_n = 64, repeat_penalty = 1.176000


Simply put, the relation between science and philosophy is 100% positive. It has always been so since the emergence of philosophical thought in ancient Greece. Philosophers have always relied on scientific discoveries for their ideas - to give them a realistic basis and make sure they correspond with known facts about our world (i.e., epistemology, metaphysics). Scientists are also guided by philosophy: the methodologies of science are formulated in philosophical terms, as are many theories that later develop into new sciences - e.g., Darwin's evolutionary theory was developed in a thoroughly philosophic way and it is only at its latest stage that genetic research became possible (i.e., biology).
When the question of philosophy being useless emerges, it usually means this: science has so far advanced as to make any other source of knowledge obsolete - i.e., we can rely on empirical facts and data only; no need for anything else if we want truth! This is a very naive point of view for those who know that even scientific theories are based on philosophic grounds, such as the principles of causality or uniformity (the principle stating that natural processes happen today, yesterday, and tomorrow in exactly the same way

main: mem per token = 70897348 bytes
main:     load time = 14309.80 ms
main:   sample time =   217.87 ms
main:  predict time = 137123.06 ms / 511.65 ms per token
main:    total time = 154096.00 ms

The fp16 version seems to produce better output for this prompt, but it is atrociously slow at ~315,090 ms per token. This is expected: the fp16 65B model needs about 127 GB (see the ggml ctx size below), which doesn't fit in 96 GB of RAM, so macOS swaps heavily during generation.

./main -m ./models/65B/ggml-model-f16.bin \
    -t 8 -n 256 --top_p 2 --top_k 40 \
    --repeat_penalty 1.176 \
    --temp 0.7 \
    -p 'Simply put, the relation between science and philosophy is '


main: seed = 1678906439
llama_model_load: loading model from './models/65B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 1
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 127085.23 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-f16.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-f16.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-f16.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-f16.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-f16.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-f16.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-f16.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-f16.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 15570.03 MB / num tensors = 723

main: prompt: 'Simply put, the relation between science and philosophy is '
main: number of tokens in prompt = 13
     1 -> ''
  8942 -> 'Sim'
 17632 -> 'ply'
  1925 -> ' put'
 29892 -> ','
   278 -> ' the'
  8220 -> ' relation'
  1546 -> ' between'
 10466 -> ' science'
   322 -> ' and'
 22237 -> ' philosophy'
   338 -> ' is'
 29871 -> ' '

sampling parameters: temp = 0.700000, top_k = 40, top_p = 2.000000, repeat_last_n = 64, repeat_penalty = 1.176000


Simply put, the relation between science and philosophy is 100% a product of their respective intellectual histories.
It has always bothered me that many philosophers see themselves as being in competition with scientists to answer questions about human nature or even (as we'll see) what it means to understand something. I would actually say there are two competitions going on here, and each is a product of the history of philosophy since its "birth" around 500 BC.
The first competition arises because philosophers have for centuries been engaged in trying to answer questions about human nature or what it means to understand something without ever having any real data to go on except introspection, and they didn't really know how to use that very well (and still don't). When the scientific method started being used during the Renaissance, suddenly philosophers had competition. They weren't the only ones thinking about these questions anymore; scientists were now in the game too.
The second competition is more recent and arises from a long, slow shift within philosophy that began early last century (and which I will discuss at length in a later post). Simply put, this shift was away from attempting to answer basic philosophical questions by argumentation alone towards trying to gather data about how

main: mem per token = 70897348 bytes
main:     load time = 115983.75 ms
main:   sample time =   915.42 ms
main:  predict time = 84444192.00 ms / 315090.28 ms per token
main:    total time = 84656424.00 ms