Git LFS initialized.
Cloning into '/content/fix'...
remote: Enumerating objects: 74, done.
remote: Total 74 (delta 0), reused 0 (delta 0), pack-reused 74
Unpacking objects: 100% (74/74), 21.93 KiB | 1.15 MiB/s, done.

author -- TheBloke
repo -- MythoMax-L2-13B-GGUF
branch -- main
filename -- mythomax-l2-13b.Q6_K.gguf
link_raw -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF
link_main -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF
link_repo -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/tree/main
link_file -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/blob/main/mythomax-l2-13b.Q6_K.gguf
link_file_download -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q6_K.gguf
format -- gguf
backend -- koboldcpp
mode -- file
beaks -- 13
quantz -- q6_k
quantz_num -- 6
bits -- unknown
pointer -- mythomax-l2-13b.Q6_K.gguf
path_pointer -- /content/colabTemp/mythomax-l2-13b.Q6_K.gguf
link_pointer -- https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q6_K.gguf

::: NOTIFICATIONS :::
model has 6 quantz and 13b - Colab will work alright but higher context might not available
::: Colab is magic :::

...[context] user didn't set context. will calculate max context automatically
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::: notebook automatically calculated the about-right* CONTEXT for this model: ::: 4096
::: if model does not work properly with that context (CUDA error) then try lower it by 512
::: it is only an approximate value - test and report back. especially with non-13b models
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

::: DOWNLOAD BACKEND, MODEL, LORA :::
...[backend] processing the backend: kobold.cpp
...[cloudflare] nohup.out from the previous instance was detected, deleting
...[model] processing the model: https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF
...[backend] kobold.cpp is already installed
...[cloudflare] launching Cloudflare and waiting for an answer
...[model] model is already downloaded
...[LoRA] processing LoRA
...[LoRA] LoRAs are specified, will process them now
...[LoRA] processing LoRA https://huggingface.co/hfmlp/llama-2-13b-pny-3e
...[LoRA] LoRA llama-2-13b-pny-3e is already downloaded
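For reference, the file reported above as already downloaded is the one at link_file_download, saved to path_pointer. A minimal sketch of that download step is below, assuming requests and a plain streamed GET; the notebook's actual downloader may be different.

    import requests

    # URL and destination taken from link_file_download / path_pointer in the log above.
    url = "https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q6_K.gguf"
    dest = "/content/colabTemp/mythomax-l2-13b.Q6_K.gguf"

    # Stream the ~10 GiB GGUF file to disk in 1 MiB chunks instead of buffering it in RAM.
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)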
::: LAUNCHING BACKEND :::
nohup: appending output to 'nohup.out'
Cloudflare tunnel was created, here is the link:
2023-11-09T10:05:27Z INF Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps
2023-11-09T10:05:27Z INF Requesting new quick Tunnel on trycloudflare.com...
2023-11-09T10:05:28Z INF +--------------------------------------------------------------------------------------------+
2023-11-09T10:05:28Z INF | Your quick Tunnel has been created! Visit it at (it may take some time to be reachable): |
2023-11-09T10:05:28Z INF | https://seat-starring-null-discounts.trycloudflare.com |
2023-11-09T10:05:28Z INF +--------------------------------------------------------------------------------------------+
2023-11-09T10:05:28Z INF Cannot determine default configuration path. No file [config.yml config.yaml] in [~/.cloudflared ~/.cloudflare-warp ~/cloudflare-warp /etc/cloudflared /usr/local/etc/cloudflared]
2023-11-09T10:05:28Z INF Version 2023.10.0
2023-11-09T10:05:28Z INF GOOS: linux, GOVersion: go1.20.6, GoArch: amd64
2023-11-09T10:05:28Z INF Settings: map[ha-connections:1 protocol:quic url:http://localhost:5001]
2023-11-09T10:05:28Z INF Generated Connector ID: 039a7411-8213-4b9c-8190-81ebf24e5e6f
2023-11-09T10:05:28Z INF Autoupdate frequency is set autoupdateFreq=86400000
2023-11-09T10:05:28Z INF Initial protocol quic
2023-11-09T10:05:28Z INF ICMP proxy will use 172.28.0.12 as source for IPv4
2023-11-09T10:05:28Z INF ICMP proxy will use :: as source for IPv6
2023-11-09T10:05:28Z INF Starting metrics server on 127.0.0.1:33803/metrics
2023/11/09 10:05:28 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023-11-09T10:05:28Z INF Registered tunnel connection connIndex=0 connection=718e7e64-d24e-455d-97e6-2e427e21ec1a event=0 ip=198.41.200.113 location=ord06 protocol=quic

backend is launched with the following flags:
python /content/colabMain/koboldcpp/koboldcpp.py --highpriority --threads 2 --usecublas normal 0 mmq --gpulayers 43 --hordeconfig MythoMax-L2-13B-GGUF --lora /content/colabTemp/llama-2-13b-pny-3e/adapter_model.bin --model /content/colabTemp/mythomax-l2-13b.Q6_K.gguf --context 4096
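The tunnel link above comes from an account-less Cloudflare quick tunnel. A rough sketch of the "launching Cloudflare and waiting for an answer" step follows, assuming cloudflared is on PATH and that the notebook simply polls nohup.out for the assigned trycloudflare.com URL; the real notebook code may differ.

    import re
    import subprocess
    import time

    # Start an account-less quick tunnel to the local KoboldCpp port (5001 per the
    # Settings line above) and keep its output in nohup.out, as in the log.
    subprocess.Popen(
        "nohup cloudflared tunnel --url http://localhost:5001 >> nohup.out 2>&1 &",
        shell=True,
    )

    # Poll nohup.out until cloudflared prints the generated trycloudflare.com hostname.
    link = None
    while link is None:
        time.sleep(2)
        try:
            text = open("nohup.out").read()
        except FileNotFoundError:
            continue
        match = re.search(r"https://[a-z0-9-]+\.trycloudflare\.com", text)
        link = match.group(0) if match else None

    print("Cloudflare tunnel was created, here is the link:", link)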
***
Welcome to KoboldCpp - Version 1.46.1
Setting process to Higher Priority - Use Caution
High Priority for Linux Set: 0 to 1
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(model='/content/colabTemp/mythomax-l2-13b.Q6_K.gguf', model_param='/content/colabTemp/mythomax-l2-13b.Q6_K.gguf', port=5001, port_param=5001, host='', launch=False, lora=['/content/colabTemp/llama-2-13b-pny-3e/adapter_model.bin'], config=None, threads=2, blasthreads=2, highpriority=True, contextsize=4096, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=-1, skiplauncher=False, hordeconfig=['MythoMax-L2-13B-GGUF'], noblas=False, useclblast=None, usecublas=['normal', '0', 'mmq'], gpulayers=43, tensor_split=None, onready='', multiuser=False, foreground=False)
==========
Loading model: /content/colabTemp/mythomax-l2-13b.Q6_K.gguf
[Threads: 2, BlasThreads: 2, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /content/colabTemp/mythomax-l2-13b.Q6_K.gguf (version GGUF V2 (latest))
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 9.95 GiB (6.56 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 10183.83 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 128.29 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 10055.54 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
WARNING: failed to allocate 3202.00 MB of pinned memory: out of memory
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MB
llama_new_context_with_model: kv self size = 3200.00 MB
llama_new_context_with_model: compute buffer total size = 363.88 MB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MB
llama_new_context_with_model: total VRAM used: 13613.54 MB (model: 10055.54 MB, context: 3558.00 MB)
Attempting to apply LORA adapter: /content/colabTemp/llama-2-13b-pny-3e/adapter_model.bin
llama_apply_lora_from_file_internal: applying lora adapter from '/content/colabTemp/llama-2-13b-pny-3e/adapter_model.bin' - please wait ...
llama_apply_lora_from_file_internal: unsupported file version
gpttype_load_model: error: failed to apply lora adapter
Load Model OK: False
Could not load model: /content/colabTemp/mythomax-l2-13b.Q6_K.gguf
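The final error is the actionable part of this log: the GGUF model itself loads fine (all 43 layers offloaded to the T4), but llama.cpp's LoRA loader rejects /content/colabTemp/llama-2-13b-pny-3e/adapter_model.bin. A likely cause is that --lora was pointed at a raw PEFT/PyTorch adapter rather than an adapter in the GGML LoRA format this loader expects. If that is the case here, a conversion along the following lines should help; the llama.cpp checkout path is hypothetical, and convert-lora-to-ggml.py is the conversion helper llama.cpp shipped around the time of this log.

    import subprocess

    LLAMA_CPP_DIR = "/content/llama.cpp"                # hypothetical checkout location
    LORA_DIR = "/content/colabTemp/llama-2-13b-pny-3e"  # adapter directory from the log

    # Convert the PEFT adapter (adapter_config.json + adapter_model.bin) into the
    # GGML LoRA format that koboldcpp/llama.cpp can apply at load time.
    subprocess.run(
        ["python", f"{LLAMA_CPP_DIR}/convert-lora-to-ggml.py", LORA_DIR],
        check=True,
    )

    # The converter writes ggml-adapter-model.bin next to the PEFT files; relaunch
    # the backend with --lora pointing at that file instead of adapter_model.bin.
    print(f"--lora {LORA_DIR}/ggml-adapter-model.bin")

Alternatively, merging the LoRA into the base model offline and re-quantizing the result avoids runtime adapter application entirely, at the cost of hosting a separate merged GGUF.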