TinyLlama: Exploring the Small Yet Powerful Language Model's Gradio Demo

Nov 11, 2024

In this article we will explore TinyLlama, a compact 1.1B language model pre-trained on around 1 trillion tokens for approximately 3 epochs. TinyLlama is built on the architecture and tokenizer of Llama 2 (Touvron et al., 2023b), a recent addition to the advancements of the open-source community. TinyLlama not only improves computational efficiency but also outperforms other comparably sized language models on various downstream tasks, showcasing its remarkable performance.

Introduction

Recent developments in natural language processing (NLP) have mainly resulted from the scaling up of language model sizes. Large Language Models (LLMs), pre-trained on extensive text corpora, have proven highly effective across diverse tasks such as text summarization, content creation, sentence structuring, and many more. A large number of studies highlight emergent capabilities in LLMs, such as few-shot prompting and chain-of-thought reasoning, which are more prominent in models with a significant number of parameters. Additionally, these research efforts emphasize the importance of scaling both model size and training data together for optimal computational efficiency. This insight guides the choice of model size and data allocation, especially when working with a fixed compute budget.

Large models are often preferred, which frequently leads to smaller models being overlooked.

Language models designed for optimal inference aim to achieve peak performance within defined inference constraints. This is accomplished by training models on a higher number of tokens than recommended by the scaling law. Interestingly, smaller models, when exposed to more training data, can reach or surpass the performance of their larger counterparts.

TinyLlama focuses on training the model on a large number of tokens rather than strictly following the scaling law.

This is the first attempt to train a model with 1B parameters using such a large amount of data. - Original Research Paper

TinyLlama demonstrates strong performance compared to other open-source language models of similar size, outperforming both OPT-1.3B and Pythia-1.4B across various downstream tasks. The model is open source, contributing to increased accessibility for language model researchers. Its impressive performance and compact size position it as an appealing option for both researchers and practitioners in the A.I. field.

Prerequisites

  • pip: Update pip and install the necessary packages.
  • GPU (Optional): For better performance, use a machine with an NVIDIA GPU and CUDA support.
  • Dependencies: Install required packages (e.g., torch, transformers, gradio).

These setups will help you explore the TinyLlama Gradio demo smoothly.

Gradio App Demo of TinyLlama

This article provides a concise introduction to TinyLlama, featuring a demonstration through a Gradio app. Gradio provides an efficient way to showcase models by turning them into user-friendly web interfaces that are accessible to a broader audience.

First, verify that a GPU is available and check its properties:

import torch

# Detect whether a CUDA-capable GPU is available and select the device accordingly
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print("Device: ", device)

if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

Sample output:

__CUDNN VERSION: 8401
__Number CUDA Devices: 1
__CUDA Device Name: NVIDIA RTX A4000
__CUDA Device Total Memory [GB]: 16.89124864

Pretraining and Model Architecture

The model has been pre-trained using the natural language data from SlimPajama and the code data from Starcoderdata.
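Both corpora are available on the Hugging Face Hub and can be streamed without downloading them in full. The snippet below is a small sketch for peeking at SlimPajama; the dataset id (cerebras/SlimPajama-627B) refers to the public Hub release and is an assumption on my part, not something specified in this article.

from datasets import load_dataset

# Stream a few SlimPajama documents without downloading the full corpus.
# The dataset id is the public Hub release, used here purely for illustration.
slim_pajama = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, sample in enumerate(slim_pajama):
    print(sample["text"][:200])  # first 200 characters of each document
    if i == 2:
        break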

TinyLlama follows a transformer-based architecture similar to Llama 2.


Model Architecture Overview (Source)

The model uses RoPE (Rotary Positional Embedding) for positional embedding, a method used in recent large language models like PaLM, Llama, and Qwen. Pre-normalization is employed with RMSNorm for stable training. Instead of ReLU, the SwiGLU activation function (Swish and Gated Linear Unit) from Llama 2 is used. For efficient memory usage, grouped-query attention is adopted, with 32 heads for query attention and 4 groups of key-value heads, allowing key and value representations to be shared across multiple heads without significant performance loss.
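These choices are visible directly in the model configuration shipped with the checkpoint. The snippet below is a minimal sketch that inspects the checkpoint used later in the code demo; the printed values depend on the checkpoint you load, so treat them as illustrative.

from transformers import AutoConfig

# Inspect the architecture hyperparameters of the TinyLlama checkpoint used in the demo below
config = AutoConfig.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.1")

print("query heads:       ", config.num_attention_heads)  # 32 query heads
print("key/value groups:  ", config.num_key_value_heads)  # 4 grouped key/value heads
print("activation:        ", config.hidden_act)           # "silu", the Swish part of SwiGLU
print("hidden size:       ", config.hidden_size)
print("transformer layers:", config.num_hidden_layers)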

Furthermore, the researchers incorporated Fully Sharded Data Parallel (FSDP) to optimize the utilization of multi-GPU and multi-node setups during training. This integration is crucial for efficiently scaling the training process across multiple computing nodes, and it resulted in a significant improvement in training speed and efficiency. Another important enhancement is the integration of Flash Attention 2 (Dao, 2023), an optimized attention mechanism. In addition, replacing the SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module led to a reduction in the memory footprint. As a result, the 1.1B model can now comfortably fit within 40GB of GPU RAM.

The incorporation of these elements has propelled our training throughput to 24,000 tokens per second per A100-40G GPU. When compared with other models like Pythia-1.0B (Biderman et al., 2023) and MPT-1.3B, our codebase demonstrates superior training speed. For instance, the TinyLlama-1.1B model requires only 3,456 A100 GPU hours for 300B tokens, in contrast to Pythia's 4,830 and MPT's 7,920 hours. This shows the effectiveness of our optimizations and the potential for significant time and resource savings in large-scale model training. - Original Research Paper


Comparison of the Training Speed (Source)
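While FSDP and Flash Attention 2 mainly pay off during pre-training, recent transformers releases also let you enable Flash Attention 2 at inference time. The snippet below is a minimal sketch, assuming a CUDA GPU, half-precision weights, and the separate flash-attn package installed; it is not part of the original TinyLlama code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load TinyLlama with Flash Attention 2 enabled for inference.
# Assumes a CUDA GPU and the flash-attn package; otherwise loading will raise an error.
model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

inputs = tokenizer("### Human: What is TinyLlama?### Assistant:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))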

Code Demo

Let us take a closer look at TinyLlama. Before we start, please make sure that you have transformers>=4.31.
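You can quickly confirm the installed version before proceeding:

import transformers

# The chat pipeline below expects a reasonably recent release (>= 4.31)
print(transformers.__version__)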

Install the necessary packages

!pip install accelerate
!pip install transformers==4.36.2
!pip install gradio

Once the packages are installed, make sure to restart the kernel.

Import the necessary libraries

from transformers import AutoTokenizer
import transformers
import torch

Initialize the model and the tokenizer, and use TinyLlama to generate text

model = "PY007/TinyLlama-1.1B-Chat-v0.1" tokenizer = AutoTokenizer.from_pretrained(model) pipeline = transformers.pipeline( "text-generation", model=model, torch_dtype=torch.float16, device_map="auto", ) prompt = "What are the values successful unfastened root projects?" formatted_prompt = ( f"### Human: {prompt}### Assistant:" ) sequences = pipeline( formatted_prompt, do_sample=True, top_k=50, top_p = 0.7, num_return_sequences=1, repetition_penalty=1.1, max_new_tokens=500, ) for seq in sequences: print(f"Result: {seq['generated_text']}")

Results

We tested the model to understand its capabilities, and we can conclude that it works well for general Q&A but is not suitable for calculations. This makes sense, as these models are primarily designed for natural language understanding and generation tasks.


Understanding the Model’s Language Understanding and Problem Solving Capabilities

TinyLlama's problem-solving abilities have been evaluated using the InstructEval benchmark, which comprises several tasks. In the Massive Multitask Language Understanding (MMLU) task, the model's world knowledge and problem-solving capabilities are tested across various subjects in a 5-shot setting. The BIG-Bench Hard (BBH) task, a subset of 23 challenging tasks from BIG-Bench, evaluates the model's ability to follow complex instructions in a 3-shot setting. The Discrete Reasoning Over Paragraphs (DROP) task focuses on measuring the model's math reasoning abilities in a 3-shot setting. Additionally, the HumanEval task assesses the model's programming capabilities in a zero-shot setting. This diverse set of tasks provides a comprehensive evaluation of TinyLlama's problem-solving and language understanding skills.
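To make the "n-shot" terminology concrete: a few-shot prompt simply prepends worked examples to the actual question before generation. The sketch below builds a toy 3-shot prompt and reuses the pipeline from the code demo; the example questions are invented for illustration and are not drawn from MMLU, BBH, or DROP.

# Toy illustration of a 3-shot prompt: three worked examples precede the real question.
# Real benchmarks draw the examples from the task's own data; these are made up.
examples = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
    ("What gas do plants absorb from the air?", "Carbon dioxide"),
]
question = "What is the largest planet in the solar system?"

few_shot_prompt = ""
for q, a in examples:
    few_shot_prompt += f"Question: {q}\nAnswer: {a}\n\n"
few_shot_prompt += f"Question: {question}\nAnswer:"

# Reuses the `pipeline` object created in the code demo above
result = pipeline(few_shot_prompt, max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])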

Conclusion

In this article we introduced TinyLlama, an open-source small language model that takes a fresh approach in the world of LLMs. We are grateful that these models are open sourced to the community.
TinyLlama, with its compact design and impressive performance, has the potential to support end-user applications on mobile devices and serve as a lightweight platform for experimenting with innovative language model ideas.

I hope you enjoyed the article and the Gradio demo!

Thank you for reading!
