Cappy: Outperforming and boosting large multi-task language models with a small scorer

Large language model (LLM) advancements have led to a new paradigm that unifies various natural language processing (NLP) tasks within an instruction-following framework. This paradigm is exemplified by recent multi-task LLMs, such as T0, FLAN, and OPT-IML. First, multi-task data is gathered with each task following a task-specific template, where each labeled example is converted into an instruction (e.g., "Put the concepts together to form a sentence: ski, mountain, skier") paired with a corresponding response (e.g., "Skier skis down the mountain"). These instruction-response pairs are used to train the LLM, resulting in a conditional generation model that takes an instruction as input and generates a response. Moreover, multi-task LLMs have exhibited remarkable task-wise generalization capabilities, as they can address unseen tasks by understanding and solving brand-new instructions.
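For illustration, here is a minimal sketch of how a labeled example might be converted into an instruction-response pair with a task-specific template; the apply_template helper and its field names are hypothetical, not part of PromptSource or FLAN.

```python
# Minimal sketch (not the PromptSource library itself): converting a labeled
# example into an instruction-response pair with a task-specific template.

def apply_template(example: dict) -> dict:
    """Turn a concepts-to-sentence example into an instruction-response pair."""
    instruction = (
        "Put the concepts together to form a sentence: "
        + ", ".join(example["concepts"])
    )
    return {"instruction": instruction, "response": example["target"]}

pair = apply_template({"concepts": ["ski", "mountain", "skier"],
                       "target": "Skier skis down the mountain"})
print(pair["instruction"])  # Put the concepts together to form a sentence: ski, mountain, skier
print(pair["response"])     # Skier skis down the mountain
```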

An illustration of the instruction-following pre-training of multi-task LLMs, e.g., FLAN. Pre-training on tasks under this paradigm improves performance on unseen tasks.

Due to the complexity of understanding and solving various tasks solely using instructions, the size of multi-task LLMs typically spans from several billion parameters to hundreds of billions (e.g., FLAN-11B, T0-11B, and OPT-IML-175B). As a result, operating such sizable models poses significant challenges: they demand considerable computational power and impose substantial requirements on the memory capacities of GPUs and TPUs, making their training and inference expensive and inefficient. Extensive storage is required to maintain a unique LLM copy for each downstream task. Moreover, the most powerful multi-task LLMs (e.g., FLAN-PaLM-540B) are closed-sourced, making them impossible to adapt. However, in practical applications, harnessing a single multi-task LLM to manage all possible tasks in a zero-shot manner remains difficult, particularly when dealing with complex tasks, personalized tasks, and those that cannot be succinctly defined using instructions. On the other hand, the size of downstream training data is usually insufficient to train a model well without incorporating rich prior knowledge. Hence, it is long desired to adapt LLMs with downstream supervision while bypassing storage, memory, and access issues.

Certain parameter-efficient tuning strategies, including prompt tuning and adapters, substantially diminish storage requirements, but they still perform back-propagation through the LLM parameters during the tuning process, thereby keeping their memory demands high. Additionally, some in-context learning techniques circumvent parameter tuning by integrating a limited number of supervised examples into the instruction. However, these techniques are constrained by the model's maximum input length, which permits only a few samples to guide task resolution.

In “Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer”, presented at NeurIPS 2023, we propose a novel approach that enhances the performance and efficiency of multi-task LLMs. We introduce a lightweight pre-trained scorer, Cappy, based on continual pre-training on top of RoBERTa with merely 360 million parameters. Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction. Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy efficiently enables downstream supervision without requiring any LLM finetuning, which avoids the need for back-propagation through LLM parameters and reduces memory requirements. Finally, adaptation with Cappy doesn’t require access to LLM parameters, as it is compatible with closed-source multi-task LLMs, such as those only accessible via WebAPIs.

Cappy takes an instruction and response pair as input and outputs a score ranging from 0 to 1, indicating an estimation of the correctness of the response with respect to the instruction.
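To make the interface concrete, the sketch below shows how such a scorer could be wrapped as a scoring function using Hugging Face Transformers; the checkpoint path is a placeholder rather than an official release name, and the exact head configuration is an assumption.

```python
# Sketch of the scoring interface, assuming a RoBERTa-style regression
# checkpoint with a single output logit; the checkpoint path below is a
# placeholder, not an official release name.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/cappy-like-scorer"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def score(instruction: str, response: str) -> float:
    """Return an estimated correctness score (trained to lie between 0 and 1)."""
    inputs = tokenizer(instruction, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```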


Pre-training

We begin with the same dataset collection, which includes 39 diverse datasets from PromptSource that were used to train T0. This collection encompasses a wide range of task types, such as question answering, sentiment analysis, and summarization. Each dataset is associated with one or more templates that convert each instance from the original datasets into an instruction paired with its ground truth response.

Cappy's regression modeling requires each pre-training data instance to consist of an instruction-response pair along with a correctness annotation for the response, so we produce a dataset with correctness annotations that range from 0 to 1. For each instance within a generation task, we leverage an existing multi-task LLM to generate multiple responses by sampling, conditioned on the given instruction. Subsequently, we assign an annotation to the pair formed by the instruction and each response, using the similarity between the response and the ground truth response of the instance. Specifically, we employ Rouge-L, a commonly used metric for measuring overall multi-task performance that has demonstrated a strong alignment with human evaluation, to calculate this similarity as a form of weak supervision.
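The sketch below illustrates this weak-supervision step with the rouge_score package; llm_generate is a hypothetical stand-in for sampling from an existing multi-task LLM, and including the ground truth itself as one of the scored responses is our assumption rather than a detail from the text.

```python
# Sketch of the weak-supervision labeling step: sample candidate responses
# from a multi-task LLM and label each (instruction, response) pair with its
# Rouge-L similarity to the ground truth response.
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def annotate(instruction: str, ground_truth: str, llm_generate, num_samples: int = 4):
    """Build (instruction, response, score) triples with Rouge-L as weak labels."""
    candidates = llm_generate(instruction, num_samples)  # sampled LLM responses
    examples = []
    for response in candidates + [ground_truth]:  # assumption: ground truth included, scoring ~1.0
        score = _rouge.score(ground_truth, response)["rougeL"].fmeasure  # in [0, 1]
        examples.append({"instruction": instruction, "response": response, "score": score})
    return examples
```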

As a result, we obtain an effective regression dataset of 160 million instances paired with correctness score annotations. The final Cappy model is the result of continuous pre-training with this regression dataset on top of the RoBERTa model. The pre-training of Cappy is conducted on Google's TPU-v4, with RedCoast, a lightweight toolkit for automating distributed training.
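A minimal sketch of such a regression objective on top of RoBERTa, using Hugging Face Transformers rather than the actual TPU/RedCoast training setup; the hyperparameters and training-loop structure are illustrative only.

```python
# Sketch of continued pre-training of RoBERTa as a correctness regressor,
# assuming batches of weakly labeled triples from the previous step.
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=1, problem_type="regression"
)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(batch):
    """One optimization step: MSE between predicted and annotated correctness."""
    enc = tokenizer(batch["instruction"], batch["response"],
                    padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(batch["score"], dtype=torch.float).unsqueeze(-1)
    loss = model(**enc, labels=labels).loss  # MSE loss under problem_type="regression"
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```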

Data augmentation with a multi-task LLM to construct a weakly supervised regression dataset for Cappy’s pre-training and fine-tuning.


Applying Cappy

Cappy solves practical tasks within a candidate-selection mechanism. More specifically, given an instruction and a set of candidate responses, Cappy produces a score for each candidate response. This is achieved by inputting the instruction alongside each individual response, and then assigning the response with the highest score as its prediction. In classification tasks, all candidate responses are inherently predefined. For example, for an instruction of a sentiment classification task (e.g., “Based on this review, would the user recommend this product?: ‘Stunning even for the non-gamer.’”), the candidate responses are “Yes” or “No”. In such scenarios, Cappy functions independently. On the other hand, in generation tasks, candidate responses are not pre-defined, requiring an existing multi-task LLM to output the candidate responses. In this case, Cappy serves as an auxiliary component of the multi-task LLM, enhancing its decoding.
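A minimal sketch of the candidate-selection mechanism, reusing the hypothetical score helper from the earlier snippet; for classification tasks the candidates are the predefined label strings, while for generation tasks they would be sampled from an LLM.

```python
# Sketch of candidate selection: score each (instruction, candidate) pair and
# return the highest-scoring candidate as the prediction.
def select_response(instruction: str, candidates: list[str], score) -> str:
    return max(candidates, key=lambda response: score(instruction, response))

# Classification example: the candidates are the predefined label strings.
# prediction = select_response(
#     "Based on this review, would the user recommend this product?: "
#     "'Stunning even for the non-gamer.'",
#     ["Yes", "No"],
#     score,
# )
```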


Adapting multi-task LLMs with Cappy

When downstream training data is available, Cappy enables effective and efficient adaptation of multi-task LLMs on downstream tasks. Specifically, we fine-tune Cappy to integrate downstream task information into LLM predictions. This process involves creating a separate regression dataset specific to the downstream training data with the same data annotation process used to construct the pre-training data. As a result, the fine-tuned Cappy collaborates with a multi-task LLM, boosting the LLM's performance on the downstream task.
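Putting the pieces together, the sketch below outlines this adaptation loop using the hypothetical helpers from the earlier snippets (annotate, train_step, select_response, score, llm_generate); it illustrates the workflow, not the authors' training code.

```python
# Sketch of downstream adaptation: the frozen LLM only supplies candidate
# samples, so no gradients flow through its parameters.
def collate(examples):
    """Turn a list of example dicts into the dict-of-lists batch train_step expects."""
    return {key: [ex[key] for ex in examples] for key in ("instruction", "response", "score")}

def adapt_and_predict(train_set, test_instruction, llm_generate, epochs=3, batch_size=32):
    # 1) Build a downstream regression dataset with the same weak-label recipe.
    downstream = []
    for ex in train_set:
        downstream += annotate(ex["instruction"], ex["response"], llm_generate)
    # 2) Fine-tune the small scorer on it; the LLM itself stays untouched.
    for _ in range(epochs):
        for i in range(0, len(downstream), batch_size):
            train_step(collate(downstream[i:i + batch_size]))
    # 3) At inference, rerank the LLM's sampled candidates with the tuned scorer.
    candidates = llm_generate(test_instruction, 4)
    return select_response(test_instruction, candidates, score)
```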

In contrast to other LLM tuning strategies, adapting LLMs with Cappy significantly reduces the high demand for device memory, as it avoids the need for back-propagation through LLM parameters for downstream tasks. Moreover, Cappy adaptation does not rely on access to LLM parameters, making it compatible with closed-source multi-task LLMs, such as the ones only accessible via WebAPIs. Compared with in-context learning approaches, which circumvent model tuning by attaching training examples to the instruction prefix, Cappy is not restricted by the LLM's maximum input length. Thus, Cappy can incorporate an unlimited number of downstream training examples. Cappy can also be applied together with other adaptation methods, such as fine-tuning and in-context learning, further boosting their overall performance.

Downstream adaptation comparison between Cappy and approaches that rely on an LLM’s parameters, such as fine-tuning and prompt tuning. Cappy’s application enhances multi-task LLMs.


Results

We evaluate Cappy’s performance across eleven held-out language understanding classification tasks from PromptSource. We show that Cappy, with 360M parameters, outperforms OPT-175B and OPT-IML-30B, and matches the accuracy of the best existing multi-task LLMs (T0-11B and OPT-IML-175B). These findings highlight Cappy’s capabilities and parameter efficiency, which can be credited to its scoring-based pre-training strategy that integrates contrastive information by differentiating between high-quality and low-quality responses. In contrast, previous multi-task LLMs depend exclusively on teacher-forcing training that utilizes only the ground truth responses.

The overall accuracy averaged over eleven test tasks from PromptSource. “RM” refers to a pre-trained RLHF reward model. Cappy matches the best ones among existing multi-task LLMs.

We also examine the adaptation of multi-task LLMs with Cappy on complex tasks from BIG-Bench, a set of manually curated tasks that are considered beyond the capability of many LLMs. We focus on all 45 generation BIG-Bench tasks, specifically those that do not offer pre-established answer choices. We evaluate the performance using the Rouge-L score (representing the overall similarity between model generations and corresponding ground truths) on every test set, reporting the average score across the 45 tests. In this experiment, all variants of FLAN-T5 serve as the backbone LLMs, and the foundational FLAN-T5 models are frozen. These results, shown below, suggest that Cappy enhances the performance of FLAN-T5 models by a large margin, consistently outperforming the most effective baseline achieved through sample selection using self-scoring of the LLM itself.

The averaged Rouge-L score over the 45 complex tasks within BIG-Bench. The x-axis refers to FLAN-T5 models of different sizes. Every dashed line represents an approach working on FLAN-T5s. Self-scoring refers to using the cross-entropy of the LLM to select responses. Cappy enhances the performance of FLAN-T5 models by a large margin.
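For reference, here is a minimal sketch of a self-scoring baseline of this kind: candidates are ranked by their cross-entropy under the backbone LLM itself, shown with FLAN-T5 via Hugging Face Transformers; the model size and the way candidates are supplied are illustrative assumptions.

```python
# Sketch of the self-scoring baseline: pick the candidate with the lowest
# mean token cross-entropy under the backbone seq2seq LLM.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
lm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").eval()

def self_score_select(instruction: str, candidates: list[str]) -> str:
    losses = []
    for cand in candidates:
        enc = tok(instruction, return_tensors="pt")
        labels = tok(cand, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(**enc, labels=labels).loss  # mean token cross-entropy
        losses.append(loss.item())
    return candidates[losses.index(min(losses))]  # lowest cross-entropy wins
```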


Conclusion

We present Cappy, a novel approach that enhances the performance and efficiency of multi-task LLMs. In our experiments, we adapt a single LLM to several domains with Cappy. In the future, Cappy as a pre-trained model can potentially be used in other creative ways beyond working with single LLMs.


Acknowledgments

Thanks to Bowen Tan, Jindong Chen, Lei Meng, Abhanshu Sharma and Ewa Dominowska for their valuable feedback. We would also like to thank Eric Xing and Zhiting Hu for their suggestions.
