

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.26. URL https://aclanthology.org/2022.acl-long.26.

Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally-guided Bayesian program induction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7aa685b3b1dc1d6780bf36f7340078c9-Paper.pdf.

Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 835–850, 2021.

Tianyu Gao. Prompting: Better ways of using language models for NLP tasks. The Gradient, 2021.

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.

Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.

Or Honovich, Uri Shaham, Samuel R Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.

Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pp. 1219–1231, 2022.

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.

Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. ACM SIGPLAN Notices, 53(4):436–449, 2018.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814, 2022.

Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical bayesian approach. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 639–646. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/568.pdf.


Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.

Dougal Maclaurin and Ryan Prescott Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. A machine learning framework for programming by example. In International Conference on Machine Learning, pp. 187–195. PMLR, 2013.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070, 2021.

Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5203–5212, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413, 2016.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.

Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269, 2021.


Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP), 2020.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.

Catherine Wong, Kevin M Ellis, Joshua Tenenbaum, and Jacob Andreas. Leveraging language to learn program abstractions and search heuristics. In International Conference on Machine Learning, pp. 11193–11204. PMLR, 2021.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.


A PROMPT ENGINEERING IN THE WILD

Large models with natural language interfaces, including models for text generation and image synthesis, have seen increasing public usage in recent years. Because finding the right prompt can be difficult for humans, a number of guides on prompt engineering, as well as tools to aid prompt discovery, have been developed. See, for example:

https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/

https://techcrunch.com/2022/07/29/a-startup-is-charging-1-99-for-strings-of-text-to-feed-to-dall-e-2/

https://news.ycombinator.com/item?id=32943224

https://promptomania.com/stable-diffusion-prompt-builder/

https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion

In this paper we apply APE to generate effective instructions for steering LLMs, but the general framework (Algorithm 1) could be applied to steer other models with natural language interfaces, so long as an appropriate proposal method and scoring function can be designed.
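To make that propose-and-score loop concrete, here is a minimal Python sketch. The `propose_fn` and `score_fn` callables are hypothetical stand-ins for the proposal method and scoring function discussed above; this is an illustration of the general framework, not the paper's released implementation.

```python
# Minimal sketch of the propose-score-select loop (cf. Algorithm 1).
# `propose_fn` and `score_fn` are hypothetical stand-ins for the proposal
# method and scoring function, not the paper's actual code.
from typing import Callable, List, Sequence, Tuple

def ape_search(
    demos: Sequence[Tuple[str, str]],                 # (input, output) demonstrations
    propose_fn: Callable[[Sequence[Tuple[str, str]], int], List[str]],
    score_fn: Callable[[str, Sequence[Tuple[str, str]]], float],
    n_candidates: int = 50,
    keep_top_k: int = 5,
) -> List[str]:
    """Propose candidate instructions from demonstrations, score each one,
    and keep the highest-scoring candidates."""
    candidates = propose_fn(demos, n_candidates)      # e.g. instructions sampled from an LLM
    ranked = sorted(candidates, key=lambda c: score_fn(c, demos), reverse=True)
    return ranked[:keep_top_k]
```

Any model with a natural language interface can be plugged in by supplying a different `propose_fn` (how candidates are generated) and `score_fn` (how a candidate is evaluated on held-out data).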


B IMPLEMENTATION DETAILS

Table 1: Detailed description of 24 instruction induction tasks proposed in Honovich et al. (2022). For convenience, the original table from Honovich et al. (2022) is duplicated here.

| Category | Task | Instruction | Demonstration |
|---|---|---|---|
| Spelling | First Letter | Extract the first letter of the input word. | cat → c |
| Spelling | Second Letter | Extract the second letter of the input word. | cat → a |
| Spelling | List Letters | Break the input word into letters, separated by spaces. | cat → c a t |
| Spelling | Starting With | Extract the words starting with a given letter from the input sentence. | The man whose car I hit last week sued me. [m] → man, me |
| Morphosyntax | Pluralization | Convert the input word to its plural form. | cat → cats |
| Morphosyntax | Passivization | Write the input sentence in passive form. | The artist introduced the scientist. → The scientist was introduced by the artist. |
| Syntax | Negation | Negate the input sentence. | Time is finite → Time is not finite. |
| Lexical Semantics | Antonyms | Write a word that means the opposite of the input word. | won → lost |
| Lexical Semantics | Synonyms | Write a word with a similar meaning to the input word. | alleged → supposed |
| Lexical Semantics | Membership | Write all the animals that appear in the given list. | cat, helicopter, cook, whale, frog, lion → frog, cat, lion, whale |
| Phonetics | Rhymes | Write a word that rhymes with the input word. | sing → ring |
| Knowledge | Larger Animal | Write the larger of the two given animals. | koala, snail → koala |
| Semantics | Cause Selection | Find which of the two given cause and effect sentences is the cause. | Sentence 1: The soda went flat. Sentence 2: The bottle was left open. → The bottle was left open. |
| Semantics | Common Concept | Find a common characteristic for the given objects. | guitars, pendulums, neutrinos → involve oscillations. |
| Style | Formality | Rephrase the sentence in formal language. | Please call once you get there → Please call upon your arrival. |
| Numerical | Sum | Sum the two given numbers. | 22 10 → 32 |
| Numerical | Difference | Subtract the second number from the first. | 32 22 → 10 |
| Numerical | Number to Word | Write the number in English words. | 26 → twenty-six |
| Multilingual | Translation | Translate the word into German / Spanish / French. | game → juego |
| GLUE | Sentiment Analysis | Determine whether a movie review is positive or negative. | The film is small in scope, yet perfectly formed. → positive |
| GLUE | Sentence Similarity | Rate the semantic similarity of two input sentences on a scale of 0 - definitely not to 5 - perfectly. | Sentence 1: A man is smoking. Sentence 2: A man is skating. → 0 - definitely not |
| GLUE | Word in Context | Determine whether an input word has the same meaning in the two input sentences. | Sentence 1: Approach a task. Sentence 2: To approach the city. Word: approach → not the same |


Table 2: Detailed description of BIG-Bench Instruction Induction (BBII), a clean and tractable subset of 21 tasks that have a clear, human-written instruction applicable to all examples in the dataset.

| Name | Description | Keywords |
|---|---|---|
| causal judgment | Answer questions about causal attribution | causal reasoning, common sense, multiple choice, reading comprehension, social reasoning |
| disambiguation qa | Clarify the meaning of sentences with ambiguous pronouns | common sense, gender bias, many-shot, multiple choice |
| dyck languages | Correctly close a Dyck-n word | algebra, arithmetic, logical reasoning, multiple choice |
| epistemic reasoning | Determine whether one sentence entails the next | common sense, logical reasoning, multiple choice, social reasoning, theory of mind |
| gender inclusive sentences german | Given a German-language sentence that does not use gender-inclusive forms, transform it to gender-inclusive forms | free response, grammar, inclusion, non-English, paraphrase |
| implicatures | Predict whether Speaker 2's answer to Speaker 1 counts as a yes or as a no | contextual question-answering, multiple choice, reading comprehension, social reasoning, theory of mind |
| linguistics puzzles | Solve Rosetta Stone-style linguistics puzzles | free response, human-like behavior, linguistics, logical reasoning, reading comprehension |
| logical fallacy detection | Detect informal and formal logical fallacies | logical reasoning, multiple choice |
| movie recommendation | Recommend movies similar to the given list of movies | emotional intelligence, multiple choice |
| navigate | Given a series of navigation instructions, determine whether one would end up back at the starting point | arithmetic, logical reasoning, mathematics, multiple choice |
| object counting | Questions that involve enumerating objects of different types and asking the model to count them | free response, logical reasoning |
| operators | Given a mathematical operator definition in natural language, apply it | free response, mathematics, numerical response |
| presuppositions as nli | Determine whether the first sentence entails or contradicts the second | common sense, logical reasoning, multiple choice |
| question selection | Given a short answer along with its context, select the most appropriate question for the given short answer | multiple choice, paraphrase, reading comprehension, summarization |
| ruin names | Select the humorous edit that 'ruins' the input movie or musical artist name | emotional understanding, multiple choice |
| snarks | Determine which of two sentences is sarcastic | emotional understanding, humor, multiple choice |
| sports understanding | Determine whether an artificially constructed sentence relating to sports is plausible or implausible | common sense, context-free question answering, domain specific, multiple choice |
| tense | Modify the tense of a given sentence | free response, paraphrase, syntax |
| winowhy | Evaluate the reasoning in answering Winograd Schema Challenge questions | causal reasoning, common sense, multiple choice, social reasoning |
| word sorting | Sort a list of words | algorithms, free response |
| word unscrambling | Unscramble the given letters to form an English word | free response, implicit reasoning, tokenization |


B.1 BIG-BENCH INSTRUCTION INDUCTION (BBII) SELECTION PROCESS

Step 1: BIG-Bench contains a large number of evaluation tasks of varying quality. For example, some tasks have only the minimum number of examples needed to qualify for submission, while others lack an appropriate human baseline. We therefore follow Suzgun et al. (2022) and derive a clean and tractable subset based on the following criteria.

Table 3: Filtering criteria used to create the BIG-Bench Instruction Induction (BBII) subset.

| # Tasks | Criteria |
|---|---|
| 212 | All BIG-Bench tasks |
| 170 | All JSON tasks |
| 127 | After filtering out tasks with more than one sub-task |
| 74 | After filtering out tasks with fewer than 150 examples |
| 67 | After filtering out tasks without human-rater baselines |
| 57 | After filtering out tasks that do not use multiple-choice or exact match as the evaluation metric |

Criteria: JSON Tasks.

Discarded tasks: abstraction and reasoning corpus, bbq lite, bias from probabilities, boolean expressions, com2sense, context definition alignment, convinceme, coqa conversational question answering, cycled letters, diverse social bias, dynamic counting, factuality of summary, forecasting subquestions, gender sensitivity chinese, gender sensitivity english, high low game, long context integration, multistep arithmetic, muslim violence bias, program synthesis, protein interacting sites, python programming challenge, question answer creation, roots optimization and games, self awareness, self evaluation courtroom, self evaluation tutoring, simple arithmetic, spelling bee, squad shifts, subject verb agreement, sudoku, taboo, talkdown, text navigation game, training on test set, truthful qa, twenty questions, unqover, web of lies, word problems on sets and graphs, yes no black white.

Criteria: Tasks without sub-tasks.

Discarded tasks: abstract narrative understanding, arithmetic, authorship verification, bbq lite json, cause and effect, chess state tracking, cifar10 classification, color, conceptual combinations, conlang translation, cs algorithms, elementary math qa, fact checker, gem, goal step wikihow, hhh alignment, indic cause and effect, intersect geometry, kanji ascii, key value maps, language games, linguistic mappings, list functions, logical deduction, metaphor understanding, minute mysteries qa, modified arithmetic, mult data wrangling, multiemo, natural instructions, periodic elements, physics, real or fake text, simp turing concept, simple arithmetic json subtasks, simple ethical questions, strange stories, symbol interpretation, tracking shuffled objects, undo permutation, unit conversion, unit interpretation, unnatural in context learning.

Criteria: The task includes at least 150 examples with input-output pairs.

Discarded tasks: analytic entailment, auto debugging, code line description, codenames, common morpheme, crash blossom, crass ai, cryobiology spanish, dark humor detection, emoji movie, emojis emotion prediction, empirical judgments, english proverbs, english russian proverbs, entailed polarity, entailed polarity hindi, evaluating information essentiality, figure of speech detection, general knowledge, gre reading comprehension, human organs senses, identify math theorems, identify odd metaphor, implicit relations, international phonetic alphabet nli, irony identification, known unknowns, logical args, logical sequence, mathematical induction, misconceptions russian, nonsense words grammar, novel concepts, odd one out, penguins in a table, persian idioms, phrase relatedness, physical intuition, physics questions, repeat copy logic, rephrase, riddle sense, scientific press release, sentence ambiguity, similarities abstraction, simple arithmetic json, simple arithmetic json multiple choice, simple arithmetic multiple targets json, simple text editing, sufficient information, suicide risk, swedish to german proverbs, what is the tao.

Criteria: The task contains reported (average) human-rater or random performance.

Discarded tasks: contextual parametric knowledge conflicts, hinglish toxicity, medical questions russian, parsinlu qa, swahili english proverbs, tellmewhy, which wiki edit.

Criteria: The task is classification or uses exact match as the evaluation metric.

Discarded tasks: auto categorization, few shot nlg, hindi question answering, international phonetic alphabet transliterate, polish sequence labeling, qa wikidata, semantic parsing in context sparc, semantic parsing spider, social support, topical chat.
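For illustration, the filtering steps of Table 3 can be expressed as a small Python pipeline over task metadata. The field names below (`is_json`, `subtasks`, `num_examples`, `human_baseline`, `metrics`) are assumptions made for the sketch, not the actual BIG-Bench JSON schema.

```python
# Sketch of the Table 3 filtering steps over BIG-Bench task metadata.
# Field names are illustrative assumptions, not the real BIG-Bench schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskMeta:
    name: str
    is_json: bool = True
    subtasks: List[str] = field(default_factory=list)
    num_examples: int = 0
    human_baseline: Optional[float] = None
    metrics: List[str] = field(default_factory=list)

def bbii_filter(tasks: List[TaskMeta]) -> List[TaskMeta]:
    """Apply the filtering criteria from Table 3 in order (212 -> 57 tasks)."""
    tasks = [t for t in tasks if t.is_json]                           # keep JSON tasks
    tasks = [t for t in tasks if len(t.subtasks) <= 1]                # drop multi-sub-task tasks
    tasks = [t for t in tasks if t.num_examples >= 150]               # need at least 150 examples
    tasks = [t for t in tasks if t.human_baseline is not None]        # need a human-rater baseline
    tasks = [t for t in tasks
             if {"multiple_choice", "exact_match"} & set(t.metrics)]  # classification / exact match
    return tasks
```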


Step 2: We do a manual inspection to divide the remaining tasks into the following three categories. In particular, the BIG-Bench Instruction Induction (BBII) subset is the subset we used to evaluate APE in Section 4.2.

BBII Subset: A subset of BIG-Bench tasks that satisfy the instruction induction format: each example in the dataset can be expressed as a question-answer pair, all examples focus on the same question that can be clearly described by a human instruction, and a human instruction is available in the task JSON file.

Invalid Format: Tasks that do not match the instruction induction format: each example in the dataset asks a different question, or a clear human instruction is not available.

Out of Scope: Tasks that are outside the scope of this work: not solvable by the authors within 60 minutes, or requiring specialized knowledge.

Table 4: Manual categorization of the remaining BIG-Bench tasks into the BBII subset, tasks with an invalid format, and out-of-scope tasks.

| Category | # Tasks | Task Names |
|---|---|---|
| BBII Subset | 21 | causal judgment, disambiguation qa, dyck language, epistemic reasoning, gender inclusive sentences german, implicatures, linguistics puzzles, logical fallacy detection, movie recommendation, navigate, object counting, operators, presuppositions as nli, question selection, ruin names, snarks, sports understanding, tense, winowhy, word sorting, word unscrambling |
| Invalid Format | 21 | anachronisms, analogical similarity, bridging anaphora resolution barqa, data understanding, disfl qa, fantasy reasoning, formal fallacies syllogisms negation, hindu knowledge, hyperbaton, intent recognition, logic grid puzzle, paragraph segmentation, play dialog same or different, reasoning about colored objects, salient translation error detection, social iqa, strategyqa, temporal sequences, timedial, understanding fables, vitaminc fact verification |
| Out of Scope | 13 | ascii word recognition, checkmate in one, chinese remainder theorem, cryptonite, discourse marker prediction, geometric shapes, kannada, language identification, matrixshapes, mnist ascii, moral permissibility, movie dialog same or different, parsinlu reading comprehension |


Table 5: Raw templates used for model prompting in our experiments

| Usage | Template |
|---|---|
| Zero-shot Evaluation | Instruction: [INSTRUCTION]\n\nInput: [INPUT]\nOutput: <COMPLETE> |
| Few-shot Evaluation | Instruction: [INSTRUCTION]\n\nInput: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...\n\nInput: [INPUT]\nOutput: <COMPLETE> |
| Forward Generation | I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:\n\nInput: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...\n\nThe instruction was <COMPLETE> |
| Reverse Generation 1 | I instructed my friend to <INSERT>. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:\n\nInput: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ... |
| Reverse Generation 2 | Professor Smith was given the following instructions: <INSERT>\nHere are the Professor's responses:\n\nQ: [INPUT]\nA: [OUTPUT]\n\nQ: [INPUT]\nA: [OUTPUT] ... |
| Resample Instruction | Generate a variation of the following instruction while keeping the semantic meaning.\n\nInput: [INSTRUCTION]\nOutput: <COMPLETE> |
| Zero-shot-CoT | Instruction: Answer the following question.\n\nQ: [INPUT]\nA: Let's <INSERT>. [OUTPUT] |
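As a usage illustration, templates like these can be instantiated with simple string formatting. The helper functions below sketch how the zero-shot evaluation and forward-generation templates might be filled; they are assumptions for illustration, not the paper's released code.

```python
# Illustrative filling of two Table 5 templates; a sketch, not the paper's code.
from typing import Sequence, Tuple

def fill_zero_shot(instruction: str, test_input: str) -> str:
    """Zero-shot evaluation prompt: the model completes the text after 'Output:'."""
    return f"Instruction: {instruction}\n\nInput: {test_input}\nOutput:"

def fill_forward_generation(pairs: Sequence[Tuple[str, str]]) -> str:
    """Forward-generation prompt: the model completes 'The instruction was ...'."""
    demos = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)
    return ("I gave a friend an instruction and five inputs. The friend read the "
            "instruction and wrote an output for every one of the inputs.\n"
            "Here are the input-output pairs:\n\n"
            f"{demos}\n\nThe instruction was")

# Example usage:
print(fill_zero_shot("Write a word that rhymes with the input word.", "sing"))
```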


C ADDITIONAL RESULTS

C.1 INSTRUCTION INDUCTION

Few-shot In-context Learning We evaluated APE-generated instructions in the few-shot in-context learning setting, where we insert the instruction before the in-context demonstrations. These instructions are selected based on zero-shot execution accuracy, and we denote this setting as "Instruction + In-context" in Figure 8. As shown in Figure 8, adding an instruction achieves comparable or better test performance than standard in-context learning on 21 of 24 tasks. Counterintuitively, adding in-context examples for Rhymes, Large Animal, and Second Letters hurts model performance. We conjecture that this may be because the selected instructions overfit the zero-shot scenario and thus do not perform well in the few-shot case. We therefore also experimented with using few-shot execution accuracy as the selection metric. Figure 14 shows that the few-shot metric performs comparably to or slightly better than the zero-shot metric on all tasks except Rhymes. To build an intuitive understanding of what is happening, we provide a qualitative analysis below.
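The two selection metrics differ only in whether demonstrations are included in the prompt used for scoring. The sketch below illustrates that difference; `generate` stands in for an LLM call, and exact-match scoring is a simplification of the per-task metrics actually used.

```python
# Sketch of zero-shot vs. few-shot execution accuracy as instruction-selection
# metrics. `generate` is a stand-in for an LLM call; exact match simplifies
# the per-task scoring used in the paper.
from typing import Callable, Sequence, Tuple

def execution_accuracy(
    instruction: str,
    eval_set: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
    demos: Sequence[Tuple[str, str]] = (),
) -> float:
    """Fraction of eval examples answered correctly; pass `demos` to score the
    instruction in the few-shot (instruction + in-context) setting."""
    context = "".join(f"Input: {dx}\nOutput: {dy}\n\n" for dx, dy in demos)
    correct = 0
    for x, y in eval_set:
        prompt = f"Instruction: {instruction}\n\n{context}Input: {x}\nOutput:"
        correct += generate(prompt).strip().lower() == y.strip().lower()
    return correct / len(eval_set)
```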

Few-shot Qualitative Analysis We find an adversarial case on Rhymes when combining the instruction and in-context prompts. Table 8 shows that 4 of 5 filtered instructions ask to echo the input word. These proposals effectively hack the evaluation with near-perfect test accuracy, as every word rhymes with itself. However, adding in-context examples for these instructions creates a misalignment between instruction (induces trivial rhymes) and context (induces non-trivial rhymes), resulting in a significant drop in performance. If we instead score the instructions based on the few-shot metric, this performance drop can be alleviated since the model can choose a more aligned instruction.

[Figure 8: grouped bars of execution accuracy (y-axis, 0–1) per task, comparing Instruction Only, In-context Only, and Instruction + In-context across the 24 instruction induction tasks.]

Figure 8: Few-shot in-context test accuracy on 24 Instruction Induction tasks. APE improves the few-shot in-context learning performance on 21 out of 24 tasks.
