Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.26. URL https://aclanthology.org/2022.acl-long.26.
Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally-guided Bayesian program induction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7aa685b3b1dc1d6780bf36f7340078c9-Paper.pdf.
Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 835–850, 2021.
Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The Gradient, 2021.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.
Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.
Or Honovich, Uri Shaham, Samuel R Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.
Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pp. 1219–1231, 2022.
Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. ACM SIGPLAN Notices, 53(4):436–449, 2018.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022.
Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical bayesian approach. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 639–646. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/568.pdf.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021.
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.
Dougal Maclaurin and Ryan Prescott Adams. Firefly monte carlo: Exact mcmc with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. A machine learning framework for programming by example. In International Conference on Machine Learning, pp. 187–195. PMLR, 2013.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070, 2021.
Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5203–5212, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413, 2016.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.
Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269, 2021.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
Catherine Wong, Kevin M Ellis, Joshua Tenenbaum, and Jacob Andreas. Leveraging language to learn program abstractions and search heuristics. In International Conference on Machine Learning, pp. 11193–11204. PMLR, 2021.
Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
A PROMPT ENGINEERING IN THE WILD
Large models with natural language interfaces, including models for text generation and image synthesis, have seen an increasing amount of public usage in recent years. As finding the right prompt can be difficult for humans, a number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. Among others, see, for example:
• https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/
• https://techcrunch.com/2022/07/29/a-startup-is-charging-1-99-for-strings-of-text-to-feed-to-dall-e-2/
• https://news.ycombinator.com/item?id=32943224
• https://promptomania.com/stable-diffusion-prompt-builder/
• https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion
In this paper we apply APE to generate effective instructions for steering LLMs, but the general framework (Algorithm 1) could be applied to steer other models with natural language interfaces, so long as an appropriate proposal method and scoring function can be designed.
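As a concrete illustration, below is a minimal sketch of this generic propose-and-select loop. The function names (`propose`, `score`) and their signatures are placeholders for whatever proposal method and scoring function a user supplies; this is not the exact implementation used in our experiments.

```python
# Minimal sketch of the generic propose-and-select loop behind Algorithm 1.
# `propose` and `score` are user-supplied placeholders for any model with a
# natural language interface and any task-appropriate scoring function.

def ape_search(propose, score, demos, n_candidates=50, n_eval=20):
    """Return the candidate prompt with the highest estimated score.

    propose(demos, n)      -> list of n candidate prompts (e.g. sampled from an LLM)
    score(prompt, example) -> scalar quality estimate on one held-out example
    demos                  -> list of (input, output) demonstration pairs
    """
    candidates = propose(demos, n_candidates)
    best_prompt, best_score = None, float("-inf")
    for prompt in candidates:
        # Monte Carlo estimate of the score on a small evaluation subset.
        subset = demos[:n_eval]
        avg = sum(score(prompt, ex) for ex in subset) / len(subset)
        if avg > best_score:
            best_prompt, best_score = prompt, avg
    return best_prompt, best_score
```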
B IMPLEMENTATION DETAILS
Table 1: Detailed description of the 24 instruction induction tasks proposed in Honovich et al. (2022). For convenience, the original table from Honovich et al. (2022) is reproduced here.
Category | Task | Instruction | Demonstration
Spelling | First Letter | Extract the first letter of the input word. | cat → c
Spelling | Second Letter | Extract the second letter of the input word. | cat → a
Spelling | List Letters | Break the input word into letters, separated by spaces. | cat → c a t
Spelling | Starting With | Extract the words starting with a given letter from the input sentence. | The man whose car I hit last week sued me. [m] → man, me
Morphosyntax | Pluralization | Convert the input word to its plural form. | cat → cats
Morphosyntax | Passivization | Write the input sentence in passive form. | The artist introduced the scientist. → The scientist was introduced by the artist.
Syntax | Negation | Negate the input sentence. | Time is finite → Time is not finite.
Lexical Semantics | Antonyms | Write a word that means the opposite of the input word. | won → lost
Lexical Semantics | Synonyms | Write a word with a similar meaning to the input word. | alleged → supposed
Lexical Semantics | Membership | Write all the animals that appear in the given list. | cat, helicopter, cook, whale, frog, lion → frog, cat, lion, whale
Phonetics | Rhymes | Write a word that rhymes with the input word. | sing → ring
Knowledge | Larger Animal | Write the larger of the two given animals. | koala, snail → koala
Semantics | Cause Selection | Find which of the two given cause and effect sentences is the cause. | Sentence 1: The soda went flat. Sentence 2: The bottle was left open. → The bottle was left open.
Semantics | Common Concept | Find a common characteristic for the given objects. | guitars, pendulums, neutrinos → involve oscillations.
Style | Formality | Rephrase the sentence in formal language. | Please call once you get there → Please call upon your arrival.
Numerical | Sum | Sum the two given numbers. | 22 10 → 32
Numerical | Difference | Subtract the second number from the first. | 32 22 → 10
Numerical | Number to Word | Write the number in English words. | 26 → twenty-six
Multilingual | Translation | Translate the word into German / Spanish / French. | game → juego
GLUE | Sentiment Analysis | Determine whether a movie review is positive or negative. | The film is small in scope, yet perfectly formed. → positive
GLUE | Sentence Similarity | Rate the semantic similarity of two input sentences on a scale of 0 - definitely not to 5 - perfectly. | Sentence 1: A man is smoking. Sentence 2: A man is skating. → 0 - definitely not
GLUE | Word in Context | Determine whether an input word has the same meaning in the two input sentences. | Sentence 1: Approach a task. Sentence 2: To approach the city. Word: approach → not the same
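Every task above reduces to a set of input→output string pairs plus a reference instruction. The sketch below shows one way such a task could be represented and scored; the class and the exact-match metric are illustrative assumptions, not the exact evaluation code used for every task (some tasks use other metrics).

```python
# Illustrative data structure for an instruction-induction task from Table 1.
from dataclasses import dataclass

@dataclass
class InductionTask:
    name: str
    demos: list       # [(input, output), ...] used to induce the instruction
    test_pairs: list  # held-out pairs used to measure execution accuracy

def exact_match_accuracy(predict, task):
    """Fraction of held-out pairs answered exactly (a hypothetical, simplified metric)."""
    hits = sum(predict(x).strip().lower() == y.strip().lower()
               for x, y in task.test_pairs)
    return hits / len(task.test_pairs)

# Example in the format of Table 1 ("First Letter").
first_letter = InductionTask(
    name="first_letter",
    demos=[("cat", "c"), ("dog", "d")],
    test_pairs=[("apple", "a"), ("house", "h")],
)
```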
Table 2: Detailed description of BIG-Bench Instruction Induction (BBII), a clean and tractable subset of 21 tasks, each with a clear human-written instruction that applies to all examples in the dataset.
Name | Description | Keywords
causal judgment | Answer questions about causal attribution | causal reasoning, common sense, multiple choice, reading comprehension, social reasoning
disambiguation qa | Clarify the meaning of sentences with ambiguous pronouns | common sense, gender bias, many-shot, multiple choice
dyck languages | Correctly close a Dyck-n word | algebra, arithmetic, logical reasoning, multiple choice
epistemic reasoning | Determine whether one sentence entails the next | common sense, logical reasoning, multiple choice, social reasoning, theory of mind
gender inclusive sentences german | Given a German language sentence that does not use gender-inclusive forms, transform it to gender-inclusive forms | free response, grammar, inclusion, non-English, paraphrase
implicatures | Predict whether Speaker 2's answer to Speaker 1 counts as a yes or as a no | contextual question-answering, multiple choice, reading comprehension, social reasoning, theory of mind
linguistics puzzles | Solve Rosetta Stone-style linguistics puzzles | free response, human-like behavior, linguistics, logical reasoning, reading comprehension
logical fallacy detection | Detect informal and formal logical fallacies | logical reasoning, multiple choice
movie recommendation | Recommend movies similar to the given list of movies | emotional intelligence, multiple choice
navigate | Given a series of navigation instructions, determine whether one would end up back at the starting point | arithmetic, logical reasoning, mathematics, multiple choice
object counting | Questions that involve enumerating objects of different types and asking the model to count them | free response, logical reasoning
operators | Given a mathematical operator definition in natural language, apply it | free response, mathematics, numerical response
presuppositions as nli | Determine whether the first sentence entails or contradicts the second | common sense, logical reasoning, multiple choice
question selection | Given a short answer along with its context, select the most appropriate question for the given short answer | multiple choice, paraphrase, reading comprehension, summarization
ruin names | Select the humorous edit that 'ruins' the input movie or musical artist name | emotional understanding, multiple choice
snarks | Determine which of two sentences is sarcastic | emotional understanding, humor, multiple choice
sports understanding | Determine whether an artificially constructed sentence relating to sports is plausible or implausible | common sense, context-free question answering, domain specific, multiple choice
tense | Modify the tense of a given sentence | free response, paraphrase, syntax
winowhy | Evaluate the reasoning in answering Winograd Schema Challenge questions | causal reasoning, common sense, multiple choice, social reasoning
word sorting | Sort a list of words | algorithms, free response
word unscrambling | Unscramble the given letters to form an English word | free response, implicit reasoning, tokenization
B.1 BIG-BENCH INSTRUCTION INDUCTION (BBII) SELECTION PROCESS
Step 1: BIG-Bench contains a large number of evaluation tasks of varying quality. For example, some tasks contain only the minimum number of examples needed to qualify for submission, while others lack an appropriate human baseline. Therefore, we follow Suzgun et al. (2022) to obtain a clean and tractable subset based on the following criteria.
Table 3: Filtering criteria used to create the BIG-Bench Instruction Induction (BBII) subset.
# Tasks | Criteria
212 | All BIG-Bench tasks
170 | All JSON tasks
127 | After filtering out tasks with more than one sub-task
74 | After filtering out tasks with fewer than 150 examples
67 | After filtering out tasks without human-rater baselines
57 | After filtering out tasks that do not use multiple-choice or exact match as the evaluation metric
Criteria: JSON Tasks.
Discarded tasks: abstraction and reasoning corpus, bbq lite, bias from probabilities, boolean expressions, com2sense, context definition alignment, convinceme, coqa conversational question answering, cycled letters, diverse social bias, dynamic counting, factuality of summary, forecasting subquestions, gender sensitivity chinese, gender sensitivity english, high low game, long context integration, multistep arithmetic, muslim violence bias, program synthesis, protein interacting sites, python programming challenge, question answer creation, roots optimization and games, self awareness, self evaluation courtroom, self evaluation tutoring, simple arithmetic, spelling bee, squad shifts, subject verb agreement, sudoku, taboo, talkdown, text navigation game, training on test set, truthful qa, twenty questions, unqover, web of lies, word problems on sets and graphs, yes no black white.
Criteria: Tasks without sub-task.
Discarded tasks: abstract narrative understanding, arithmetic, authorship verification, bbq lite json, cause and effect, chess state tracking, cifar10 classification, color, conceptual combinations, conlang translation, cs algorithms, elementary math qa, fact checker, gem, goal step wikihow, hhh alignment, indic cause and effect, intersect geometry, kanji ascii, key value maps, language games, linguistic mappings, list functions, logical deduction, metaphor understanding, minute mysteries qa, modified arithmetic, mult data wrangling, multiemo, natural instructions, periodic elements, physics, real or fake text, simp turing concept, simple arithmetic json subtasks, simple ethical questions, strange stories, symbol interpretation, tracking shuffled objects, undo permutation, unit conversion, unit interpretation, unnatural in context learning.
Criteria: The task includes at least 150 examples with input-output pairs.
Discarded tasks: analytic entailment, auto debugging, code line description, codenames, common morpheme, crash blossom, crass ai, cryobiology spanish, dark humor detection, emoji movie, emojis emotion prediction, empirical judgments, english proverbs, english russian proverbs, entailed polarity, entailed polarity hindi, evaluating information essentiality, figure of speech detection, general knowledge, gre reading comprehension, human organs senses, identify math theorems, identify odd metaphor, implicit relations, international phonetic alphabet nli, irony identification, known unknowns, logical args, logical sequence, mathematical induction, misconceptions russian, nonsense words grammar, novel concepts, odd one out, penguins in a table, persian idioms, phrase relatedness, physical intuition, physics questions, repeat copy logic, rephrase, riddle sense, scientific press release, sentence ambiguity, similarities abstraction, simple arithmetic json, simple arithmetic json multiple choice, simple arithmetic multiple targets json, simple text editing, sufficient information, suicide risk, swedish to german proverbs, what is the tao.
Criteria: The task contains reported (average) human-rater or random performance.
Discarded tasks: contextual parametric knowledge conflicts, hinglish toxicity, medical questions russian, parsinlu qa, swahili english proverbs, tellmewhy, which wiki edit.
Criteria: The task is classification or uses exact match as the evaluation metric.
Discarded tasks: auto categorization, few shot nlg, hindi question answering, international phonetic alphabet transliterate, polish sequence labeling, qa wikidata, semantic parsing in context sparc, semantic parsing spider, social support, topical chat.
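For illustration, a rough sketch of how the Table 3 filtering could be applied over BIG-Bench task metadata is shown below. The field names (`format`, `subtasks`, `examples`, `human_rater_baseline`, `metrics`) are assumptions made for this sketch, not the exact BIG-Bench JSON schema.

```python
# Illustrative sketch of the Table 3 filtering steps over BIG-Bench task metadata.
# Field names are assumptions for illustration only.

ALLOWED_METRICS = {"multiple_choice_grade", "exact_str_match"}

def passes_filters(task):
    if task.get("format") != "json":                        # 212 -> 170
        return False
    if len(task.get("subtasks", [])) > 1:                   # 170 -> 127
        return False
    if len(task.get("examples", [])) < 150:                 # 127 -> 74
        return False
    if task.get("human_rater_baseline") is None:            # 74 -> 67
        return False
    if not ALLOWED_METRICS & set(task.get("metrics", [])):  # 67 -> 57
        return False
    return True

def filter_tasks(all_tasks):
    return [t for t in all_tasks if passes_filters(t)]
```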
Step 2: We manually inspect the remaining tasks and divide them into the following three categories. In particular, the Big-Bench Instruction Induction (BBII) subset is the subset we use to evaluate APE in Section 4.2.
• BBII Subset: A subset of BIG-Bench tasks that satisfy the instruction induction format: each example in the dataset can be expressed as a question-answer pair, all examples focus on the same question that can be clearly described by a human instruction, and a human instruction is available in the task JSON file.
• Invalid Format: Tasks that do not match the instruction induction format: each example in the dataset asks a different question, or no clear human instruction is available.
• Out of Scope: Tasks that are outside the scope of this work: not solvable by the authors within 60 minutes, or requiring specialized knowledge.
Table 4: Categorization of the remaining tasks used to create the BIG-Bench Instruction Induction (BBII) subset.
Category | # Tasks | Task Names
BBII Subset | 21 | causal judgment, disambiguation qa, dyck languages, epistemic reasoning, gender inclusive sentences german, implicatures, linguistics puzzles, logical fallacy detection, movie recommendation, navigate, object counting, operators, presuppositions as nli, question selection, ruin names, snarks, sports understanding, tense, winowhy, word sorting, word unscrambling.
Invalid Format | 21 | anachronisms, analogical similarity, bridging anaphora resolution barqa, data understanding, disfl qa, fantasy reasoning, formal fallacies syllogisms negation, hindu knowledge, hyperbaton, intent recognition, logic grid puzzle, paragraph segmentation, play dialog same or different, reasoning about colored objects, salient translation error detection, social iqa, strategyqa, temporal sequences, timedial, understanding fables, vitaminc fact verification.
Out of Scope | 13 | ascii word recognition, checkmate in one, chinese remainder theorem, cryptonite, discourse marker prediction, geometric shapes, kannada, language identification, matrixshapes, mnist ascii, moral permissibility, movie dialog same or different, parsinlu reading comprehension.
Table 5: Raw templates used for model prompting in our experiments
Usage | Template

Zero-shot Evaluation |
Instruction: [INSTRUCTION]
Input: [INPUT]\nOutput: <COMPLETE>

Few-shot Evaluation |
Instruction: [INSTRUCTION]
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...
Input: [INPUT]\nOutput: <COMPLETE>

Forward Generation |
I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...
The instruction was <COMPLETE>

Reverse Generation 1 |
I instructed my friend to <INSERT>. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...

Reverse Generation 2 |
Professor Smith was given the following instructions: <INSERT>\nHere are the Professor's responses:
Q: [INPUT]\nA: [OUTPUT]\n\nQ: [INPUT]\nA: [OUTPUT] ...

Resample Instruction |
Generate a variation of the following instruction while keeping the semantic meaning.
Input: [INSTRUCTION]\nOutput: <COMPLETE>

Zero-shot-CoT |
Instruction: Answer the following question.
Q: [INPUT]\nA: Let's <INSERT>. [OUTPUT]
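To illustrate how these raw templates are instantiated, the sketch below fills the few-shot evaluation template with demonstrations and a test input. The helper function names are ours; only the template structure comes from Table 5.

```python
# Sketch of filling the few-shot evaluation template from Table 5.
# The model is asked to complete the text after the final "Output:".

def format_demo(pair):
    x, y = pair
    return f"Input: {x}\nOutput: {y}"

def few_shot_prompt(instruction, demos, test_input):
    blocks = [f"Instruction: {instruction}", ""]
    blocks += [format_demo(d) + "\n" for d in demos]
    blocks += [f"Input: {test_input}\nOutput:"]  # completion requested here
    return "\n".join(blocks)

print(few_shot_prompt(
    "Extract the first letter of the input word.",
    [("cat", "c"), ("dog", "d")],
    "apple",
))
```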
C ADDITIONAL RESULTS
C.1 INSTRUCTION INDUCTION
Few-shot In-context Learning We evaluated APE-generated instructions in the few-shot in-context learning setting, where we insert the instruction before the in-context demonstrations. These instructions are selected based on zero-shot execution accuracy, and we denote this setting as “Instruction + In-context” in Figure 8. As shown in Figure 8, adding an instruction achieves comparable or better test performance than standard in-context learning on 21 of 24 tasks. Counterintuitively, adding in-context examples for Rhymes, Larger Animal, and Second Letter hurts model performance. We conjecture that the selected instructions overfit the zero-shot learning scenario and thus do not perform well in the few-shot case. Therefore, we also experiment with using few-shot execution accuracy as the selection metric. Figure 14 shows that the few-shot metric achieves comparable or slightly better performance than the zero-shot metric, except on Rhymes. To build an intuitive understanding of what is happening, we provide a qualitative analysis below.
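Concretely, the zero-shot and few-shot selection metrics differ only in whether demonstrations are included in the scoring prompt. The sketch below makes this explicit; the `llm` callable and helper names are hypothetical placeholders, and the prompt format follows the templates in Table 5.

```python
# Sketch of selecting an instruction by zero-shot vs. few-shot execution accuracy.
# `llm(prompt)` is a hypothetical callable returning the model's completion.

def execution_accuracy(llm, instruction, eval_pairs, demos=None):
    hits = 0
    for x, y in eval_pairs:
        prompt = f"Instruction: {instruction}\n\n"
        if demos:  # few-shot metric: prepend in-context demonstrations
            prompt += "".join(f"Input: {dx}\nOutput: {dy}\n\n" for dx, dy in demos)
        prompt += f"Input: {x}\nOutput:"
        hits += llm(prompt).strip() == y.strip()
    return hits / len(eval_pairs)

def select_instruction(llm, candidates, eval_pairs, demos=None):
    # demos=None gives the zero-shot metric; passing demos gives the few-shot metric.
    return max(candidates,
               key=lambda ins: execution_accuracy(llm, ins, eval_pairs, demos))
```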
Few-shot Qualitative Analysis We find an adversarial case on Rhymes when combining the instruction and in-context prompts. Table 8 shows that 4 of the 5 filtered instructions ask the model to echo the input word. These proposals effectively hack the evaluation and achieve near-perfect test accuracy, as every word rhymes with itself. However, adding in-context examples to these instructions creates a misalignment between the instruction (which induces trivial rhymes) and the context (which induces non-trivial rhymes), resulting in a significant drop in performance. If we instead score the instructions with the few-shot metric, this performance drop is alleviated, since the model can choose a more aligned instruction.
[Figure 8: grouped bar charts of execution accuracy (y-axis) on each of the 24 Instruction Induction tasks, comparing three settings: Instruction Only, In-context Only, and Instruction + In-context.]
Figure 8: Few-shot in-context test accuracy on 24 Instruction Induction tasks. APE improves the few-shot in-context learning performance on 21 out of 24 tasks.
