Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.26. URL https://aclanthology.org/2022.acl-long.26.
Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally-guided Bayesian program induction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7aa685b3b1dc1d6780bf36f7340078c9-Paper.pdf.
Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 835–850, 2021.
Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The Gradient, 2021.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.
Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.
Or Honovich, Uri Shaham, Samuel R Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.
Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pp. 1219–1231, 2022.
Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. ACM SIGPLAN Notices, 53(4):436–449, 2018.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022.
Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical bayesian approach. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 639–646. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/568.pdf.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021.
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.
Dougal Maclaurin and Ryan Prescott Adams. Firefly monte carlo: Exact mcmc with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. A machine learning framework for programming by example. In International Conference on Machine Learning, pp. 187–195. PMLR, 2013.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070, 2021.
Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5203–5212, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413, 2016.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.
Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269, 2021.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
Catherine Wong, Kevin M Ellis, Joshua Tenenbaum, and Jacob Andreas. Leveraging language to learn program abstractions and search heuristics. In International Conference on Machine Learning, pp. 11193–11204. PMLR, 2021.
Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
A PROMPT ENGINEERING IN THE WILD
Large models with natural language interfaces, including models for text generation and image synthesis, have seen an increasing amount of public usage in recent years. As finding the right prompt can be difficult for humans, a number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. Among others, see, for example:
• https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/
• https://techcrunch.com/2022/07/29/a-startup-is-charging-1-99-for-strings-of-text-to-feed-to-dall-e-2/
• https://news.ycombinator.com/item?id=32943224
• https://promptomania.com/stable-diffusion-prompt-builder/
• https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion
In this paper we apply APE to generate effective instructions for steering LLMs, but the general framework (Algorithm 1) could be applied to steer other models with natural language interfaces, so long as an appropriate proposal method and scoring function can be designed.
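As a concrete illustration, below is a minimal sketch of this generic propose-and-select loop. The function names (`propose`, `score`) and their signatures are placeholders for whatever proposal method and scoring function a user supplies; this is not the exact implementation used in our experiments.

```python
# Minimal sketch of the generic propose-and-select loop behind Algorithm 1.
# `propose` and `score` are user-supplied placeholders for any model with a
# natural language interface and any task-appropriate scoring function.

def ape_search(propose, score, demos, n_candidates=50, n_eval=20):
    """Return the candidate prompt with the highest estimated score.

    propose(demos, n)      -> list of n candidate prompts (e.g. sampled from an LLM)
    score(prompt, example) -> scalar quality estimate on one held-out example
    demos                  -> list of (input, output) demonstration pairs
    """
    candidates = propose(demos, n_candidates)
    best_prompt, best_score = None, float("-inf")
    for prompt in candidates:
        # Monte Carlo estimate of the score on a small evaluation subset.
        subset = demos[:n_eval]
        avg = sum(score(prompt, ex) for ex in subset) / len(subset)
        if avg > best_score:
            best_prompt, best_score = prompt, avg
    return best_prompt, best_score
```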
B IMPLEMENTATION DETAILS
Table 1: Detailed description of the 24 instruction induction tasks proposed in Honovich et al. (2022). For convenience, the original table from Honovich et al. (2022) is reproduced here.
Category | Task | Instruction | Demonstration
Spelling | First Letter | Extract the first letter of the input word. | cat → c
Spelling | Second Letter | Extract the second letter of the input word. | cat → a
Spelling | List Letters | Break the input word into letters, separated by spaces. | cat → c a t
Spelling | Starting With | Extract the words starting with a given letter from the input sentence. | The man whose car I hit last week sued me. [m] → man, me
Morphosyntax | Pluralization | Convert the input word to its plural form. | cat → cats
Morphosyntax | Passivization | Write the input sentence in passive form. | The artist introduced the scientist. → The scientist was introduced by the artist.
Syntax | Negation | Negate the input sentence. | Time is finite → Time is not finite.
Lexical Semantics | Antonyms | Write a word that means the opposite of the input word. | won → lost
Lexical Semantics | Synonyms | Write a word with a similar meaning to the input word. | alleged → supposed
Lexical Semantics | Membership | Write all the animals that appear in the given list. | cat, helicopter, cook, whale, frog, lion → frog, cat, lion, whale
Phonetics | Rhymes | Write a word that rhymes with the input word. | sing → ring
Knowledge | Larger Animal | Write the larger of the two given animals. | koala, snail → koala
Semantics | Cause Selection | Find which of the two given cause and effect sentences is the cause. | Sentence 1: The soda went flat. Sentence 2: The bottle was left open. → The bottle was left open.
Semantics | Common Concept | Find a common characteristic for the given objects. | guitars, pendulums, neutrinos → involve oscillations.
Style | Formality | Rephrase the sentence in formal language. | Please call once you get there → Please call upon your arrival.
Numerical | Sum | Sum the two given numbers. | 22 10 → 32
Numerical | Difference | Subtract the second number from the first. | 32 22 → 10
Numerical | Number to Word | Write the number in English words. | 26 → twenty-six
Multilingual | Translation | Translate the word into German / Spanish / French. | game → juego
GLUE | Sentiment Analysis | Determine whether a movie review is positive or negative. | The film is small in scope, yet perfectly formed. → positive
GLUE | Sentence Similarity | Rate the semantic similarity of two input sentences on a scale of 0 - definitely not to 5 - perfectly. | Sentence 1: A man is smoking. Sentence 2: A man is skating. → 0 - definitely not
GLUE | Word in Context | Determine whether an input word has the same meaning in the two input sentences. | Sentence 1: Approach a task. Sentence 2: To approach the city. Word: approach → not the same
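Every task above reduces to a set of input→output string pairs plus a reference instruction. The sketch below shows one way such a task could be represented and scored; the class and the exact-match metric are illustrative assumptions, not the exact evaluation code used for every task (some tasks use other metrics).

```python
# Illustrative data structure for an instruction-induction task from Table 1.
from dataclasses import dataclass

@dataclass
class InductionTask:
    name: str
    demos: list       # [(input, output), ...] used to induce the instruction
    test_pairs: list  # held-out pairs used to measure execution accuracy

def exact_match_accuracy(predict, task):
    """Fraction of held-out pairs answered exactly (a hypothetical, simplified metric)."""
    hits = sum(predict(x).strip().lower() == y.strip().lower()
               for x, y in task.test_pairs)
    return hits / len(task.test_pairs)

# Example in the format of Table 1 ("First Letter").
first_letter = InductionTask(
    name="first_letter",
    demos=[("cat", "c"), ("dog", "d")],
    test_pairs=[("apple", "a"), ("house", "h")],
)
```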
Table 2: Detailed description of BIG-Bench Instruction Induction (BBII), a clean and tractable subset of 21 tasks, each with a clear human-written instruction that applies to all examples in the dataset.
Name | Description | Keywords
causal judgment | Answer questions about causal attribution | causal reasoning, common sense, multiple choice, reading comprehension, social reasoning
disambiguation qa | Clarify the meaning of sentences with ambiguous pronouns | common sense, gender bias, many-shot, multiple choice
dyck languages | Correctly close a Dyck-n word | algebra, arithmetic, logical reasoning, multiple choice
epistemic reasoning | Determine whether one sentence entails the next | common sense, logical reasoning, multiple choice, social reasoning, theory of mind
gender inclusive sentences german | Given a German language sentence that does not use gender-inclusive forms, transform it to gender-inclusive forms | free response, grammar, inclusion, non-English, paraphrase
implicatures | Predict whether Speaker 2's answer to Speaker 1 counts as a yes or as a no | contextual question-answering, multiple choice, reading comprehension, social reasoning, theory of mind
linguistics puzzles | Solve Rosetta Stone-style linguistics puzzles | free response, human-like behavior, linguistics, logical reasoning, reading comprehension
logical fallacy detection | Detect informal and formal logical fallacies | logical reasoning, multiple choice
movie recommendation | Recommend movies similar to the given list of movies | emotional intelligence, multiple choice
navigate | Given a series of navigation instructions, determine whether one would end up back at the starting point | arithmetic, logical reasoning, mathematics, multiple choice
object counting | Questions that involve enumerating objects of different types and asking the model to count them | free response, logical reasoning
operators | Given a mathematical operator definition in natural language, apply it | free response, mathematics, numerical response
presuppositions as nli | Determine whether the first sentence entails or contradicts the second | common sense, logical reasoning, multiple choice
question selection | Given a short answer along with its context, select the most appropriate question for the given short answer | multiple choice, paraphrase, reading comprehension, summarization
ruin names | Select the humorous edit that 'ruins' the input movie or musical artist name | emotional understanding, multiple choice
snarks | Determine which of two sentences is sarcastic | emotional understanding, humor, multiple choice
sports understanding | Determine whether an artificially constructed sentence relating to sports is plausible or implausible | common sense, context-free question answering, domain specific, multiple choice
tense | Modify the tense of a given sentence | free response, paraphrase, syntax
winowhy | Evaluate the reasoning in answering Winograd Schema Challenge questions | causal reasoning, common sense, multiple choice, social reasoning
word sorting | Sort a list of words | algorithms, free response
word unscrambling | Unscramble the given letters to form an English word | free response, implicit reasoning, tokenization
B.1 BIG-BENCH INSTRUCTION INDUCTION (BBII) SELECTION PROCESS
Step 1: BIG-Bench contains a large number of evaluation tasks of varying quality. For example, some tasks contain only the minimum number of examples needed to qualify for submission, while others lack an appropriate human baseline. Therefore, we follow Suzgun et al. (2022) to obtain a clean and tractable subset based on the following criteria.
Table 3: Filtering criteria used to create the BIG-Bench Instruction Induction (BBII) subset.
# Tasks | Criteria
212 | All BIG-Bench tasks
170 | All JSON tasks
127 | After filtering out tasks with more than one sub-task
74 | After filtering out tasks with fewer than 150 examples
67 | After filtering out tasks without human-rater baselines
57 | After filtering out tasks that do not use multiple-choice or exact match as the evaluation metric
Criteria: JSON Tasks.
Discarded tasks: abstraction and reasoning corpus, bbq lite, bias from probabilities, boolean expressions, com2sense, context definition alignment, convinceme, coqa conversational question answering, cycled letters, diverse social bias, dynamic counting, factuality of summary, forecasting subquestions, gender sensitivity chinese, gender sensitivity english, high low game, long context integration, multistep arithmetic, muslim violence bias, program synthesis, protein interacting sites, python programming challenge, question answer creation, roots optimization and games, self awareness, self evaluation courtroom, self evaluation tutoring, simple arithmetic, spelling bee, squad shifts, subject verb agreement, sudoku, taboo, talkdown, text navigation game, training on test set, truthful qa, twenty questions, unqover, web of lies, word problems on sets and graphs, yes no black white.
Criteria: Tasks without sub-task.
Discarded tasks: abstract narrative understanding, arithmetic, authorship verification, bbq lite json, cause and effect, chess state tracking, cifar10 classification, color, conceptual combinations, conlang translation, cs algorithms, elementary math qa, fact checker, gem, goal step wikihow, hhh alignment, indic cause and effect, intersect geometry, kanji ascii, key value maps, language games, linguistic mappings, list functions, logical deduction, metaphor understanding, minute mysteries qa, modified arithmetic, mult data wrangling, multiemo, natural instructions, periodic elements, physics, real or fake text, simp turing concept, simple arithmetic json subtasks, simple ethical questions, strange stories, symbol interpretation, tracking shuffled objects, undo permutation, unit conversion, unit interpretation, unnatural in context learning.
Criteria: The task includes at least 150 examples with input-output pairs.
Discarded tasks: analytic entailment, auto debugging, code line description, codenames, common morpheme, crash blossom, crass ai, cryobiology spanish, dark humor detection, emoji movie, emojis emotion prediction, empirical judgments, english proverbs, english russian proverbs, entailed polarity, entailed polarity hindi, evaluating information essentiality, figure of speech detection, general knowledge, gre reading comprehension, human organs senses, identify math theorems, identify odd metaphor, implicit relations, international phonetic alphabet nli, irony identification, known unknowns, logical args, logical sequence, mathematical induction, misconceptions russian, nonsense words grammar, novel concepts, odd one out, penguins in a table, persian idioms, phrase relatedness, physical intuition, physics questions, repeat copy logic, rephrase, riddle sense, scientific press release, sentence ambiguity, similarities abstraction, simple arithmetic json, simple arithmetic json multiple choice, simple arithmetic multiple targets json, simple text editing, sufficient information, suicide risk, swedish to german proverbs, what is the tao.
Criteria: The task contains reported (average) human-rater or random performance.
Discarded tasks: contextual parametric knowledge conflicts, hinglish toxicity, medical questions russian, parsinlu qa, swahili english proverbs, tellmewhy, which wiki edit.
Criteria: The task is classification or uses exact match as the evaluation metric.
Discarded tasks: auto categorization, few shot nlg, hindi question answering, international phonetic alphabet transliterate, polish sequence labeling, qa wikidata, semantic parsing in context sparc, semantic parsing spider, social support, topical chat.
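For illustration, a rough sketch of how the Table 3 filtering could be applied over BIG-Bench task metadata is shown below. The field names (`format`, `subtasks`, `examples`, `human_rater_baseline`, `metrics`) are assumptions made for this sketch, not the exact BIG-Bench JSON schema.

```python
# Illustrative sketch of the Table 3 filtering steps over BIG-Bench task metadata.
# Field names are assumptions for illustration only.

ALLOWED_METRICS = {"multiple_choice_grade", "exact_str_match"}

def passes_filters(task):
    if task.get("format") != "json":                        # 212 -> 170
        return False
    if len(task.get("subtasks", [])) > 1:                   # 170 -> 127
        return False
    if len(task.get("examples", [])) < 150:                 # 127 -> 74
        return False
    if task.get("human_rater_baseline") is None:            # 74 -> 67
        return False
    if not ALLOWED_METRICS & set(task.get("metrics", [])):  # 67 -> 57
        return False
    return True

def filter_tasks(all_tasks):
    return [t for t in all_tasks if passes_filters(t)]
```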
Step 2: We manually inspect the remaining tasks and divide them into the following three categories. In particular, the Big-Bench Instruction Induction (BBII) subset is the subset we use to evaluate APE in Section 4.2.
• BBII Subset: A subset of BIG-Bench tasks that satisfy the instruction induction format: each example in the dataset can be expressed as a question-answer pair, all examples focus on the same question that can be clearly described by a human instruction, and a human instruction is available in the task JSON file.
• Invalid Format: Tasks that do not match the instruction induction format: each example in the dataset asks a different question, or no clear human instruction is available.
• Out of Scope: Tasks that are outside the scope of this work: not solvable by the authors within 60 minutes, or requiring specialized knowledge.
Table 4: Categorization of the remaining tasks used to create the BIG-Bench Instruction Induction (BBII) subset.
Category | # Tasks | Task Names
BBII Subset | 21 | causal judgment, disambiguation qa, dyck languages, epistemic reasoning, gender inclusive sentences german, implicatures, linguistics puzzles, logical fallacy detection, movie recommendation, navigate, object counting, operators, presuppositions as nli, question selection, ruin names, snarks, sports understanding, tense, winowhy, word sorting, word unscrambling.
Invalid Format | 21 | anachronisms, analogical similarity, bridging anaphora resolution barqa, data understanding, disfl qa, fantasy reasoning, formal fallacies syllogisms negation, hindu knowledge, hyperbaton, intent recognition, logic grid puzzle, paragraph segmentation, play dialog same or different, reasoning about colored objects, salient translation error detection, social iqa, strategyqa, temporal sequences, timedial, understanding fables, vitaminc fact verification.
Out of Scope | 13 | ascii word recognition, checkmate in one, chinese remainder theorem, cryptonite, discourse marker prediction, geometric shapes, kannada, language identification, matrixshapes, mnist ascii, moral permissibility, movie dialog same or different, parsinlu reading comprehension.
Table 5: Raw templates used for model prompting in our experiments
Usage | Template

Zero-shot Evaluation |
Instruction: [INSTRUCTION]
Input: [INPUT]\nOutput: <COMPLETE>

Few-shot Evaluation |
Instruction: [INSTRUCTION]
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...
Input: [INPUT]\nOutput: <COMPLETE>

Forward Generation |
I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...
The instruction was <COMPLETE>

Reverse Generation 1 |
I instructed my friend to <INSERT>. The friend read the instruction and wrote an output for every one of the inputs.\nHere are the input-output pairs:
Input: [INPUT]\nOutput: [OUTPUT]\n\nInput: [INPUT]\nOutput: [OUTPUT] ...

Reverse Generation 2 |
Professor Smith was given the following instructions: <INSERT>\nHere are the Professor's responses:
Q: [INPUT]\nA: [OUTPUT]\n\nQ: [INPUT]\nA: [OUTPUT] ...

Resample Instruction |
Generate a variation of the following instruction while keeping the semantic meaning.
Input: [INSTRUCTION]\nOutput: <COMPLETE>

Zero-shot-CoT |
Instruction: Answer the following question.
Q: [INPUT]\nA: Let's <INSERT>. [OUTPUT]
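To illustrate how these raw templates are instantiated, the sketch below fills the few-shot evaluation template with demonstrations and a test input. The helper function names are ours; only the template structure comes from Table 5.

```python
# Sketch of filling the few-shot evaluation template from Table 5.
# The model is asked to complete the text after the final "Output:".

def format_demo(pair):
    x, y = pair
    return f"Input: {x}\nOutput: {y}"

def few_shot_prompt(instruction, demos, test_input):
    blocks = [f"Instruction: {instruction}", ""]
    blocks += [format_demo(d) + "\n" for d in demos]
    blocks += [f"Input: {test_input}\nOutput:"]  # completion requested here
    return "\n".join(blocks)

print(few_shot_prompt(
    "Extract the first letter of the input word.",
    [("cat", "c"), ("dog", "d")],
    "apple",
))
```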
C ADDITIONAL RESULTS
C.1 INSTRUCTION INDUCTION
Few-shot In-context Learning We evaluated APE-generated instructions in the few-shot in-context learning setting, where we insert the instruction before the in-context demonstrations. These instructions are selected based on zero-shot execution accuracy, and we denote this setting as “Instruction + In-context” in Figure 8. As shown in Figure 8, adding an instruction achieves comparable or better test performance than standard in-context learning on 21 of 24 tasks. Counterintuitively, adding in-context examples for Rhymes, Larger Animal, and Second Letter hurts model performance. We conjecture that the selected instructions overfit the zero-shot learning scenario and thus do not perform well in the few-shot case. Therefore, we also experiment with using few-shot execution accuracy as the selection metric. Figure 14 shows that the few-shot metric achieves comparable or slightly better performance than the zero-shot metric, except on Rhymes. To build an intuitive understanding of what is happening, we provide a qualitative analysis below.
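Concretely, the zero-shot and few-shot selection metrics differ only in whether demonstrations are included in the scoring prompt. The sketch below makes this explicit; the `llm` callable and helper names are hypothetical placeholders, and the prompt format follows the templates in Table 5.

```python
# Sketch of selecting an instruction by zero-shot vs. few-shot execution accuracy.
# `llm(prompt)` is a hypothetical callable returning the model's completion.

def execution_accuracy(llm, instruction, eval_pairs, demos=None):
    hits = 0
    for x, y in eval_pairs:
        prompt = f"Instruction: {instruction}\n\n"
        if demos:  # few-shot metric: prepend in-context demonstrations
            prompt += "".join(f"Input: {dx}\nOutput: {dy}\n\n" for dx, dy in demos)
        prompt += f"Input: {x}\nOutput:"
        hits += llm(prompt).strip() == y.strip()
    return hits / len(eval_pairs)

def select_instruction(llm, candidates, eval_pairs, demos=None):
    # demos=None gives the zero-shot metric; passing demos gives the few-shot metric.
    return max(candidates,
               key=lambda ins: execution_accuracy(llm, ins, eval_pairs, demos))
```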
Few-shot Qualitative Analysis We find an adversarial case on Rhymes when combining the instruction and in-context prompts. Table 8 shows that 4 of the 5 filtered instructions ask the model to echo the input word. These proposals effectively hack the evaluation and achieve near-perfect test accuracy, as every word rhymes with itself. However, adding in-context examples to these instructions creates a misalignment between the instruction (which induces trivial rhymes) and the context (which induces non-trivial rhymes), resulting in a significant drop in performance. If we instead score the instructions with the few-shot metric, this performance drop is alleviated, since the model can choose a more aligned instruction.
[Figure 8: grouped bar charts of execution accuracy (y-axis) on each of the 24 Instruction Induction tasks, comparing three settings: Instruction Only, In-context Only, and Instruction + In-context.]
Figure 8: Few-shot in-context test accuracy on 24 Instruction Induction tasks. APE improves the few-shot in-context learning performance on 21 out of 24 tasks.
