The New Era of AI: How Should Large Models “Rack Their Brains”?

OpenMMLab
Aug 24, 2023 · 9 min read


Follow the official OpenMMLab Twitter account to stay updated with our latest news.

Since the launch of ChatGPT, Large Language Models (LLMs) have garnered significant attention in the AI field. However, despite their impressive capabilities, LLMs still face substantial challenges on complex multi-step reasoning tasks, such as mathematical word problems and common-sense reasoning. This has led to the inclusion of more demanding reasoning datasets, such as GSM8K and MATH, in large-model evaluations.

To address the shortcomings of LLMs in complex reasoning, researchers have been actively developing new techniques. Among these, “Chain-of-Thought Prompting” has drawn particular attention. The approach guides the model to break a complex multi-step problem into more manageable intermediate steps, helping it understand the problem and solve it accurately. In practice, Chain-of-Thought prompting has delivered significant gains on a variety of reasoning tasks, especially arithmetic reasoning.

What is Chain of Thought?

Imagine a scenario where a teacher presents a challenging reasoning problem to a student named Tom: "In a cage on a farm there are chickens and rabbits, 36 animals in total. Together they have 100 legs. How many chickens and how many rabbits are there?" Suppose Tom has no paper and pen and must give an answer directly. He attempts: "There are 20 chickens and 16 rabbits." ❌ Unfortunately, there's an error in his mental calculation.

The teacher gives Tom a second chance, allowing him to solve the problem step by step using paper and pen. Tom records the intermediate steps:

Let x be the number of chickens, and y be the number of rabbits.
x + y = 36 (total number of animals is 36)
2x + 4y = 100 (total number of legs is 100)
Solve the first equation for one variable, for example, x = 36 - y.
Substitute the value of x into the second equation:
2(36 - y) + 4y = 100
72 - 2y + 4y = 100
72 + 2y = 100
2y = 28
y = 14
Substitute the value of y back into the first equation to find x:
x = 36 - 14
x = 22
Hence, the answer is:
There are 22 chickens and 14 rabbits. ✅

Through step-by-step reasoning, Tom successfully arrives at the correct answer and earns the teacher’s approval :>

Similarly, when using large language models, you can prompt them step by step, just as the teacher guided Tom, so that the model works through complex problems. This is the essence of the earliest form of Chain of Thought: few-shot Chain of Thought. It prompts the model with a few-shot example that includes intermediate reasoning steps before the answer. For example:

Few-Shot CoT: Unveiling the Chain of Thought

Chain of Thought Prompting Elicits Reasoning in Large Language Models.

The early version of Chain of Thought was designed for Few-Shot prompts. Compared to a standard Few-Shot prompt, a Chain of Thought Few-Shot prompt only adds reasoning steps before the answer. For instance, an original prompt might look like:

Sample problem + Answer + Actual problem

Feeding this to the model directly yields the answer to the actual problem. With Chain of Thought added, the prompt becomes:

Sample problem + Sample reasoning steps + Answer + Actual problem

The sample reasoning steps are the Chain of Thought used to solve the sample problem. They guide the model to generate intermediate steps before outputting the answer, breaking the original problem into multiple sub-problems and aiding the model’s “thinking”. Not only does this method significantly enhance the model’s reasoning ability without requiring any model modification, it also takes effect immediately. With the PaLM-540B model, Chain of Thought prompting brought an almost threefold improvement compared to the traditional approach of boosting performance through fine-tuning. Chain of Thought can be seen as opening new doors for enhancing large-model reasoning.
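As a concrete illustration, here is a minimal sketch of such a Few-Shot CoT prompt in Python. The sample problem, its reasoning steps, and the commented-out generate() call are illustrative placeholders rather than material from the original paper:

# A minimal sketch of a Few-Shot CoT prompt. The example problems and the
# generate() function are illustrative placeholders; any LLM completion API
# could be substituted.

few_shot_cot_prompt = (
    "Q: A farm cage holds chickens and rabbits, 36 animals in total, "
    "with 100 legs altogether. How many chickens and rabbits are there?\n"
    "A: Let x be the number of chickens and y the number of rabbits. "
    "x + y = 36 and 2x + 4y = 100. Substituting x = 36 - y gives "
    "72 + 2y = 100, so y = 14 and x = 22. "
    "The answer is 22 chickens and 14 rabbits.\n"
    "Q: A parking lot holds cars and motorcycles, 20 vehicles in total, "
    "with 70 wheels altogether. How many cars and motorcycles are there?\n"
    "A:"
)

# answer = generate(few_shot_cot_prompt)  # hypothetical LLM call

Because the in-context example already demonstrates intermediate reasoning, the model tends to produce its own reasoning steps before the final answer for the actual problem.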

So, is there a simpler way to implement Chain of Thought, perhaps even a Zero-Shot version? Indeed, there is.

Zero-Shot CoT: Simple yet Effective

Large Language Models Are Zero-Shot Reasoners

Perhaps the simplest CoT method:

A Zero-Shot CoT prompt is implemented simply by appending “Let’s think step by step” after the question, with no additional examples required. This explicitly instructs the model to work through the problem step by step, improving its problem-solving ability.

This is a simple and effective way to enhance the model’s reasoning abilities, akin to how the teacher helped Tom by breaking down and solving the problem step by step. When using large models, this approach serves as a tool to prompt the model to decompose and answer complex questions.
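A minimal sketch of this in Python, assuming a hypothetical generate() function that wraps any LLM completion API:

# Zero-Shot CoT: no examples are provided, only the trigger phrase appended
# after the question. generate() is a hypothetical placeholder for an LLM call.

question = (
    "A farm cage holds chickens and rabbits, 36 animals in total, "
    "with 100 legs altogether. How many chickens and rabbits are there?"
)

zero_shot_cot_prompt = question + "\nLet's think step by step."

# answer = generate(zero_shot_cot_prompt)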

On the MultiArith dataset, the paper experimented with a variety of similar trigger phrases, which showed varying degrees of effectiveness.

Thus, in practice, when employing the Zero-Shot CoT approach for reasoning tasks, it’s essential to experiment with various prompts tailored to the dataset’s characteristics.

Self-Consistency (SC): Multi-Path Reasoning + Voting Mechanism

Self-Consistency Improves Chain of Thought Reasoning in Language Models

The SC method, developed by the Google Brain team, enhances reasoning accuracy by generating multiple different reasoning paths for a given problem and voting for the most frequent answer among these paths.

In the SC method, for a single problem, multiple CoT results are generated. This is equivalent to having the model generate reasoning steps and answers multiple times. The final answer is obtained through majority voting. For example, if k = 3, the generated paths and answers could be 18, 18, and 26. Taking the majority yields 18.

This method excels in complex reasoning tasks; however, compared to the regular CoT approach, it requires more time and resources. Does that mean the more samples we take, the better the result?

According to experimental results from the paper, the performance improvement of the SC method starts to plateau when the sampling count “k” ranges from 20 to 40 in various reasoning datasets. In most cases, the datasets tend to saturate at around 40 samples. However, conducting 40 samples requires significant resource consumption. Therefore, when using the SC method, it’s important to choose an appropriate sampling count based on the specific needs and available resources. This allows for a balance between effect enhancement and resource utilization.
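As a rough sketch (not the paper’s reference implementation), the sampling-and-voting logic can be expressed in a few lines of Python. Here sample_cot_answer() is a hypothetical function that runs one CoT generation with sampling enabled (temperature > 0) and returns the extracted final answer:

from collections import Counter

def self_consistency(question, sample_cot_answer, k=20):
    # Sample k independent CoT reasoning paths (sampling keeps the paths
    # diverse), keep only their final answers, and return the most frequent one.
    answers = [sample_cot_answer(question) for _ in range(k)]
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer

# Example: three sampled paths ending in 18, 18, 26 -> majority answer is 18.
# final = self_consistency(question, sample_cot_answer, k=3)

The sampling count k is the knob discussed above: larger k improves robustness up to a point, at the cost of proportionally more generations.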

Tree-of-Thoughts (ToT): Multi-dimensional Thinking for Comprehensive Problem Solving

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

The ToT method differs from traditional CoT approaches: it lets the model consider multiple different reasoning paths concurrently, evaluating intermediate reasoning steps and making global choices through lookahead or backtracking when necessary. The result is a tree-like reasoning structure.

Specifically, it comprises the following four stages:

1. Thought Decomposition: Based on the nature of the problem, break it down into multiple intermediate steps. Each step could be a phrase, an equation, or a writing plan, depending on the problem’s characteristics.

2. Thought Generation: Assuming that solving the problem requires k steps, there are two ways to generate the reasoning content for each step:

Independent Sampling: For each state, the model independently samples k reasoning candidates from a CoT prompt, without relying on the other candidates.

Sequential Generation: Prompts are used to guide the generation of reasoning content step by step, where each piece of content may depend on the previous one.

3. Heuristic Evaluation: Heuristic methods are used to evaluate how much each generated piece of reasoning contributes to solving the problem. This self-assessment relies on the language model’s own feedback, achieved by designing prompts that have the model score the generated results.

4. Search Algorithm: Based on the chosen generation and evaluation methods, an appropriate search algorithm is selected. For example, breadth-first search (BFS) or depth-first search (DFS) can systematically explore the tree of thoughts, with lookahead and backtracking. A sketch of such a search loop follows this list.
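To make the four stages concrete, here is a minimal, illustrative sketch of a breadth-first ToT search in Python. It is not the authors’ implementation: propose_thoughts() and score_thought() are hypothetical placeholders for the prompt-based generation and evaluation described above, and beam_width controls how many candidate chains survive each step:

def tree_of_thoughts_bfs(problem, propose_thoughts, score_thought,
                         steps=3, beam_width=5):
    # Each state is the partial chain of thoughts produced so far.
    frontier = [[]]
    for _ in range(steps):
        candidates = []
        for state in frontier:
            # Thought generation: the model proposes next-step thoughts.
            for thought in propose_thoughts(problem, state):
                candidates.append(state + [thought])
        # Heuristic evaluation: the model scores each candidate chain.
        candidates.sort(key=lambda s: score_thought(problem, s), reverse=True)
        # Keep only the most promising chains (the BFS "beam").
        frontier = candidates[:beam_width]
    return frontier[0] if frontier else None

In practice, both placeholder functions would wrap calls to the underlying LLM with suitably designed prompts.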

Using the example of the 24-Game, where the task is to determine if four given integer values can be combined using +, -, ×, and ÷ operations to yield a result of 24, the ToT method can be applied as follows:

1. The problem can be divided into three steps, each producing an intermediate equation. For example, given the numbers 4 9 10 13, the following three steps solve the problem:

13 - 9 = 4 (left: 4 4 10);

10 - 4 = 6 (left: 4 6);

4 * 6 = 24 (left: 24)

2. In each step, multiple candidate equations are generated using a few-shot prompt.

3. For each candidate in each step, the model itself evaluates the candidate, for example judging whether the remaining numbers 10, 13, and 13 can still reach 24 (here: impossible). The next step proceeds from the candidates with higher scores (see the prompt sketch after this list).

4. Generation and evaluation (steps 2 and 3) are repeated at each of the three steps to produce and score the intermediate equations.

5. A breadth-first search (BFS) is then conducted to sample feasible solution paths.
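For steps 2 and 3 above, both generation and evaluation are driven by prompts. The templates below are a hedged illustration of what such prompts might look like for the 24-Game; they paraphrase the idea of the paper’s propose and value prompts rather than reproduce them verbatim:

# Illustrative prompt templates for the 24-Game, paraphrasing the idea of
# the propose/value prompts (not copied from the original repository).

propose_prompt = (
    "Numbers remaining: {numbers}\n"
    "List possible next operations and the numbers left afterwards, "
    "one per line, e.g. '13 - 9 = 4 (left: 4 4 10)'."
)

value_prompt = (
    "Numbers remaining: {numbers}\n"
    "Judge whether these numbers can still reach 24: "
    "answer 'sure', 'likely', or 'impossible'."
)

# Example: value_prompt.format(numbers="10 13 13") should be judged impossible.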

Regarding the effectiveness of ToT, take the 24-Game as an example: with GPT-4 as the base model, ToT significantly outperforms general CoT methods. On this task, SC and Few-Shot CoT achieve less than 10% accuracy, while ToT reaches 74%.

However, in terms of usability, the ToT method requires familiarity with the task and the ability to break it down into logical, manageable steps, as well as the design of corresponding generation and evaluation methods for each step and the use of DFS or BFS to sample solutions. It also requires a base model that follows prompt instructions well, such as the GPT-4 model used by the paper’s authors. If you can meet these requirements, ToT can serve as a powerful tool for solving complex problems.

Make your model even stronger

OpenCompass is a comprehensive evaluation platform for large models launched by the Shanghai Artificial Intelligence Laboratory. OpenCompass currently supports a range of CoT (Chain-of-Thought) techniques, including those mentioned earlier, from Zero-Shot CoT to Tree-of-Thoughts.

Leveraging OpenCompass’ extensive evaluation capabilities, you can effortlessly conduct diverse CoT evaluations on over 300,000 questions from 50+ datasets for more than 20 open-source large models as well as OpenAI API models. Below is an example of testing the SC method on the GSM8K dataset with OpenCompass:

# Configuration for the SC version of the GSM8K test can be found in:
# opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py.

gsm8k_infer_cfg = dict(
    inferencer=dict(
        type=SCInferencer,  # Replace GenInferencer with SCInferencer
        # Set generation parameters to ensure diverse model outputs;
        # currently applicable for models loaded from HuggingFace.
        generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40),
        sc_size=SAMPLE_SIZE,  # Number of SC paths to sample
    )
)

gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)

In addition to implementing these methods, OpenCompass has introduced some new features. For instance, while the official ToT repository currently only supports OpenAI API models, OpenCompass extends this support to common open-source large models. This makes it easy to experiment with customizations and classic datasets across models of different scales and types.

Below is a comparison of SC and ToT evaluation results obtained using OpenCompass:

OpenCompass aims to integrate the powerful tool of CoT to help the community unlock the immense potential of large language models across various tasks. With more researchers and practitioners joining the effort, AI technology is expected to become smarter, more efficient, and more practical in the future.

OpenCompass Project Link:

https://github.com/internLM/OpenCompass/

CoT Tutorial: https://opencompass.readthedocs.io/zh_CN/latest/prompt/chain_of_thought.html

Large Model Leaderboard: https://opencompass.org.cn/leaderboard-llm

We welcome everyone to submit evaluation requests on OpenCompass.
