[LLM] o1 reverse engineering
- Minwu Kim
- September 26, 2024
- 35 min read
Last modified: September 27, 2024
CORE IDEA: Explicitly pre-define some "thought-factors" of human reasoning ("Decompose Problem," "Restate Objective," "Verify Result," "Correct Error," "Propose Hypothesis," and so on), so that the model has a well-defined discrete action space. (Somewhat resembles the idea of the Self-Discover paper.)
The launch of OpenAI's o1 can be described as a groundbreaking event. Although there have been various rumors about Q*, Strawberry, and others for a long time, and the general direction of using reinforcement learning (RL) to enhance logical reasoning ability was more or less expected, the integration of LLM (Large Language Model) and RL to generate Hidden Chain of Thought (Hidden CoT) is something that very few people could have anticipated. Moreover, the current results seem quite promising.
OpenAI is increasingly moving towards a closed approach. If you look at the official announcement of o1, apart from "reinforcement learning generating Hidden CoT," there is essentially no other technically substantial content. Sora, at the very least, provided a rough technical framework diagram, and some hidden technical points are hinted at in the text. Careful readers can find many clues, which, when pieced together, reveal a faint outline of the underlying technology (if interested, you can refer to my previous analysis: "Demystifying Technical Mystification: A Reverse Engineering Analysis of Sora's Key Technologies"). Additionally, although there are many public papers using LLM + RL to enhance the reasoning ability of large models, almost none focus on generating Hidden CoT, leaving very little directly available for reference, which adds further difficulty in analyzing o1.
So, is it impossible to figure it out? Not necessarily. By paying closer attention to details and making some professional inferences, I believe traces can still be found. This article will attempt to analyze the technical principles of o1 in a relatively easy-to-understand way, aiming to answer the following questions:
Besides significantly enhancing complex logical reasoning abilities, what other important significance does o1 have?
What might be the general process for training o1?
Is o1 a single model or multiple models?
How is the RL state space in o1 defined? How is the action space defined? What type of reward model might be used? What kind of training data might be employed? What might the model structure look like after combining LLM and RM (Reward Model)? And so on.
For convenience, I will refer to the inferred model in this article as "Reverse-o1." Of course, some parts of this involve judgment based on evidence, while others are inferred based on mainstream technologies. As long as OpenAI does not officially announce the technical framework, there might be a thousand solutions proposed by a thousand practitioners. I mainly refer to the approach of AlphaZero and try to integrate LLM and RL based on this. Many of my views are subjective, so take them cautiously.
The Significance of OpenAI o1
I personally believe that OpenAI o1 is a major breakthrough in the field of large model technology. Apart from significantly enhancing complex logical reasoning abilities, it also has other important implications and value, which I will analyze below.
First, o1 introduces self-reflection and error correction capabilities to large models, which I think is particularly valuable. Models like GPT-4 output answers token by token, and when the output is lengthy, some tokens in the middle will inevitably be wrong. Even if the LLM later realizes that an earlier output was incorrect, it has to keep going in the same vein, because it has no mechanism to revise tokens once they have been generated (this is one of the sources of hallucination in large models: the LLM piles further errors on top of the first mistake so that the logic still appears consistent).
However, in o1's "thinking," i.e., the process of generating Hidden CoT, if you analyze the examples of Hidden CoT provided on OpenAI's official website, you will find that it can indeed realize previous mistakes and automatically make corrections. This self-recognition and correction of errors are crucial for the LLM to engage in long-chain thinking and solve complex tasks, effectively surpassing a high threshold that previously limited LLMs.
Second is the so-called new RL scaling law. OpenAI's PR may emphasize this point more, and various interpretations also focus on it. I speculate that o1's RL probably used either a relatively complex tree search like AlphaGo's MCTS (Monte Carlo Tree Search), or a simpler form of tree expansion such as generating multiple candidates and selecting the best one (Best-of-N Sampling); applied repeatedly, Best-of-N itself amounts to a simple tree search. Both might even have been used together. Regardless, it is highly likely that some tree search structure was used. Although the CoT is linear, that is only the output; it does not mean the internal thought process is necessarily linear. I believe solving complex problems is difficult with linear thinking alone; a tree structure is almost inevitable.
Some might ask, do you have evidence that o1 most likely used a search tree structure? I have no direct evidence, but I can infer it. My judgment is based on o1 mini. Previous research concluded that although small models have strong language abilities and decent world knowledge, their logical reasoning abilities are difficult to enhance. Even if you try to internalize logical reasoning into a small model's parameters through methods like distillation, the effect is limited. The largest gap between small and large models is in logical reasoning. This means that relying purely on parameter internalization to enhance the logical reasoning of small models may have limited impact. However, o1 mini is clearly a small model with very strong complex logical reasoning abilities. It also seems to be able to adjust its logical reasoning ability through configuration (the so-called inference-time scaling law). If you understand how AlphaGo operates, you will find that these are typical features of a search tree, where you can enhance the model's ability by controlling the search space size.
Personally, though, I think calling the ability to control how the tree is expanded via parameter settings (such as search width and depth) a "scaling law" is a bit of a stretch. By that standard, the scaling law already existed when AlphaGo was introduced in 2016.
However, regardless of what we call it, it is undeniable that this method has excellent scalability. Whether during the RL training phase or the LLM's inference phase, simply adjusting parameter configurations to increase the width and depth of the tree search can enhance performance by increasing computational power. Its scalability is both flexible and effective. From this perspective, o1 is indeed of great significance because it proves that integrating LLM and tree search is a viable path, significantly raising the upper limit of what LLM models can achieve in terms of AGI.
Third, the rise of small models could truly become possible after o1. Small models have been quite popular in the past half year, but from the perspective of capability acquisition, they are still limited by an upper bound—logical reasoning ability. As mentioned earlier, small models are characterized by strong language abilities comparable to large models, relatively weaker world knowledge (though this can be improved by providing more data), and difficulties in enhancing logical reasoning due to their smaller scale.
Therefore, the optimization focus for small models is on world knowledge and logical reasoning. From the performance of o1 mini (weak world knowledge, strong logical reasoning), it can be inferred that we can develop small models using a "Divide-and-Conquer of Ability" (DCA) approach (as illustrated above): decoupling language ability, world knowledge, and logical reasoning. Language ability would rely on the small model itself, logical reasoning would depend on deep thinking abilities obtained through RL similar to o1, and world knowledge could be enhanced through an external RAG (Retrieval-Augmented Generation) system.
Through "Divide-and-Conquer of Ability," small models could potentially match the capabilities of the most powerful current large models. This effectively clears the obstacles for small models, making it much easier to develop them. Additionally, developing small models (SLMs) is relatively cost-effective, allowing many individuals and organizations to pursue this. Therefore, it can be predicted that this DCA model will become widespread, forming a new paradigm for developing small models.
Fourth, o1 may trigger a new paradigm in "safety alignment." In terms of safety alignment, o1 likely adopts a strategy similar to Anthropic's "AI Constitution." This means providing some safety guidelines to specify which behaviors are allowed and which are not. With the enhanced logical reasoning ability of o1, its adherence to these rules has also significantly improved, making its safety capabilities much stronger than GPT-4. This may inspire a new model for safety alignment: one can first strengthen the model's logical reasoning ability and then adopt an "AI Constitution"-like approach, as o1 has shown that this can greatly enhance a large model's safety capabilities.
Fifth, the "Reinforcement Learning + LLM" generalization ability may not be limited to scientific fields. Reinforcement learning is suitable for solving complex problems with clearly defined rewards, typically in subjects like mathematics, physics, and coding, where there are standard answers. Therefore, many people question whether o1 can generalize to broader fields. Indeed, whether o1's thinking ability can generalize to fields without clear-cut answers and difficult-to-quantify rewards is key to its development. If it generalizes well, it opens up vast possibilities; if not, its applicability will be limited.
I speculate that OpenAI may have already discovered some reward definition methods for non-mathematical disciplines and expanded this method to other domains through RL. Since o1 can understand and follow safety rules, using an "AI Constitution" approach to address safety issues, we can infer a method to assign rewards in fields with fuzzy standards: by writing textual standards or rules for those difficult-to-quantify fields, allowing the large model to understand and follow them to determine reward levels. For example, in writing an essay, one could list the standards of a good article (such as clear structure and elegant writing) and let the model assign rewards based on these standards. This could extend to many other fields.
Of course, implementing this might require a step-by-step approach: first use tasks with well-defined rewards in mathematics and physics to enhance the model's complex reasoning ability to a certain level, so it can then understand the rules, before moving to fields with more ambiguous rewards. (This is just my speculation, without direct evidence.)
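To make the rubric idea concrete, here is a minimal sketch of what such a rule-based reward might look like. Everything here is an assumption for illustration: `call_grader_model` is a hypothetical helper that queries a strong LLM as a judge, and the rubric text is invented; nothing about this is confirmed by OpenAI.

```python
# Hypothetical rubric-based reward for a fuzzy domain (essay writing).
# `call_grader_model` is an assumed callable that sends a prompt to a strong
# LLM and returns its raw text response; it is not a real API.

RUBRIC = """Score the essay from 0 to 10 on each criterion:
1. Clear structure and logical flow.
2. Elegant, precise writing.
3. Faithfulness to the stated topic.
Return only three integers separated by spaces."""

def rubric_reward(essay: str, call_grader_model) -> float:
    """Turn a written rubric into a scalar reward in [0, 1]."""
    response = call_grader_model(f"{RUBRIC}\n\nEssay:\n{essay}")
    scores = [int(tok) for tok in response.split()[:3]]
    return sum(scores) / (10.0 * len(scores))  # normalize to [0, 1]
```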
From the analysis above, it is clear that the o1 technical direction not only enhances the model's complex logical abilities but may also trigger significant innovations in large model development, which is why I consider o1 so important.
Inference on the Complete Training Process of OpenAI o1
LLM models like GPT-4 are generally trained in two stages: "pre-training" and "post-training" (refer to the diagram above). The "pre-training" stage involves absorbing fundamental abilities such as language, world knowledge, logical reasoning, and coding through Next Token Prediction using large-scale datasets. The larger the model and the more training data it uses, the more capable the model becomes. What we usually refer to as the scaling law applies to this phase, representing the model's expansion characteristics. This stage is also the most computationally intensive part of LLM training. The "post-training" stage consists of SFT (Supervised Fine-Tuning), RM (Reward Modeling), and PPO (Proximal Policy Optimization), collectively known as Reinforcement Learning from Human Feedback (RLHF). The primary objectives of this stage are twofold: enabling the LLM to follow instructions for various tasks and ensuring content safety by preventing the model from generating inappropriate or impolite content. Once trained, the model performs inference by directly generating tokens one by one in response to user queries.
The entire training and inference process of OpenAI o1 should differ significantly from typical LLMs like GPT-4. First, the "pre-training" phase seems to involve retraining rather than continuing pre-training on GPT-4o. There is strong evidence for this: OpenAI has repeatedly emphasized that o1 mini has strong logical reasoning abilities but weak world knowledge. If it were a modification of another model, its world knowledge wouldn't be weaker than GPT-4o mini, which indirectly suggests that it was retrained from scratch. Additionally, this implies that models like o1, which emphasize logical reasoning, must have drastically increased the proportion of logic-related training data (e.g., STEM data, code, academic papers) during the pre-training phase. I even suspect that o1 mini might not have used general data; otherwise, there wouldn't be such a strong emphasis on its weak knowledge capabilities.
In the "post-training" phase, there should be a step aimed at enhancing the model's instruction-following capabilities, which means the RLHF stage is likely present. This is because o1 is not weak in following instructions, and the generated Hidden COT segments also contain a lot of instructional content. Weak instruction-following capabilities would likely have a negative impact on generating Hidden COT. Therefore, it can be inferred that this step probably occurs before the "thinking" stage (though the RLHF stage might not involve both RM and PPO). However, there are two important differences in this stage compared to GPT-4's RLHF phase: First, o1 likely did not address content safety at this stage, probably postponing it to a later stage (though it might have been handled in both stages). Second, this stage probably also significantly increases the proportion of instruction-following data related to logical reasoning, further enhancing the base model's logical reasoning ability, which we will explain in detail later.
The next stage is the most distinctive feature of o1: the so-called introduction of "System 2" slow-thinking abilities. ClosedAI mentioned only the use of reinforcement learning and nothing else, keeping technical confidentiality at the highest level. Hence, we can only infer that o1 integrates LLM and RL to give the model a "think before speaking" ability.
OpenAI o1 likely postponed the incorporation of "content safety" capabilities until after the "Think" phase, adopting a very different approach from GPT-4. After the Think phase, which further enhances its logical reasoning abilities, it does not use safety-related instruction data to adjust model parameters. Instead, several predefined safety rules, such as "do not output details on how to manufacture harmful products like drugs," are embedded in the system prompt. With its enhanced logical abilities, o1 can generally refer to the "safety rulebook" during output to avoid generating harmful content. This approach is similar to Anthropic's "AI Constitution" strategy (referenced in the diagram above). Additionally, o1's content safety abilities are much stronger than GPT-4o, indicating a significant future paradigm shift in large model safety: greatly enhancing the model's logical reasoning abilities first, followed by an "AI Constitution" or "AI Safety Manual" approach. Clearly, this would simplify content safety because the LLM is treated like a human; if you tell it what it can and cannot do, it can fully understand once its logical abilities are strong.
The above describes o1's training process. During the inference phase, o1 demonstrates a "think before speaking" approach, which consists of three stages: First, it thinks and generates Hidden COT data based on the user's prompt to reflect the thought process. Since many Hidden COTs are lengthy, a "COT summarization" stage is introduced to extract key thinking steps from the long Hidden COT and present them to the user. Finally, the model outputs an answer based on the COT.
From the above, it is evident that both the training and inference phases of o1 differ significantly from traditional LLMs. Additionally, I would like to elaborate on two aspects here.
First, one way to mimic o1 and achieve a similar effect is to take a shortcut: neither specifically enhancing the base model's logical reasoning ability (e.g., by significantly increasing the proportion of logic-related data in pre-training) nor performing RL training in the "slow-thinking" phase (since we don't know how to do this). Instead, focus on adding a "Think" process during the model's inference phase, for example by introducing the simplest tree expansion strategy, Best-of-N Sampling, and crafting prompts that encourage the LLM to think and reflect. Combining these methods, the model can generate its own Hidden COT. This approach can improve the model's reasoning performance to some extent. However, the ceiling for improvement with this method is relatively low: the model's logical reasoning ability may seem to improve at first, but it soon hits a limit, and even increasing computational power during the inference phase (e.g., expanding the number of samples N from 10 to 50) will have little further effect. This is essentially what the so-called "inference-time scaling law" might refer to. Would you consider this approach a law or not?
This conclusion comes from the papers "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" and "Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems." These studies demonstrate that for simple or medium-difficulty logical reasoning problems, increasing computational power during inference time, such as using tree search methods, has a more significant effect than enhancing the model's logical reasoning ability during the "pre-training" phase. However, for complex logical reasoning problems, relying solely on inference-time increases is challenging and can even have negative effects. In such cases, it is better to strengthen the model's logical reasoning abilities during the "pre-training" phase (refer to the diagram above).
Why is this the case? If you think about it, the reasoning behind this is quite straightforward. For simple or medium-difficulty problems, the model is likely to get most of the steps in the answer correct during inference (or in the majority of multiple samples). Only a few steps might be wrong, leading to an incorrect final answer. By using a simple tree search method like Best-of-N Sampling to increase output diversity, and adding a reliable verifier to filter out mistakes, it's relatively easy to correct these small errors. However, for high-difficulty logical problems, most of the steps in the model's output are likely wrong (or in the majority of multiple samples), making it futile to rely on increasing computational power during inference-time.
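A minimal sketch of this Best-of-N-plus-verifier idea is shown below. `sample_answer` and `verify` are placeholders for an LLM sampling call and a domain-specific checker (e.g., a unit test or math checker), not any particular API.

```python
def best_of_n(question: str, sample_answer, verify, n: int = 10) -> str | None:
    """Sample N candidate solutions and return the first one that passes
    an external verifier.

    For easy/medium problems most samples are mostly correct, so a modest N
    often suffices; for hard problems nearly all samples fail, and raising
    N yields diminishing returns -- the point made in the text above."""
    candidates = [sample_answer(question, temperature=0.8) for _ in range(n)]
    for cand in candidates:
        if verify(question, cand):
            return cand
    return None  # no verified answer found within the sampling budget
```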
This line of thinking led me to further infer the potential training process of o1: the base model of OpenAI's o1, whether in the pre-training or post-training phase, most likely greatly enhances the model's complex logical reasoning capabilities. This enhancement serves as the foundation for using increased computational power during inference-time to solve complex problems.
Therefore, the conclusion on this point should be as follows: relying solely on increasing computational power during inference-time is only effective for simple and medium-difficulty logical problems. To continuously improve the model's complex reasoning abilities, it is necessary to invest efforts during the pre-training and post-training phases.
At this point, some might ask: "I don't have the money to train a base model myself! What can I do?" This is a problem most people face. In fact, adopting existing methods should be feasible, but you need to choose base models with strong logical reasoning abilities. I estimate that code-related base models might be relatively suitable. Then, you can try to do some work in the "Think" training and "Think" inference stages. This approach seems possible and doesn't require computational power beyond what most people can handle.
The second aspect to discuss is actually related to the first. I see many people, after learning about o1, claim that the scaling paradigm has changed: as long as you scale the computational power during inference-time, the model's reasoning performance will continue to scale. It is evident that this is a misconception. As explained earlier, if you only focus on expanding computational power during inference-time, the model's performance ceiling will not be very high. Ultimately, it is still necessary to expand the model's complex logical reasoning capabilities during the pre-training or post-training phases. At the very least, these phases are complementary and mutually reinforcing. Focusing solely on inference-time scaling is most likely incorrect.
o1 is Composed of Multiple Models
From the o1 System Card, it is clear that o1 consists of at least two models: one main model and a relatively independent "Hidden COT summarization model" (refer to the diagram above). The purpose of this summarization model is to provide a concise and content-safe summary of the Hidden COT based on the user's input and the generated Hidden COT. Therefore, o1 is composed of at least two models.
The question then arises: are there other models besides the main model and the summarization model? I think there is a high probability that there are additional models.
We can analyze this from the pricing of o1. Currently, it is known that the input cost of o1 Preview is three times that of GPT-4o, and the output cost is four times as high. The input and output costs of o1 mini are both 20 times those of GPT-4o mini (refer to the diagram above).
Here, let me explain why the input and output costs of large models differ. This is because, during the inference stage (as opposed to training) of large models, there are two phases: Prefill and Decoding (refer to the diagram above). In the Prefill phase, the user's input prompt is processed using parallel computation to generate the Self-Attention Key-Values for each token, which are then stored in the KV Cache for use in the Decoding phase to compute Self-Attention when generating each token. In this phase, the Key-Values for each token can be calculated in parallel, allowing the model to output multiple tokens' Key-Values in one run, thus achieving high GPU utilization. During the Decoding phase, the model generates the subsequent content based on the user's prompt, but the model can only produce one token per run. This limitation prevents the effective utilization of GPU parallel computing, leading to insufficient resource use. This difference in resource utilization is why the cost of output is generally three to four times higher than the input cost in large models.
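The asymmetry can be sketched as follows. `model_forward` is a placeholder for a transformer forward pass that accepts an optional KV Cache; the only point illustrated is that prefill processes all prompt tokens in one batched pass, while decoding must run once per generated token.

```python
import torch

def prefill(model_forward, prompt_ids: torch.Tensor):
    """Prefill: one parallel pass over the entire prompt.
    All prompt tokens' Keys/Values are computed together -> high GPU utilization."""
    logits, kv_cache = model_forward(prompt_ids, kv_cache=None)
    return logits[:, -1], kv_cache          # logits for the last prompt position

def decode(model_forward, first_logits, kv_cache, max_new_tokens: int):
    """Decoding: one token per forward pass, reusing the KV Cache.
    Each step feeds only the newly generated token, so the GPU is underutilized."""
    next_id = first_logits.argmax(dim=-1, keepdim=True)
    out = [next_id]
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model_forward(next_id, kv_cache=kv_cache)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out.append(next_id)
    return torch.cat(out, dim=-1)
```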
Returning to the core pricing question: why is the input cost of o1 mini 20 times that of GPT-4o mini, given that input processing in large models is relatively cheap (it only builds the KV Cache for the prompt)? This is quite puzzling and likely holds many clues about o1's internal mechanisms. The input cost corresponds to the Prefill phase, which, in principle, only processes the user's input prompt and the system prompt. There are only two possible reasons for the Prefill phase being 20 times more expensive:
One possibility is that the input length of the user's prompt plus the system prompt is 20 times that of GPT-4o mini's input. The user's input prompt is the same for both o1 and GPT-4o, so the only way to increase the input this much is for OpenAI to have inserted a lot into the system prompt. However, considering that o1 Preview is only three times more expensive than GPT-4o, the likelihood of an extremely lengthy system prompt is low. Otherwise, the input cost of o1 Preview should also be 20 times higher than GPT-4o. Furthermore, if the cost increase were solely due to the system prompt, the cost of o1 during the Decoding phase should not be so high, as the system prompt primarily affects the computational cost during the Prefill phase. (Of course, if the KV Cache is longer, the computational load during Decoding for Self-Attention calculations would also increase because the model would need to reference a longer context, but with current large models' support for long texts, the cost wouldn't be significantly higher.)
So, regarding this first possibility, the conclusion is that OpenAI might indeed have added content to the system prompt (likely the "safety rulebook" mentioned earlier), but an extra-long system prompt alone cannot account for a 20-fold price difference.
Let's consider the second possibility. If we assume there is not much difference in input prompt length, then from the perspective of Prefill computation, the only way to explain the gap is that the total parameter count behind o1 mini is about 20 times that of GPT-4o mini (even after accounting for the summarization model, a gap of more than tenfold remains). Here we encounter two sub-cases. One is that o1 mini consists of a single model whose parameter count is roughly 20 times that of GPT-4o mini. This is clearly not the case, as various tests show that o1 mini runs quite fast, so a single model cannot be that large.
The only remaining explanation is that, besides the main model and the summarization model, o1 mini runs roughly 18 other models of similar size. Given that the input cost of o1 Preview is only three times that of GPT-4o, o1 Preview can likewise be understood as having, in addition to the main model and the summarization model, one more model of similar size. Since the overall working mechanism of o1 Preview should be similar to that of o1 mini, it can be inferred that the extra model in o1 Preview and the 18 extra models in o1 mini are probably doing the same type of work. (Why don't these 18 models share a KV Cache? If they could, then at least during the Prefill phase the 18 models would collapse into one, and the input cost of o1 mini would only need to be about three times that of GPT-4o mini. The fact that it is 20 times more expensive indirectly suggests that these models differ in parameters or configuration, preventing them from sharing a KV Cache.) The number of such models deployed appears to be flexible: you could configure 1, 5, or 18 of them. The fact that o1 mini performs better than o1 Preview in many scenarios may be related to how many of these models are deployed. Therefore, I speculate that these models are most likely related to tree search.
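As a sanity check on this arithmetic, here is a back-of-the-envelope sketch. Every replica count below is the article's inference rather than a known fact, and each replica's per-prompt Prefill cost is assumed comparable to that of the corresponding baseline model.

```python
# Back-of-the-envelope check of the article's inference (all numbers assumed).
# If Prefill cost scales with the number of replicas that each build their own
# KV Cache over the same prompt:
gpt_baseline_replicas = 1                  # GPT-4o mini / GPT-4o: a single model
o1_mini_replicas = 1 + 1 + 18              # main model + summarizer + ~18 search models
o1_preview_replicas = 1 + 1 + 1            # main model + summarizer + 1 search model

print(o1_mini_replicas / gpt_baseline_replicas)     # 20 -> matches the 20x input price
print(o1_preview_replicas / gpt_baseline_replicas)  # 3  -> matches the ~3x input price
```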
In summary, the o1 model is probably composed of three parts (refer to the diagram): a main model, a summarization model, and a flexible pool of models related to tree search. If we intend to reverse engineer o1 and propose a technical solution, this solution would need to meet these constraints, include these models, and explain the operational mechanism of this model pool.
Training Data That OpenAI O1 Might Have Used
Manually Annotated Data
First, the training of o1 would certainly involve manually annotating a batch of COT (Chain of Thought) thinking processes. This means obtaining a set of `<question, answer>` data and having humans write down the problem-solving thought process and steps, resulting in `<question, thought process (including mistakes made and corrected during the thinking process), answer>`. Without manual annotation, the presence of phrases like "Hmm, wait, ..." in the COT would imply that the LLM generated them independently, suggesting a form of consciousness, which is very unlikely. Most likely, these expressions originally come from manually annotated data. This data can be used to fine-tune the initial o1 model with Supervised Fine-Tuning (SFT) to familiarize it with this mode of expression. However, SFT alone would not be sufficient.
Synthetic Data
Manual annotation is difficult and expensive, so the quantity of manually annotated COT data will be limited. Manual annotation suffers from poor scalability, though it does offer high-quality data. To go beyond this, a synthetic data approach can be employed. One straightforward method is similar to the approach for creating PRM (Process Reward Model) training data described later: take a prefix of a manually annotated COT and then use MCTS (Monte Carlo Tree Search) to complete the subsequent reasoning process. Each prefix is rolled out multiple times, with some rollouts leading to the correct answer and others not. Regardless of whether the final answer is correct or incorrect, both can be used as synthetic data to train the o1 model. A more aggressive approach would be to use trial-and-error search directly to find correct answers to logical problems with a known standard answer, and use the correct and incorrect solutions found along the way to train the o1 model (though such a system would essentially already be o1 itself, so this possibility is less likely).
Reverse Generation of Code COT Data
There is a way to greatly expand the COT data for code: we have a large number of existing codebases that can teach the large model to attempt to reverse-generate the reasoning steps of the Hidden COT from the code. This approach seems feasible and could significantly expand COT data for coding tasks (this idea is inspired by "Planning in Natural Language Improves LLM Search for Code Generation").
Reverse Generation of Mathematical COT
Powerful mathematical problem-solving systems like AlphaProof generally follow the strategy of first translating natural language mathematical problems into formal language descriptions using a model, then using a system like Lean, combined with an AlphaZero-like approach, to iteratively search and verify intermediate reasoning steps through reinforcement learning and tree search. This method has proven effective and can currently achieve the level of an IMO (International Mathematical Olympiad) silver medalist (refer to the blue section in the diagram). However, the limitation of such systems, which require translating problems into formal language before solving them, is their poor generalizability; they are mostly restricted to solving mathematical problems and are hard to extend to other fields.
Inspired by the reverse generation of code, I believe we can also reverse-generate mathematical COT (refer to the red section in the diagram). Since AlphaProof can build a neural network that translates natural language problems into formal mathematical language, it is also possible to construct a reverse-generation model that translates formal mathematical language into natural language. Using this translation system, we can convert AlphaProof's problem-solving reasoning process from formal language into natural language COT. Mistakes made along the way can also be utilized because AlphaProof has a clear validation system. Even if a step is wrong, the system knows why it is wrong, and this can also be translated into natural language. In this way, it is possible to generate tens or even hundreds of millions of pieces of data representing the mathematical reasoning COT thought process. I believe this idea is generally feasible.
Currently, OpenAI o1 performs best in mathematics and coding, indicating that these areas have the most training data. While I am not sure whether they adopt a reverse-generation approach to automatically construct COT data, this method appears to be quite feasible.
Reverse-o1: Key Elements of RL and How to Integrate RL with LLM
Let's start deducing how o1 might integrate RL (Reinforcement Learning) with LLM and call the inferred model "Reverse-o1."
We will first analyze the key elements of RL in the Hidden COT (Chain of Thought) scenario: State Space, Action Space, and Reward Model. As for the RL method, I speculate that it is highly likely to adopt an approach similar to AlphaGo/AlphaZero, for several reasons:
First, it is said that OpenAI employees read Sutton's "Bitter Lesson" several times a day. This work mentions that "general methods that leverage computation, such as search and learning, will ultimately succeed." The "search" here mainly refers to DeepMind AlphaGo's MCTS (Monte Carlo Tree Search) method. Given this background, it would be hard to imagine OpenAI employees not applying search methods in their practice.
Second, in the recent video interview featuring key members of the o1 team released by OpenAI, one of the team members mentioned that they have always been exploring how to integrate AlphaGo's search methods with LLMs, which serves as further evidence.
Therefore, we will briefly introduce the working principles of AlphaZero and attempt to integrate it with LLM to construct a complex logical reasoning system.
The State Space of RL in o1: A Continuous State Space Composed of Token Sequences
Regarding the state space of RL in o1, the first question is: Is this state space discrete or continuous? Most likely, it is a continuous state space, or at least it is best to consider it as such. O1 combines LLM and RL, and when a user inputs a question, it is natural to view the sequence of tokens that form the question as a whole, which can be seen as the first state (State1). The token sequence of State1 serves as the input to the o1 model. In the action space, o1 selects a certain action (we will discuss how the action space is defined later). Regardless of what this action is, after choosing it, o1 will output a segment of the token sequence (it will not be the complete Hidden COT, but rather a segment of it). Then, o1 appends the newly generated Hidden COT segment to State1 to form State2, which then becomes the new input for o1. Based on this new input, o1 selects a new action and outputs a new token sequence segment. This process repeats until the Hidden COT output is completed. This is the general idea of the process.
The state space of RL in o1 is unlikely to consist of discrete states, because it is difficult to cleanly divide it into a small number of specific states. One could argue that, in an extreme case, each token forms a discrete state in the state space. However, this is practically infeasible. If each token represents a state \( S \), the combination space becomes exceedingly large: assuming a token vocabulary size of 100,000, the combination space for 2 tokens is already \( 100{,}000^2 \), and for a token sequence of length \( n \) the state space would be \( 100{,}000^n \), an astronomical number.
Furthermore, for o1's RL, this would require choosing an action \( A \) after every token of input, with each newly generated token representing a transition to another state \( S' \). If the RL process involves search, that would mean performing a search for every single token. From the many examples of o1 available online, we can see that the Hidden COT can be very long, sometimes reaching tens or even hundreds of kilobytes. The computational load in that case would be practically unacceptable. Therefore, viewing each token as a discrete state is not impossible, but the granularity is too fine to be realistically applicable.
I believe it is more appropriate to consider o1's state space as a continuous state space composed of token sequences. Although the example above mentioned State1 and State2, which seem like discrete states, this was merely for the sake of explanation (of course, if we view State1 as a point sampled in the vast token combination space, that would be fine). This is similar to how RL (Reinforcement Learning) plays video games or Go. The input game (or Go) screen in RL consists of, say, 1024×1024 different pixels (where different pixels can be compared to the various tokens in an LLM). Due to the enormous combination space of pixels, it is hard to define what constitutes each discrete state. Therefore, in RL for gaming or Go, the input image is typically treated as a whole, seen as a continuous state space, which is then mapped to a specific action through a neural network. The state space in o1 is similar to this (refer to the diagram). One can compare a token segment to a game image input in RL, treating it as a continuous state space composed of token sequences. This is then mapped to an action in the action space through o1's LLM+RL neural network.
From the above analysis, we can see that the RL techniques used in gaming or Go mostly take a continuous state space as the network input, while the output is usually a specific action in a discrete action space. These RL techniques are therefore well-suited to serve as part of o1's RL solution. In contrast, RL methods built around explicitly enumerated discrete state spaces (e.g., tabular MDP-style methods) are less suitable.
The Possible Action Space in o1's RL: A Discrete Action Space of "Thought-Factors"
The most crucial aspect of O1's RL (Reinforcement Learning) technique is likely how the action space is defined. The process of generating the Hidden COT (Chain of Thought) in OpenAI O1 essentially involves making the machine imitate the thinking process humans use when solving complex problems. When humans think through complex problems, they employ relatively fixed and not-too-numerous "thinking patterns," or what we might call "thought-factors." For instance, when faced with a complex problem, we generally start by clarifying the problem's objective. Then, we break the complex problem into a series of steps or stages. To find the solution for a particular step, we might put forward a hypothesis and then verify whether it holds. If it does not, we continue to make new hypotheses until we solve that subproblem. During the process, we may also perform checks and find mistakes in some intermediate steps, which we then correct.
If we closely examine the examples of Hidden COT provided on OpenAI's official website, we can identify some typical "thought-factors" that represent the implicit thinking process of humans when solving problems (refer to the diagram, where I have listed some specific examples). If we were to treat the Hidden COT as a sequence constructed token by token, it would be very challenging to implement RL (since the state space of Hidden COT is already continuous and non-discrete, and if the action space is also non-discrete or too large in combination, RL becomes difficult to model). Therefore, in my opinion, one reasonable method is to identify the implicit "thought-factors" that humans use to think through complex problems and use these as the candidate action set. Examples of such thought-factors might include "Decompose Problem," "Restate Objective," "Verify Result," "Correct Error," "Propose Hypothesis," and so on. The total number of these thought-factors should not be very large; even if subdivided in detail, there would likely be only a few dozen to a hundred at most.
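As a concrete, and entirely hypothetical, rendering of such a candidate action set, the factors could be enumerated like this (the names and the set itself are illustrative; OpenAI has published no such list):

```python
from enum import Enum, auto

class ThoughtFactor(Enum):
    """Hypothetical discrete action space of 'thought-factors'."""
    DECOMPOSE_PROBLEM = auto()   # break the task into sub-steps
    RESTATE_OBJECTIVE = auto()   # rephrase what is actually being asked
    PROPOSE_HYPOTHESIS = auto()  # "Alternatively, perhaps ..."
    VERIFY_RESULT = auto()       # check an intermediate conclusion
    CORRECT_ERROR = auto()       # notice and repair a mistake
    SUMMARIZE_PROGRESS = auto()  # consolidate what has been established so far
```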
For each specific "thought-factor," the model could generate token segments that match the corresponding probability distribution. For example, if the action is the "Propose Hypothesis" factor, the likelihood of generating the token "Alternatively" would be relatively high (as learned through PPO—Proximal Policy Optimization—from training data). Thus, the true form of the Hidden COT segment might look something like this:
<ACT_Proposer-Start> Alternatively, perhaps combine the numbers in some way.
<ACT_Proposer-End> (Propose Hypothesis)
<ACT_RephraseTarget-Start> Overall Task: Write a bash script that takes one argument (the string representing the matrix) and outputs its transpose in the same format.
<ACT_RephraseTarget-End> (Restate Objective)
In other words, the original content or training data of OpenAI's Hidden COT might take the form of a two-level structure like this:
<Think_Start> (Start of Hidden COT)
……
<ACT-1_Start> token token token….. <ACT-1_End> (Thought-Factor 1)
<ACT-2_Start> token token token….. <ACT-2_End> (Thought-Factor 2)
<ACT-3_Start> token token token….. <ACT-3_End> (Thought-Factor 3)
……
<ACT-n_Start> token token token….. <ACT-n_End> (Thought-Factor n)
<Think_End> (End of Hidden COT)
This hierarchical Hidden COT structure embodies the combined advantages of RL and LLM. In a discrete action space, estimating which action to take given a state \( S \) (i.e., estimating the function \( Q(S, A) \)) is something RL excels at. Meanwhile, generating tokens within the "thought-factor" tags is where LLMs shine: LLMs can learn to adjust the token generation probabilities inside the factor tags based on the type of "thought-factor." The diagram above illustrates how o1 might operate using this two-level "thought-factor" discrete action space.
During the generation of the Hidden COT, both input and output are accompanied by the start and end symbols of the ACT action tokens. First, o1 predicts the next most likely “thought-factor” based on the current problem and the already generated Hidden COT segment, determining the specific thinking mode to adopt. Then, guided by this “thought-factor,” the LLM generates the specific token sequence, using the end token of the “thought-factor” as a marker to indicate the end of this mode of thinking. The token sequence output in this step is incorporated into the input, and the process repeats to generate the next action and corresponding token sequence. (Of course, this entire process is my conjecture and lacks concrete evidence.)
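A minimal sketch of this conjectured loop is given below. `policy_head` (choosing the next thought-factor or a stop signal) and `generate_segment` (producing tokens up to the factor's end tag) are placeholders, and the whole mechanism is speculative.

```python
def generate_hidden_cot(question: str, policy_head, generate_segment,
                        max_factors: int = 32) -> str:
    """Alternate between (1) picking the next 'thought-factor' action from the
    current state and (2) letting the LLM write the token segment for it,
    until a stop action is chosen. Purely a conjectured mechanism."""
    state = question
    cot = "<Think_Start>"
    for _ in range(max_factors):
        factor = policy_head(state)                # discrete action: which thinking mode
        if factor == "THINK_END":
            break
        segment = generate_segment(state, factor)  # tokens up to <ACT_..._End>
        cot += f"\n<ACT_{factor}-Start> {segment} <ACT_{factor}-End>"
        state = question + "\n" + cot              # fold the new segment into the state
    return cot + "\n<Think_End>"
```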
You might ask, “Why can’t I see the start and end tokens corresponding to the ‘thought-factors’ in the provided Hidden COT examples?” It is possible that the COT displayed to users is a filtered version. Consider this: the start and end tokens of the Hidden COT (<Think_Start>/<Think_End>) are highly likely to exist, yet you don’t see them, do you? This suggests that the output is a filtered version of the COT. Therefore, the original version may have contained “thought-factor” tags, but they were filtered out before being displayed. This is certainly a possibility.
The Reward Model in o1's RL
How the reward is set is crucial for RL (Reinforcement Learning). There have been quite a few academic studies on LLM+RL, and if we summarize them, there are currently two commonly used reward models (refer to the diagram): the Output Reward Model (ORM) and the Process Reward Model (PRM).
ORM involves training a model that only scores the final result, regardless of how many steps the derivation process takes. In the context of the Hidden COT, this means that the reward signal is provided only after o1 completes the entire Hidden COT. If the model's result matches the standard answer, it receives a reward of 1; if the answer is incorrect, it receives a reward of -1, or something similar. The advantage of ORM is that the feedback signal is accurate; for example, in a math problem, the model either solves it correctly or incorrectly, which is clear-cut, so the feedback signal is precise. However, the disadvantage of ORM is sparse feedback, meaning that feedback signals are few and far between. This is quite intuitive: even if the derivation process takes up 10 pages, there is only one feedback signal at the end. (The RM model used in the RLHF phase during OpenAI's large model training belongs to ORM.)
PRM, on the other hand, involves training a model that provides feedback signals at every intermediate step. This way, mistakes in the derivation process can be identified at each step without waiting for the end. Thus, its characteristic is rich, non-sparse feedback. However, the problem is that training a PRM requires annotated data for every step. Where does such extensive annotation data come from? The conventional method is manual annotation. For example, last year's well-known OpenAI PRM study, "Let’s Verify Step by Step," involved manually annotating 800,000 intermediate steps in the derivation process of math problems, and it was proven that PRM performs better than ORM. Therefore, the advantage of PRM is that it provides abundant feedback and performs well, but the cost of creating training data is extremely high and not feasible for most people.
So, is there a relatively low-cost method for annotating each step? Yes, there is. One of the better methods I've seen (refer to the diagram) works as follows: Assume we have a batch of math problems with complete derivation processes. We start by copying the first step of the solution, then use MCTS (Monte Carlo Tree Search) to continue deriving the next steps. From this point, multiple derivations can be carried out, with some leading to correct answers and others to incorrect ones. In principle, the higher the proportion of correct answers among the multiple derivations starting from this step, the more important the copied step is for reaching the correct answer, and therefore, it can be given a high score. Then, we copy the second step of the solution and process it similarly. This way, we can automatically annotate the quality of each derivation step. We can then use this data to train the PRM model, allowing the PRM to score each reasoning step. However, it is clear that the precision of PRM scoring trained with this method will certainly not match that of ORM.
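A sketch of this automatic step-scoring scheme (in the spirit of rollout-based PRM labeling approaches such as Math-Shepherd) might look like the following; `continue_solution` and `check_answer` are placeholder helpers, not any specific library.

```python
def score_step(prefix_steps: list[str], gold_answer: str,
               continue_solution, check_answer, n_rollouts: int = 16) -> float:
    """Estimate how valuable a partial derivation is by Monte Carlo rollouts:
    complete the solution many times from this prefix and measure how often
    the rollouts reach the known correct answer. The empirical success rate
    becomes the step's PRM training label."""
    wins = 0
    for _ in range(n_rollouts):
        completion = continue_solution(prefix_steps, temperature=1.0)
        if check_answer(completion, gold_answer):
            wins += 1
    return wins / n_rollouts

def annotate_solution(steps: list[str], gold_answer: str,
                      continue_solution, check_answer) -> list[float]:
    """Label every prefix of a worked solution with its estimated quality."""
    return [score_step(steps[:i + 1], gold_answer, continue_solution, check_answer)
            for i in range(len(steps))]
```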
So, which reward model would OpenAI o1 use during training: ORM or PRM? I estimate that both will be used. ORM provides precision, while PRM offers rich feedback, and combining the two should yield better results. Additionally, the o1 official website mentions, "Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process." The term "data-efficient" here likely refers to PRM.
The Basic Principles of AlphaZero
Here, we will first introduce the basic working principles of AlphaZero. The Reverse-o1 solution we will discuss later focuses on how to integrate RL with LLM, and its main framework largely references the core ideas of AlphaZero. Therefore, explaining AlphaZero here will help with understanding the subsequent content.
At the end of 2017, AlphaZero, the general version of AlphaGo for board games, was released. Not only did it dominate Go, but it also overwhelmingly defeated the strongest AI programs, including AlphaGo, in other board games like chess and shogi.
Technically, AlphaZero did not introduce any fundamental improvements over AlphaGo. It still relied on Monte Carlo Tree Search (MCTS) combined with neural networks and an RL training method, but the implementation was much simpler and more elegant (refer to the diagram). There were two main changes. First, AlphaZero merged AlphaGo's two prediction networks (the policy network \( P \) and the value network \( V \)) into a single network with two output heads, \( P \) and \( V \). The policy network \( P \) predicts, for the current state, a probability distribution over the possible actions (i.e., candidate moves), defined as the function \( P(S, a) \). The value network \( V \) evaluates the overall probability of winning the game from the current state \( S \), defined as the function \( V(S) \), a value between 0 and 1; the higher the value, the greater the chance of winning starting from state \( S \). The second change was upgrading the network backbone from a CNN to a ResNet.
AlphaZero abandoned learning from human games entirely. It started from scratch and learned through self-play. In just three days of self-play, it acquired a level of Go experience far beyond what humans had accumulated over millennia.
AlphaZero combined MCTS and RL, with MCTS as the main component and RL used to accelerate and sharpen the search. During self-play (refer to diagram a), the AI player uses MCTS to explore the possible actions (moves) in the current state. After the search, it obtains a probability distribution \( \pi \) over moves (derived from the search's visit counts). The AI player then plays a move according to this distribution, typically the most-visited one. The opposing AI player adopts the same strategy, and they take turns until a winner is determined, which provides the reward signal.
So far, it might seem like we haven't seen the role of the neural network structure. In fact, the network plays a crucial role when MCTS searches for a move. The search space from any given move is enormous, making brute-force search infeasible. The policy network \( P \) and value network \( V \) (now merged into one network in AlphaZero; they are described separately here for clarity) guide the search process, prioritizing paths with a higher probability of winning and pruning those with a lower probability. This dramatically increases search efficiency.
During the search process, the neural network parameters remain fixed. Once the game ends and a winner is determined, training data can be generated for each state \( S \) encountered along the game path. For the policy network \( P \), the learning target is the probability distribution \( \pi \) obtained by MCTS from that state \( S \). For the value network \( V \), the learning target is the final game outcome \( z \), so that states on the eventual winner's side are pushed toward higher values. The neural network parameters are then adjusted based on this training data, improving the model's performance in the next round of play (it is clear from this that AlphaZero's reward model is an ORM). Through repeated self-play, AlphaZero keeps growing stronger.
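For reference, the joint objective AlphaZero optimizes to fit both heads to these self-play targets (as published by DeepMind) is

\[
\ell \;=\; (z - v)^2 \;-\; \pi^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2,
\]

where \( v = V(S) \) is the value prediction, \( z \) the final game outcome, \( \pi \) the MCTS visit distribution used as the policy target, \( \mathbf{p} = P(S, \cdot) \) the policy head's output, and \( c \) an L2 regularization coefficient.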
It is important to realize that, at its core, AlphaZero is essentially still a Monte Carlo Tree Search (MCTS) system. The difficulty in Go lies in the vast search space, which makes pure brute-force search impossible. If we assume an infinitely powerful machine that can rapidly traverse all search spaces, then simply using MCTS alone, without relying on RL, could also achieve perfect gameplay. Through self-play and deep reinforcement learning, AlphaGo Zero primarily improved its ability to evaluate board states (\( V \)) and move quality (\( P \)), prioritizing moves along paths with a higher chance of winning. This allowed it to discard a vast number of inferior paths, significantly reducing the search space. Its self-evolution mainly manifests in increasingly accurate evaluations of the board state (\( V \) and \( P \)), enabling it to quickly identify the move with the highest probability of winning. The reason it was able to generate a large amount of training data through self-play is that Go is a well-defined game. At a certain state, it can result in a win or a loss, although this win or loss might not be apparent with every single move.
Reverse-o1 Model Network Structure After Integrating LLM and RL
One difference between o1 and board games like Go is that, in addition to RL, even the Hidden COT relies on token-by-token output, with the LLM (Large Language Model) still serving as the primary structure. However, RL also needs a network structure to adjust model parameters gradually, allowing it to learn the internal thinking process. Thus, the first problem we face is how to integrate the LLM and RL models to form a complete network structure that combines the functionalities of both LLM and RL.
The diagram above illustrates a structure I envision: the main component is still a Transformer-based LLM model (either Dense or MOE; the mini version should use a Dense structure). When the input is a "question + the already generated part of the Hidden COT" (i.e., the current state \( S \) composed of a continuous token sequence), the GPT network encodes this current state. On top of the LLM output head, two substructures can be differentiated: one for the conventional LLM to predict the next token, which aligns with the usual LLM process; and another for constructing the RL model structure. Here, we refer to AlphaZero's approach, using a single network to produce two outputs. For example, an FFN (Feedforward Neural Network) structure can be used to output the results of the policy network \( P(S, A) \), representing the probability distribution of the next action "thought-factor" in the current state \( S \). The higher the probability of a "thought-factor," the more likely this action will be selected and executed next. At the same time, it outputs the value network \( V(S) \), indicating the probability of the current state leading to the correct final answer. The greater the probability, the higher the quality of the current state \( S \), implying that the overall quality of the Hidden COT generated so far is high.
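A minimal PyTorch sketch of this envisioned structure is given below: a shared transformer trunk plus three heads (next-token LM head, policy head over thought-factors, scalar value head). The `trunk` module, the head sizes, and indeed the whole idea of bolting \( P \)/\( V \) heads onto an LLM are the article's conjecture, not a known o1 design.

```python
import torch
import torch.nn as nn

class ReverseO1Heads(nn.Module):
    """Conjectured LLM+RL head structure: one shared trunk, three outputs."""
    def __init__(self, trunk: nn.Module, hidden: int, vocab: int, n_factors: int):
        super().__init__()
        # `trunk` is assumed to map token ids to hidden states [batch, seq, hidden].
        self.trunk = trunk
        self.lm_head = nn.Linear(hidden, vocab)     # ordinary next-token prediction
        self.policy_head = nn.Sequential(           # P(S, A) over thought-factors
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, n_factors))
        self.value_head = nn.Sequential(            # V(S): prob. of reaching a correct answer
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, input_ids: torch.Tensor):
        h = self.trunk(input_ids)                   # [batch, seq, hidden]
        last = h[:, -1]                             # state summary = last position
        return (self.lm_head(h),                                  # token logits
                torch.softmax(self.policy_head(last), dim=-1),    # action distribution
                self.value_head(last).squeeze(-1))                # scalar value in (0, 1)
```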
At this point, when the Hidden COT is in a certain state \( S \), the network can determine what action should be taken next and the probability of the current state leading to a successful answer. However, one part is still missing: the sequence of tokens to be output in the Hidden COT after the "thought-factor" action is known.
A simple method is to use the LLM's standard language-model head to continue generating the subsequent tokens (when human-annotated training data is available, PPO can be used to increase the output probability of the corresponding tokens). During token generation, the RL output is ignored until the LLM emits `<ACT_i-End>`, at which point the RL output is used to select the next action. This process repeats, allowing the integrated LLM+RL model to generate the Hidden COT.
Previously, we analyzed that o1 would likely use the Process Reward Model (PRM) and that it might be composed of multiple models. Under these two constraints, the above model structure can be modified as follows (refer to the diagram): After determining the next "thought-factor," the main model does not generate the subsequent tokens. To improve the quality of the generated COT, a Best-of-N Sampling approach can be used. Multiple copies of the Reverse-o1 model (with different replicas set to different temperature parameters to increase output diversity) each generate a token sequence. Then, an offline-trained PRM serves as the judge to score the sequences and select the one with the highest score as the output tokens for this "thought-factor." Once the best content is chosen, it can be synchronized with the main model. The main model performs an operation similar to Prefill, synchronizing the output of the best content before proceeding to the next round of output. This method clearly enhances the quality of the generated token sequence.
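Sketching this modified step (the `replicas`, their `generate_segment`/`sync` methods, and `prm_score` are all placeholders for the conjectured components):

```python
def best_segment_for_factor(state: str, factor: str, replicas, prm_score) -> str:
    """Conjectured Best-of-N step: several Reverse-o1 replicas, each at a
    different temperature, draft the token segment for the chosen
    'thought-factor'; an offline-trained PRM acts as judge and the best draft
    wins. The winner is then synced (prefill-style) back into every replica."""
    temperatures = [0.3 + 0.2 * i for i in range(len(replicas))]
    drafts = [rep.generate_segment(state, factor, temperature=t)
              for rep, t in zip(replicas, temperatures)]
    best = max(drafts, key=lambda seg: prm_score(state, seg))
    for rep in replicas:
        rep.sync(state + best)   # prefill the winning segment into each replica's cache
    return best
```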
Reverse-o1 Under MCTS Tree Search
Following AlphaZero, we introduce the main structure, MCTS (Monte Carlo Tree Search). Its operation process is as follows (refer to the diagram): When the user inputs a question, Reverse-o1 uses an MCTS tree to search for each possible "thought-factor." During the search, the policy network \( P \) and value network \( V \) are used to quickly identify the optimal search path, obtaining the probability distribution \( \pi \) for all "thought-factors." The larger the probability value, the higher the likelihood that adopting this line of thinking will lead to the correct answer. The "thought-factor" with the highest probability is then selected as the action for the current state. As described in the previous section, Reverse-o1 generates the corresponding COT (Chain of Thought) token segment for this action. This COT token segment is then merged with the user’s question, forming a new state. This process repeats until an answer is generated. The answer is then compared with the standard answer to determine whether it is correct or incorrect, thereby providing the corresponding output reward.
Following AlphaZero, when searching for the probability that a "thought-factor" from state \( S \) leads to the correct answer, the optimal next state \( S' \) is found using the max(\( Q + U \)) method. The \( Q \) function is positively correlated with the value network \( V(S) \), and the \( U \) function is positively correlated with the policy network \( P(S, A) \). Therefore, the meaning of max(\( Q + U \)) is to use the guidance of the value network and policy network to find high-quality search paths. When a leaf node is reached, node expansion occurs, and the policy and value networks are used to estimate and initialize the relevant search parameters. The \( Q \) functions corresponding to all states along the optimal path are then updated from the bottom up. After a round of searching for each candidate "thought-factor," the probability distribution \( \pi \) of all actions is obtained, completing the search step.
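Concretely, AlphaZero's selection rule (PUCT) picks, at each node, the child action maximizing \( Q(S, a) + U(S, a) \), where

\[
U(S, a) \;=\; c_{\text{puct}} \, P(S, a) \, \frac{\sqrt{\sum_b N(S, b)}}{1 + N(S, a)},
\]

with \( N(S, a) \) the visit count of edge \( (S, a) \), \( Q(S, a) \) the running mean of values backed up through that edge, and \( c_{\text{puct}} \) a constant trading off exploitation against policy-prior-guided exploration. If Reverse-o1 follows the same recipe, each candidate "thought-factor" plays the role of a move \( a \); this mapping is, of course, the article's conjecture.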
The difference between o1's search and that in board games is that, to transition to the next state, o1 also needs to generate the corresponding Hidden COT tokens based on the currently selected action. This step can be completed using the Best-of-N Sampling strategy described earlier.
Starting from the initial question, Hidden COT segments are generated step-by-step until the answer phase is reached, where an output reward is obtained, completing an MCTS search process that passes through the intermediate steps to the answer. If the answer is correct, the reward can be set to 1; if incorrect, it can be set to -1. Based on this, training data for all intermediate states \( S \) encountered along the path to the answer can be constructed to train the policy network \( P \) and the value network \( V \). The learning target for the policy network is the action probability distribution \( \pi \) obtained from the MCTS search for the corresponding state, while the learning target for the value network is the output reward.
Additionally, for each "thought-factor" selected during the search process, the corresponding Hidden COT token sequence obtained through Best-of-N Sampling (which can also be assigned a process reward score by the PRM) can be used to adjust the LLM model parameters using PPO (with the PRM's reward as the PPO's reward). This adjustment increases the generation probability of these tokens when the LLM encounters this "thought-factor" in the future.
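For completeness, such a parameter update would optimize the standard clipped PPO surrogate, with the PRM's score feeding the advantage estimate \( \hat{A}_t \) (that particular pairing is this article's conjecture; the objective itself is standard):

\[
L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\]

where the "actions" \( a_t \) here are the tokens of the chosen COT segment.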
At this point, we can conclude our journey of reverse-engineering o1. The constraints mentioned earlier (e.g., o1 should be composed of multiple models, it should use some form of tree search, both PRM and ORM should be used, etc.) have been largely reflected in the envisioned Reverse-o1 model.
However, there is one more issue that I believe is worth pondering: Are "thought-factors" absolutely necessary? After all, identifying these requires manual efforts to infer human thinking patterns, leaving a considerable trace of human involvement, and potentially increasing the cost of manually annotated data. I have contemplated this issue for several days, and my conclusion seems to be: The overall framework can remain as it is, and it might still work without introducing "thought-factors." However, as this article is already lengthy and the explanation is somewhat complex, I will omit it here for now.




















