How 'Chain of Thought' Makes Transformers Smarter

Large Language Models (LLMs) like GPT-3 and ChatGPT exhibit exceptional capabilities in complex reasoning tasks such as mathematical problem-solving and code generation, far surpassing standard supervised machine learning techniques. The key to unlocking these advanced reasoning abilities lies in the chain of thought (CoT), which refers to the ability of the model to generate intermediate reasoning steps before arriving at the final answer, kind of like how we humans break down a complex problem into smaller steps in our head. This can be achieved through methods like training the model on examples enriched with intermediate reasoning steps or using few-shot prompting to instruct the model to generate a CoT.
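To make the few-shot prompting route concrete, here is a minimal sketch of how a prompt with worked reasoning steps might be assembled. The `WorkedExample` class, the `build_cot_prompt` helper, and the "Q:/A:" layout are illustrative assumptions, not a format taken from the study.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    """Hypothetical container for one demonstration in the prompt."""
    question: str
    steps: list[str]  # intermediate reasoning steps shown to the model
    answer: str


def build_cot_prompt(examples: list[WorkedExample], new_question: str) -> str:
    """Assemble a few-shot prompt whose demonstrations include reasoning steps."""
    blocks = []
    for ex in examples:
        reasoning = " ".join(ex.steps)
        blocks.append(f"Q: {ex.question}\nA: {reasoning} The answer is {ex.answer}.")
    # The final question ends with an open "A:" so the model continues with
    # its own intermediate steps before stating an answer.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


demo = WorkedExample(
    question="A shelf holds 3 boxes with 4 books each. How many books are there?",
    steps=["Each box has 4 books.", "3 boxes times 4 books is 12 books."],
    answer="12",
)
prompt = build_cot_prompt([demo], "A crate holds 5 bags with 6 apples each. How many apples?")
print(prompt)
```

The demonstration shows the model what a step-by-step answer looks like; the open-ended final line invites it to produce the same kind of trace for the new question.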

Now, you might think that the content of these intermediate steps is what allows the model to reason better. But interestingly, in this study, the researchers found that even if the intermediate steps are incorrect or completely random, just the act of generating them still helps the model a lot. It’s like the model is being told “Okay, think this through step-by-step” and that alone improves its reasoning ability drastically.


So the researchers wanted to understand why this “chain of thought” approach is so powerful for transformers (the type of model behind GPT-3 and similar systems). They used concepts from circuit complexity theory, adopting the language of circuit complexity classes such as NC, AC, and TC to analyze this problem.
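For readers unfamiliar with these classes, the standard textbook definitions and the known inclusions between them can be summarized as follows. This is general background from circuit complexity, not a result of the study itself; here n denotes the input length and k a fixed constant.

```latex
\begin{align*}
\mathsf{NC}^k &: \text{polynomial size, depth } O(\log^k n), \text{ AND/OR/NOT gates with fan-in at most 2} \\
\mathsf{AC}^k &: \text{as } \mathsf{NC}^k, \text{ but AND/OR gates may have unbounded fan-in} \\
\mathsf{TC}^k &: \text{as } \mathsf{AC}^k, \text{ plus majority (threshold) gates} \\[4pt]
\mathsf{NC}^0 &\subsetneq \mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{P/poly}
\end{align*}
```

Shallow circuit depth corresponds to highly parallel computation, which is why these classes are a natural yardstick for what a fixed-depth model can do in a single pass.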

Essentially, they found that without the chain of thought, transformers are limited to efficiently performing only parallel computations, meaning they can solve problems that can be broken down into independent sub-tasks that can be computed simultaneously.


However, many complex reasoning tasks require inherently serial computations, where one step follows from the previous step. And this is where the chain of thought helps transformers a lot. By generating step-by-step reasoning, the model can perform many more serial computations than it could without CoT.
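A small sketch can show what "inherently serial" looks like in practice. The task below, composing a sequence of permutations, is an illustrative choice assumed here rather than a setup taken from the article; the point is only that each step needs the result of the previous one, and that writing every running result to a scratchpad turns one long dependency chain into many cheap single updates.

```python
Perm = tuple[int, ...]  # perm[i] is the image of position i


def compose(first: Perm, second: Perm) -> Perm:
    """Return the permutation that applies `first`, then `second`."""
    return tuple(second[first[i]] for i in range(len(first)))


def solve_with_scratchpad(perms: list[Perm]) -> list[Perm]:
    """Record every running result, one small update per step.

    Each entry plays the role of one chain-of-thought step: it depends only
    on the previous entry and the next input permutation.
    """
    state = tuple(range(len(perms[0])))  # start from the identity permutation
    scratchpad = []
    for p in perms:
        state = compose(state, p)
        scratchpad.append(state)
    return scratchpad


perms = [(1, 0, 2), (0, 2, 1), (2, 1, 0)]
trace = solve_with_scratchpad(perms)
for step, state in enumerate(trace, start=1):
    print(f"step {step}: {state}")
print("final answer:", trace[-1])
```

Answering directly would mean producing `trace[-1]` in one shot, with the whole chain of dependencies handled internally; the scratchpad version only ever asks for one composition at a time.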

The researchers proved theoretically that while a basic transformer without CoT can only solve problems up to a certain complexity level, allowing a polynomial number of CoT steps makes transformers powerful enough to solve any problem that polynomial-size Boolean circuits can compute, at least from a theoretical perspective.

To back up their theory, they also did some experiments on different arithmetic tasks – ones that can be parallelized and ones that inherently require sequential computations. Sure enough, they found that transformers struggled on the sequential tasks without CoT, but enabling CoT drastically boosted their performance, especially when the transformer model was relatively small/shallow.

In essence, the chain of thought is a simple but powerful trick that vastly increases the reasoning capabilities of transformer models like GPT-3. It allows them to tackle complex tasks requiring sequential logic that parallel models would fail at.

Check out the Paper. All credit for this research goes to the researchers of this project.

Language models, a subset of artificial intelligence, focus on interpreting and generating human-like text. These models are integral to various applications, ranging from automated chatbots to advanced predictive text and language translation services. The ongoing challenge in this field is enhancing these models’ efficiency and performance, which involves refining their ability to process and understand vast amounts of data while optimizing the computational power required.

A significant challenge in natural language processing is the efficient scalability of language models to handle increasingly complex tasks. This includes improving their speed, accuracy, and ability to interact in a human-like manner without escalating computational costs. Researchers continuously seek methods to refine these models, making them more adept at understanding the context and subtleties of language.

Traditionally, language models undergo extensive pre-training on massive datasets, including everything from literary works to internet text. This training is designed to equip the models with a broad understanding of language and context. The next phase typically involves fine-tuning on more specialized datasets to adapt the model for specific tasks, such as legal document analysis or conversational interfaces.

One pivotal aspect of this research is the Buzz dataset, a meticulously curated collection introduced by Alignment Lab AI in collaboration with Hive Digital Technologies and used to train the new model. This dataset encompasses a variety of text sources and is designed to provide a comprehensive foundation for model training. Notable for its volume and diversity, the Buzz dataset includes over 85 million conversational turns pulled from 435 unique sources. This extensive compilation allows for nuanced training processes that significantly improve the model’s ability to generate contextually relevant and syntactically diverse text.

The new methodology employs an innovative approach to this fine-tuning phase. The research team has developed an iterative fine-tuning process that reuses existing pre-trained models and enhances their performance through strategic modifications. This process involves adjusting the models based on feedback from their performance in specific tasks, effectively allowing the model to ‘learn’ from its outputs.

The essence of this approach lies in its use of iterative cycles of feedback and adjustment, which significantly reduce the need for re-training from scratch. This method utilizes distributions of “grounding” data collected from previous epochs of the model’s training, which guide the adjustment process. Such a strategy conserves computational resources and sharpens the model’s accuracy and efficiency.
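As a rough illustration of the idea, the sketch below mixes each round's new examples with a sample of data replayed from earlier rounds before updating an existing model. Everything here is an assumption made for illustration: the function names, the `grounding_fraction` parameter, the uniform sampling, and the `train_step` callback are hypothetical and are not the team's actual procedure.

```python
import random


def build_round_dataset(new_examples, grounding_pool, grounding_fraction, rng):
    """Mix fresh task data with examples replayed from earlier rounds."""
    n_grounding = min(len(grounding_pool), round(len(new_examples) * grounding_fraction))
    replayed = rng.sample(grounding_pool, n_grounding)
    mixed = list(new_examples) + replayed
    rng.shuffle(mixed)
    return mixed


def iterative_fine_tune(rounds, train_step, grounding_fraction=0.25, seed=0):
    """Run several fine-tuning rounds on one model instead of retraining from scratch.

    `train_step` is a hypothetical callback that updates the existing model
    on one mixed dataset.
    """
    rng = random.Random(seed)
    grounding_pool = []
    for new_examples in rounds:
        mixed = build_round_dataset(new_examples, grounding_pool, grounding_fraction, rng)
        train_step(mixed)
        # Data from this round becomes grounding material for later rounds.
        grounding_pool.extend(new_examples)
    return grounding_pool


seen_batches = []  # stand-in for a real training call: just record what it receives
pool = iterative_fine_tune(
    rounds=[["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]],
    train_step=seen_batches.append,
)
print([len(batch) for batch in seen_batches])
```

In this toy run the first round sees only its own four examples, while the second round sees its four new examples plus one replayed from the first round, which is the sense in which earlier data "grounds" later adjustments.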

The reported results indicate substantial improvements in model efficiency. For instance, the models have been shown to achieve lower error rates in text generation tasks through iterative fine-tuning, and they demonstrate up to a 30% reduction in computational overhead compared to traditional fine-tuning methods. Furthermore, these models maintain robustness in output quality, indicating that the iterative process helps prevent overfitting.

In conclusion, the collaborative efforts between Alignment Lab AI and Hive Digital Technologies advance the development of language models. Their research on iterative fine-tuning introduces a sustainable, cost-effective method that enhances model performance without the extensive use of additional resources. This breakthrough addresses key issues like computational efficiency and model accuracy and sets a new standard for how language models can be developed and improved upon in the future.

Check out the Dataset and HF Page. All credit for this research goes to the researchers of this project.

How 'Chain of Thought' Makes Transformers Smarter - MarkTechPost