AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song1,2*, Yudong Wang2*, Wenwei Zhang2*, Kuikun Liu2, Chengqi Lyu2, Demin Song2, Qipeng Guo2, Hang Yan2, Dahua Lin2,3, Kai Chen2†, Cairong Zhao1†
* Equal Contribution, † Corresponding Author
1 Tongji University 2 Shanghai AI Laboratory 3 Chinese University of Hong Kong

Abstract


Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on a single dataset, which may insufficiently elicit the potential of pre-trained Code LLMs. This paper presents AlchemistCoder, a series of Code LLMs with better code generation and generalization abilities fine-tuned on multi-source data. To harmonize the inherent conflicts among the various styles and qualities in multi-source data, we introduce data-specific prompts, termed AlchemistPrompts, inspired by hindsight relabeling, to improve the consistency between instructions and responses. We further propose to incorporate the data evolution process itself into the fine-tuning data to enhance the code comprehension capabilities of LLMs, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

Figure 1. Performance scatter plot (top right is better) of open-source models on mainstream code benchmarks, HumanEval and MBPP.

Introduction


"Alchemist: Someone Who Transforms Things for the Better." —— Merriam Webster

The training of Code LLMs mainly proceeds through pre-training and fine-tuning stages. Pioneering works have amassed extensive code data for pre-training, while recent open-source models highlight the effectiveness of high-quality or targeted code fine-tuning datasets. Despite these advancements, current fine-tuning methods mainly rely on a single kind of code-related question-answering dataset, unlike the pre-training stage, which integrates code-related corpora from various sources. Such a discrepancy indicates that the fine-tuning data may not be diverse enough to fully stimulate the capabilities of base models, resulting in limited performance, generalization, and robustness.

To tackle these challenges, we first explore integrating data from multiple sources and find that direct mixing (e.g., the DirectlyMix-L-7B model in Fig. 1) does not produce the desired effect due to the inherent conflicts of multi-source data. We therefore propose to adopt hindsight relabeling for multi-source data mixing, designing data-specific prompts that harmonize the inherent conflicts between data sources so that they can be used together to elicit the performance of base models more fully. We term these prompts AlchemistPrompts, inspired by the power and definition of Alchemists. Beyond conventional problem-solution data, we argue that the evolution of code data reflects higher-level capabilities and is also valuable for the learning of Code LLMs. We thus decompose the data evolution process into three tasks incorporated into training, namely instruction evolution, data filtering, and code review, enabling further improvements in code comprehension capabilities.
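To make the idea concrete, the sketch below shows one way hindsight relabeling could be applied to a single instruction-response pair: the response is inspected after the fact, and GPT-4 is asked to produce a short data-specific prompt that closes the gap between what was asked and what was answered. The template wording, the helper name generate_alchemist_prompt, and the use of the OpenAI chat API are illustrative assumptions; the paper only specifies that GPT-4 acts as the Alchemist.

```python
# A minimal sketch of hindsight relabeling for one instruction-response pair.
# The relabeling template and helper name are assumptions for illustration;
# the paper only states that GPT-4 serves as the "Alchemist".
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RELABEL_TEMPLATE = """You are an Alchemist that harmonizes code instruction data.
Given an instruction and the response it was paired with, write one short,
data-specific prompt that makes the instruction consistent with the response
(e.g., its programming language, algorithm, or style). Return only the prompt.

Instruction:
{instruction}

Response:
{response}
"""

def generate_alchemist_prompt(instruction: str, response: str) -> str:
    """Ask GPT-4 to retrospectively describe what the response actually answers."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": RELABEL_TEMPLATE.format(instruction=instruction,
                                                      response=response)}],
        temperature=0.2,
    )
    return completion.choices[0].message.content.strip()
```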

We conduct extensive experiments with various base models and develop the instruction-tuned AlchemistCoder series. As shown in Fig. 1, on two mainstream code benchmarks, HumanEval and MBPP, AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), demonstrating harmonized and formidable code capabilities. More surprisingly, AlchemistPrompts also allow the code corpus to significantly improve the general capabilities of Code LLMs, as demonstrated by the improvements on MMLU, BBH, and GSM8K.

AlchemistCoder


Figure 2. Overview for developing AlchemistCoder series.

Multi-source Data Construction: To fully harness the capabilities of Code LLMs, we gather fine-tuning data from multiple sources and refine the complexity of instructions through instruction evolution. Yet integrating data from diverse sources for instruction tuning presents challenges: different developers and LLMs offer varied solutions to similar coding questions, leading to diverse response styles and languages. Simply combining data from these sources forces models to learn disparate responses, hindering alignment and performance. Directly mixing multi-source data is therefore not a promising solution and can even be detrimental.
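For reference, the snippet below sketches what "directly mixing" amounts to in practice: samples from heterogeneous sources are normalized into a shared instruction-response schema and shuffled together, with nothing reconciling their styles. The source file names and field mappings are hypothetical; this corresponds to the naive DirectlyMix baseline in Fig. 1 that the rest of the pipeline improves on.

```python
# A minimal sketch of naive multi-source mixing (the DirectlyMix baseline in
# Fig. 1). Source names and field mappings are hypothetical; the point is that
# nothing reconciles the differing styles of the sources.
import json
import random

# Hypothetical source files, each with its own field names.
SOURCES = {
    "open_source_qa.jsonl": ("question", "answer"),
    "gpt_distilled.jsonl": ("instruction", "output"),
    "evolved_instructions.jsonl": ("prompt", "response"),
}

def load_and_mix(sources=SOURCES, seed=0):
    """Normalize every source to {"instruction", "response"} and shuffle."""
    mixed = []
    for path, (inst_key, resp_key) in sources.items():
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                mixed.append({
                    "instruction": record[inst_key],
                    "response": record[resp_key],
                })
    random.Random(seed).shuffle(mixed)
    return mixed
```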

AlchemistPrompt: To enhance model learning from diverse data, we introduce AlchemistPrompts, tailored meta-prompts to reconcile data conflicts. Inspired by hindsight relabeling, we employ GPT-4 as an Alchemist to generate these prompts, adjusting instructions to match data specifics. For example, if a task involves Python code with a Bellman-Ford algorithm, the prompt might request Python code utilizing dynamic programming. AlchemistPrompt adjustments are minimal yet effective, with optimal performance achieved by incorporating them into just 5% of samples. This approach balances diversity and domain gap, elevating data quality. By retrospectively analyzing responses and reinterpreting them as alternative goals, AlchemistPrompts refine model comprehension and instruction-following capabilities, fostering a more nuanced learning process.
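One possible way to realize this customization is sketched below: a small random subset of the mixed samples is selected, and each selected instruction is augmented with its hindsight-generated AlchemistPrompt. The 5% ratio follows the paper; the concatenation format and the generate_alchemist_prompt helper (sketched earlier) are assumptions.

```python
# A minimal sketch of customizing ~5% of the mixed data with AlchemistPrompts.
# The 5% ratio follows the paper; the concatenation format and the
# generate_alchemist_prompt helper (sketched earlier) are assumptions.
import random

def apply_alchemist_prompts(mixed, ratio=0.05, seed=0):
    """Augment a random ~5% subset of instructions with data-specific prompts."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(mixed)), k=max(1, int(ratio * len(mixed))))
    for i in indices:
        sample = mixed[i]
        hint = generate_alchemist_prompt(sample["instruction"], sample["response"])
        # Fold the hindsight-generated, data-specific prompt into the instruction.
        sample["instruction"] = f"{sample['instruction']}\n\n{hint}"
    return mixed
```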

Code Comprehension Task: Existing training datasets for Code LLMs primarily center on code generation tasks, providing programming problems and solutions. However, we advocate for expanding beyond this, recognizing the value in the higher-level abilities demonstrated during code data construction. Thus, to enhance Code LLM performance, we introduce three code comprehension tasks related to data construction: instruction evolution, data filtering, and code review.
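To make the three tasks concrete, the sketch below shows one way the artifacts already produced during data construction could be recast as instruction-tuning samples. The task phrasings and field names are illustrative assumptions rather than the paper's exact templates.

```python
# A minimal sketch of turning data-construction artifacts into the three code
# comprehension tasks. Task phrasings and field names are illustrative
# assumptions, not the paper's exact templates.

def build_comprehension_samples(original, evolved, kept, reason, solution, review):
    """Build one training sample per task from pipeline artifacts."""
    return [
        {   # 1) Instruction evolution: reproduce the evolved instruction.
            "instruction": "Rewrite this programming problem into a more "
                           f"complex variant:\n{original}",
            "response": evolved,
        },
        {   # 2) Data filtering: judge whether a sample should be kept.
            "instruction": "Should the following sample be kept in a code "
                           f"fine-tuning dataset? Explain briefly.\n{evolved}",
            "response": ("Keep. " if kept else "Discard. ") + reason,
        },
        {   # 3) Code review: critique a candidate solution.
            "instruction": "Review the following solution and point out any "
                           f"issues:\n{solution}",
            "response": review,
        },
    ]
```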

Results


We adopt 9 benchmarks to evaluate our AlchemistCoder series models, including 6 code benchmarks (HumanEval, HumanEval+, MBPP, MBPP+, HumanEval-X, and DS-1000) and 3 mainstream benchmarks (MMLU for multitask language understanding, BBH for comprehensive reasoning, and GSM8K for mathematical ability).


Table 1. Performance of AlchemistCoder series on Python code generation benchmarks (HumanEval/HumanEval+ and MBPP/MBPP+).


Table 2. Performance of AlchemistCoder series on mainstream benchmarks for generic capabilities.

Case Study


The efficacy of AlchemistPrompts is twofold: 1) Harmonization between different data sources: AlchemistPrompts generated by the same LLM share a similar style and can bridge stylistic differences between sources, while AlchemistPrompt-customized data, accounting for only 5% of the corpus, strikes a balance between data diversity and domain gap; 2) Harmonization within instruction-response pairs: as fine-grained, data-specific prompts, AlchemistPrompts augment instructions with the specific programming languages, algorithmic concepts, and other code-related information found in responses, refining the alignment within instruction-response pairs and enhancing the instruction-following abilities of fine-tuned models.


Figure 3. Example #1 of AlchemistPrompts.


Figure 4. Example #2 of AlchemistPrompts.


