


learn domain-specific knowledge by pre-training on the educational corpus and stimulate various skills with tool use by fine-tuning on designed system prompts and instructions.

1 Introduction

LLMs have acquired abilities in reasoning, long-range context modeling, and task generalization by training on large-scale textual corpora with strategies such as code pre-training (Chen et al., 2021), instruction tuning (Wei et al., 2022), and reinforcement learning from human feedback (RLHF) (Stiennon et al., 2020).


However, applying LLMs to the education domain poses several challenges. One challenge (C1) is that a gap remains between LLMs and educational experts: LLMs are pre-trained on general corpora, so they lack sufficient educational knowledge and do not align well with real scenarios (e.g., essay assessment). The other challenge (C2) is that knowledge in the field of education is continually updated, while LLMs cannot learn up-to-date knowledge because of their training mechanism. Moreover, LLMs suffer from the hallucination problem and may generate responses that are not truthful.




For C1, we pre-train LLMs on a large number of educational books (e.g., psychology, ancient poetry) and 4 million cleaned, diverse instructions to learn the fundamental knowledge. Then, we fine-tune the model on 500 thousand high-quality customized instructions to activate education-specific functions (e.g., essay assessment, Socratic teaching, and emotional support) by aligning with feedback from psychology experts and frontline teachers.

For C2, we explore a retrieval-augmented technique, which enables LLMs to automatically judge the helpfulness of the retrieved information and generate the response based on the relevant information and the knowledge stored in the LLMs. In this way, EduChat can access the latest information from the internet, ensuring that the responses are accurate and credible.

Diverse system prompts and instructions are designed to control tool use and stimulate different skills, which alleviates the problem of hallucination and makes the model more applicable in real education scenarios.

2 Related Work

In education, Baladón et al. (2023) tune open-source LLMs to generate better teacher responses in the BEA 2023 Shared Task (Tack et al., 2023). However, challenges remain, such as the lack of domain knowledge in general LLMs and the need for them to align with educational abilities (e.g., essay assessment, emotional support, and Socratic teaching).

EduChat is pre-trained on a diverse education corpus to ensure the alignment of EduChat with educational abilities.

3 Core Functions of EduChat

Retrieval-Augmented Open Question Answering (QA)

Fine-grained Essay Assessment

overall scores, aspect-level ratings, and detailed comments on content, expression, paragraph, and overall evaluation.

can identify standout sentences, highlighting strengths and areas for improvement, enabling personalized guidance for students’ essay writing skills.


Socratic Teaching


Psychology-based Emotional Support


4 Data Construction

4.1 Pre-training Data

Textbooks Data

In our research, we gather a vast amount of educational textbook and online question bank data from Chinese middle and high school exams for pre-training. Additionally, we enrich our model with over 70,000 Chinese poems, providing detailed information on authors, backgrounds, and poetry appreciation to enhance its poetry creation and appreciation capabilities. To facilitate empathetic emotional support dialogues, we carefully select 60 famous works from hundreds of psychology books. These selected books belong to two main categories. The first category consists of 15 branches of psychological theory, including developmental and educational psychology, social psychology, behavioral psychology, counseling psychology, and others. The second category contains various psychological practices, which offer practical cases of psychological consultation and emotional support dialogues. By incorporating this diverse fundamental data into pre-training, our model gains a deeper understanding of education and psychology, enabling it to generate more helpful responses.


Fundamental Instruction Data

To achieve a more natural human-computer interaction, we collect a large volume of bilingual instruction-tuning data from reputable open-source repositories such as Alpaca, BELLE (Ji et al., 2023), GPT4All, OpenAssistant, FLANCoT, and Firefly. The data spans various task types, enabling our models to acquire foundational instruction-following capabilities for diverse instruction types. In addition, we source high-quality multi-turn dialogue data from MOSS (Sun et al., 2023), BELLE (Ji et al., 2023), COIG (Zhang et al., 2023a), LIMA (Zhou et al., 2023a), and ShareGPT. This data covers various dialogue contexts, including role-playing, creative writing, and code-related discussions, ensuring our models’ competence in engaging in and sustaining meaningful multi-turn conversations.


4.2 Fine-tuning Data

We construct the Educational Instruction Data for fine-tuning, which covers retrieval-augmented open QA (22.6%), emotional support (29.4%), Socratic teaching (16.8%), and essay assessment (31.2%).
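As a rough illustration of how these shares could drive a data mixture, the sketch below checks that the four percentages sum to 100 and draws task labels in proportion to them. The names MIXTURE, sample_tasks, and approximate_counts are hypothetical, and applying the shares to a total of 500 thousand instructions (the fine-tuning set size mentioned in the introduction) is an assumption rather than something the text states.

```python
import random

# Task shares reported for the Educational Instruction Data (in percent).
MIXTURE = {
    "retrieval_augmented_open_qa": 22.6,
    "emotional_support": 29.4,
    "socratic_teaching": 16.8,
    "essay_assessment": 31.2,
}


def sample_tasks(n, seed=0):
    """Draw n task labels with probability proportional to the shares."""
    rng = random.Random(seed)
    tasks = list(MIXTURE)
    weights = [MIXTURE[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=n)


def approximate_counts(total):
    """Approximate per-task example counts for a dataset of `total` items."""
    return {task: round(total * share / 100) for task, share in MIXTURE.items()}


if __name__ == "__main__":
    # Assumption: the shares apply to the 500 thousand fine-tuning instructions.
    for task, count in approximate_counts(500_000).items():
        print(task, count)
    print(sample_tasks(5))
```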

Retrieval-Augmented Open QA Data

To address hallucination and knowledge timeliness issues in open QA, we design a retrieval-augmented open QA technique. We sample high-quality data through ChatGPT scoring in relevant Open QA and Subject QA datasets. To tackle irrelevant retrieved content, we introduce self-checking. ChatGPT assesses whether the retrieved content helps answer the question and then generates the answer using a self-check, incorporating the useful retrieved content and the question. To maintain data quality, we manually verify the data during this process.


Emotional Support Data

To overcome the scarcity of Chinese emotional support dialogue data, we adopt a translation and expansion approach. We translate the widely-used English emotional support dataset, ESConv (Liu et al., 2021), into Chinese as ESConv-zh. After manual review and cleaning, we simulate multi-agent dialogues based on various patient scenarios within ESConv-zh and also collect real-life Chinese psychological counseling consultation data, incorporating patient information and diagnosis results. By training our models on diverse datasets, we empower them to provide robust emotional support and act as compassionate counselors during consultations.

Socratic Teaching Data

Teachers play a key role in guiding and encouraging heuristic exploration rather than just providing answers. To support this, we generate dialogues simulating the Socratic teaching method by incorporating multi-step Q&A involving counter-questions, challenges, and inquiries. These dialogues are manually evaluated for accuracy, fluency, and progression from easy to complex questions. Integrating this dataset into training equips our model with a strong capability in Socratic teaching, distinguishing it from other LLMs that only offer direct answers.


Essay Assessment Data

The lack of timely and detailed feedback often hinders students’ writing improvement. To tackle this issue, we create a high-quality essay assessment dataset. Initially, we collect essays and employ ChatGPT to evaluate them in terms of content, expression, and overall quality. To ensure data quality, we invite pedagogical experts to manually curate the comments. This dataset empowers EduChat with the ability to provide students with high-quality feedback, aiding in the enhancement of their writing skills.

4.3 Data Preprocessing

To enhance data quality, we conduct semantic-level deduplication on the dataset. Using the sentence-transformers model (Reimers and Gurevych, 2019), we obtain sentence embeddings for each data point and calculate cosine similarity between all pairs of embeddings. For similarities exceeding a threshold of 0.7, we remove one of the duplicates. We implement the similarity calculation using CUDA for GPU acceleration, speeding up the process.
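The deduplication rule above can be sketched in a few lines. This is a minimal CPU-only illustration, not the authors' CUDA implementation: it assumes the embeddings have already been computed, and it uses a greedy variant that keeps the first item of each near-duplicate group and compares later items only against items already kept. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.7):
    """Return the indices of items to keep.

    An item is dropped when its similarity to any already-kept item
    exceeds `threshold` (0.7 is the value given in the text).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real sentence embeddings.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(deduplicate(toy))
```

The pairwise comparison is quadratic in the number of items, which is presumably why the text mentions GPU acceleration for the similarity calculation.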



5 EduChat

5.1 Training Procedure of EduChat

The training of EduChat is mainly divided into two stages: fundamental capabilities acquisition and educational skills acquisition. In the first stage, we pre-train the model on educational books and Q&A pairs (detailed in Section 4.1) to equip it with foundational knowledge across disciplines. In addition, large-scale instruction-tuning and open-domain dialogue datasets are incorporated to enable basic instruction-following and dialogue abilities (also detailed in Section 4.1). In the second stage, we develop EduChat’s pedagogical skills by fine-tuning the model on our carefully curated data, including the retrieval-augmented open QA, emotional support, Socratic teaching, and essay assessment datasets described in Section 4.2.

5.2 Online Knowledge Retrieval

Existing generative LLMs all suffer from hallucination and outdated information, which is detrimental to an educational model. To mitigate this problem, we introduce self-check, as shown in Figure 2. Specifically, when online knowledge retrieval is enabled, the model picks useful retrieval results by asking itself "Is this helpful for answering the question?" and appends the filtered snippets before the dialogue history.
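The self-check step described above can be sketched as a filter over retrieved snippets followed by prompt assembly. This is only an illustration under stated assumptions: in EduChat the helpfulness judgement is made by the model itself, whereas here a toy keyword-overlap judge stands in for it, and the function names, the prompt layout, and the use of plain strings for dialogue turns are all hypothetical.

```python
def keyword_judge(question, snippet):
    """Toy stand-in for the model's self-check.

    Answers "Is this helpful for answering the question?" by testing
    whether any longer word of the question occurs in the snippet.
    """
    words = [w.strip("?.,!").lower() for w in question.split()]
    keywords = [w for w in words if len(w) > 3]
    text = snippet.lower()
    return any(w in text for w in keywords)


def build_prompt(question, snippets, history, is_helpful=keyword_judge):
    """Keep only snippets judged helpful and place them before the history."""
    useful = [s for s in snippets if is_helpful(question, s)]
    return "\n".join(useful + list(history) + [question])


if __name__ == "__main__":
    question = "When was the Tang dynasty founded?"
    snippets = [
        "The Tang dynasty was founded in 618.",
        "Bananas are rich in potassium.",
    ]
    history = ["User: Hello", "Assistant: Hi, how can I help?"]
    print(build_prompt(question, snippets, history))
```

Passing the judge as a parameter keeps the filtering logic separate from however the helpfulness decision is actually made, so a model-backed judge could replace the toy one without changing build_prompt.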

5.3 System Prompt Design


6 Experimental Results

6.1 Results of C-Eval


