CIF-Bench

A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

Yizhi Li*, Ge Zhang*, Xingwei Qu*,
Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang,
Chenghua Lin, Wenhu Chen, Jie Fu
Task category distribution of CIF-Bench. The benchmark is distinguished by its comprehensiveness: the radii fall into three groups, determined by the number of tasks each category contains (≤ 10, ≤ 20, and > 20).

Abstract

The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate evaluation bias, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work aims to uncover the current limitations of LLMs in handling Chinese tasks, pushing towards the development of more culturally informed and linguistically diverse models with the released data and benchmark.

CIF-Bench

Overview

We introduce the Chinese Instruction-Following Benchmark (CIF-Bench), a novel benchmark designed for the zero-shot generalizability evaluation of LLMs, with Chinese serving as an insightful example of multilingual instruction-following transfer. Our benchmark comprises 150 tasks and 15,000 input-output pairs, constructed with the assistance of native-speaker annotators to ensure the inclusion of human-authored tasks that are not only challenging but also naturally expressed. A significant portion (38.7%) of these tasks is designed to test a model's complex natural language inference (NLI) and reasoning capabilities, as well as its knowledge of Chinese culture, spread across 20 distinct categories. To mitigate future evaluation biases from data leakage, we publicly release only half of the data instances, reserving the rest as a private dataset to maintain an impartial benchmark. Furthermore, CIF-Bench enhances its robustness by introducing 5 variations of instructions per task, using these to reduce score variance in private-split evaluations. CIF-Bench also pioneers a model-based automatic pipeline designed to tackle the inherent challenges of evaluating open-ended natural language generation outputs.
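As a quick sanity check on the counts above, the reported totals are consistent under one reading, which is an assumption on our part rather than a statement from the benchmark: the public half keeps a single instruction per input-output pair, while the private half pairs each sample with all 5 instruction variants.

```python
# Back-of-the-envelope check of the CIF-Bench instance counts.
# ASSUMPTION: the public split uses 1 instruction per input-output pair,
# while the private split uses all 5 instruction variants per pair.
TOTAL_PAIRS = 15_000
PUBLIC_PAIRS = PRIVATE_PAIRS = TOTAL_PAIRS // 2   # half released, half held out
INSTRUCTION_VARIANTS = 5

total_instances = PUBLIC_PAIRS * 1 + PRIVATE_PAIRS * INSTRUCTION_VARIANTS
print(total_instances)  # 45000, matching the total reported in the abstract
```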

A large language model can tackle an English task translated into Chinese, but fails to respond to an instruction originally written in Chinese.

By selecting a range of popular LLMs that support Chinese for evaluation, we aim to map the limits of current instruction-following capabilities in language-transfer contexts, as many models follow an English-oriented pre-training paradigm (Huang et al., 2023b). Our findings reveal that even the best-performing model achieves a score of only 52.9% on CIF-Bench, underscoring the gap that emerges when LLMs are confronted with tasks in a less familiar language and unseen data instances. This performance drop is particularly noticeable in scenarios involving unseen tasks and unseen input-output pairs, in contrast with the models' performance on existing Chinese datasets and translated English-language tasks. Such results suggest that while LLMs exhibit impressive generalizability in contexts more aligned with their observed data, their effectiveness diminishes when faced with the dual challenges of unfamiliar languages and novel tasks.

To summarize our contributions, we: (1) Present a new benchmark that addresses a critical gap in existing NLP research by focusing on the generalizability of LLMs to a language underrepresented in training and evaluation resources; (2) Construct an instruction-following evaluation dataset with 150 tasks and 45,000 data samples, and release half of the input-output pairs for future LLM evaluation research; (3) Provide an in-depth analysis of 28 LLMs, revealing their limitations in adapting to less familiar languages and task contexts, and offering insights into where improvements are needed for instruction-following generalizability.

Experiment Results

Evaluation Metrics

As CIF-Bench aims to provide a comprehensive evaluation of LLM instruction-following capability, we argue that metrics should be designed case by case at the task level to evaluate open-ended textual outputs, rather than simply reformatting all tasks into multiple-choice questions and using conditional probability to approximate the models' predictions.

After a thorough review of the task instructions, we categorize the output requirements into the following four types and design corresponding task-level metrics:
- Multi-class Classification: We use accuracy as the metric if the task requires the model to predict exactly one label out of 2 or more classes.
- Multi-label Classification: We use the F1 score as the metric if the task requires the model to predict one or more labels out of 2 or more classes.
- Creative Generation: For tasks with no absolute criterion for a standard answer, we require a model-based evaluator to assess a given output in terms of creativity, fluency, the level of instruction-following, and the evaluator's own confidence.
- Semantic Similarity: For the remaining tasks, which can be evaluated by the semantic similarity between the golden reference and the model output, we use a pre-trained language model.
All scores used in CIF-Bench either naturally range from 0 to 1 or are normalized to that range.
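To make the four-way metric design concrete, here is a minimal dispatch sketch. The `task_type` labels and the `judge` callback are illustrative assumptions, not CIF-Bench's released code.

```python
from sklearn.metrics import accuracy_score, f1_score

def score_task(task_type, predictions, references, judge=None):
    """Return a task-level score in [0, 1].

    The task_type labels and the `judge` callback are illustrative
    placeholders rather than CIF-Bench's actual implementation.
    """
    if task_type == "multi_class":
        # exactly one gold label per instance -> accuracy
        return accuracy_score(references, predictions)
    if task_type == "multi_label":
        # one or more gold labels per instance, given as binary indicator
        # arrays -> micro-averaged F1, which stays in [0, 1]
        return f1_score(references, predictions, average="micro")
    # creative-generation and semantic-similarity tasks are scored by an
    # external judge (a model-based evaluator or a similarity metric) that
    # is expected to return values already normalized to [0, 1]
    return judge(predictions, references)
```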

One core dilemma in evaluating the open-ended instruction-following capabilities of LLMs is that model predictions are hard to verify even with reference answers. For instance, it is intractable to handcraft regex rules to extract predictions from LLM outputs across such an extensive number of tasks, since the answers can be expressed in various formats or buried in redundant context such as the reasoning process. Inspired by G-Eval, we leverage OpenAI's GPT-4 as a relatively reliable evaluator for multi-class classification, multi-label classification, and creative generation tasks to overcome these issues. The GPT-4 evaluator is prompted to assess the outputs according to the given task instruction and the input-output reference. For answers that can be evaluated with semantic similarity, we use BLEURT, a lightweight multilingual encoder-based metric, to measure the relevance between the reference and the LLM output.
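Below is a minimal sketch of the two evaluator paths, assuming the OpenAI Python client and Google's reference bleurt package. The prompt wording, the numeric-only rating scheme, and the checkpoint path are illustrative assumptions rather than the benchmark's exact setup.

```python
from openai import OpenAI
from bleurt import score as bleurt_score  # https://github.com/google-research/bleurt

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_with_gpt4(instruction, task_input, reference, output):
    """Ask GPT-4 to rate an open-ended output; prompt wording is illustrative."""
    prompt = (
        "You are evaluating a model's answer to a Chinese instruction-following task.\n"
        f"Task instruction: {instruction}\n"
        f"Task input: {task_input}\n"
        f"Reference answer: {reference}\n"
        f"Model output: {output}\n"
        "Rate how well the output follows the instruction on a scale from 0 to 1 "
        "and reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

def similarity_with_bleurt(references, outputs, checkpoint="BLEURT-20"):
    """Score reference/output pairs with BLEURT (checkpoint path is illustrative)."""
    scorer = bleurt_score.BleurtScorer(checkpoint)
    scores = scorer.score(references=references, candidates=outputs)
    # BLEURT-20 scores fall roughly in [0, 1]; clip so they match the other metrics.
    return [min(max(s, 0.0), 1.0) for s in scores]
```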

Given a set of task instructions $I_t$ for task $t$, we denote the performance score of model $m$ on task $t$ as
$$S_{m,t} = \frac{1}{|I_t|\,|D_t|} \sum_{i \in I_t} \sum_{(x,\,y) \in D_t} s\big(m(i, x),\, y\big),$$
where $D_t$ refers to the set of data samples for task $t$ and $s(\cdot,\cdot)$ is the task-level metric described above. In the case of the public split, the instruction set $I_t$ is reduced to a single element. Finally, we take the average of the task-level scores, $S_m = \frac{1}{|T|}\sum_{t \in T} S_{m,t}$, as the indicator of the overall performance of model $m$.
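Read as code, this aggregation amounts to a plain average over instruction variants and samples, then over tasks. The dictionary layout below is only an illustrative way to organize the scores, not the benchmark's data format.

```python
from statistics import mean

def task_score(scores_by_instruction):
    """Average instance-level scores over all instruction variants of one task.

    `scores_by_instruction` maps each instruction variant to the list of
    per-sample scores obtained with it (one variant in the public split,
    five in the private split).
    """
    return mean(s for scores in scores_by_instruction.values() for s in scores)

def overall_score(task_scores):
    """Unweighted average of task-level scores for one model."""
    return mean(task_scores)

# Toy usage with made-up numbers:
task_a = task_score({"inst_1": [0.8, 0.6], "inst_2": [0.7, 0.5]})  # 0.65
print(overall_score([task_a, 0.40, 0.55]))                         # ~0.533
```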

Leaderboard

| Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-13B-Chat | 0.529 | 0.520 | 0.674 | 0.333 | 0.641 | 0.497 | 0.686 | 0.542 | 0.528 | 0.578 | 0.563 | 0.632 | 0.569 | 0.515 | 0.752 | 0.624 | 0.459 | 0.462 | 0.332 | 0.441 | 0.273 |
| Qwen-72B-Chat | 0.519 | 0.486 | 0.630 | 0.296 | 0.634 | 0.508 | 0.634 | 0.458 | 0.520 | 0.494 | 0.550 | 0.626 | 0.565 | 0.528 | 0.762 | 0.613 | 0.496 | 0.459 | 0.282 | 0.608 | 0.271 |
| Yi-34B-Chat | 0.512 | 0.483 | 0.606 | 0.347 | 0.623 | 0.497 | 0.598 | 0.480 | 0.490 | 0.575 | 0.525 | 0.619 | 0.554 | 0.494 | 0.757 | 0.580 | 0.472 | 0.439 | 0.346 | 0.514 | 0.259 |
| Qwen-14B-Chat | 0.500 | 0.481 | 0.582 | 0.307 | 0.614 | 0.494 | 0.645 | 0.428 | 0.475 | 0.496 | 0.513 | 0.616 | 0.548 | 0.507 | 0.764 | 0.583 | 0.469 | 0.453 | 0.283 | 0.575 | 0.262 |
| Deepseek-Llm-67B-Chat | 0.471 | 0.467 | 0.571 | 0.259 | 0.577 | 0.486 | 0.549 | 0.442 | 0.476 | 0.475 | 0.509 | 0.566 | 0.496 | 0.439 | 0.711 | 0.546 | 0.409 | 0.436 | 0.262 | 0.570 | 0.235 |
| Baichuan-13B-Chat | 0.450 | 0.408 | 0.491 | 0.286 | 0.552 | 0.439 | 0.670 | 0.417 | 0.422 | 0.482 | 0.486 | 0.565 | 0.505 | 0.377 | 0.704 | 0.552 | 0.387 | 0.402 | 0.350 | 0.431 | 0.304 |
| Chatglm3-6B | 0.436 | 0.381 | 0.439 | 0.330 | 0.541 | 0.452 | 0.577 | 0.310 | 0.358 | 0.436 | 0.453 | 0.544 | 0.503 | 0.414 | 0.762 | 0.560 | 0.446 | 0.402 | 0.321 | 0.391 | 0.270 |
| Yi-6B-Chat | 0.417 | 0.402 | 0.454 | 0.313 | 0.523 | 0.425 | 0.506 | 0.383 | 0.383 | 0.487 | 0.396 | 0.523 | 0.457 | 0.369 | 0.754 | 0.482 | 0.401 | 0.380 | 0.310 | 0.455 | 0.227 |
| Baichuan2-7B-Chat | 0.412 | 0.437 | 0.647 | 0.160 | 0.520 | 0.402 | 0.580 | 0.511 | 0.444 | 0.455 | 0.407 | 0.489 | 0.395 | 0.406 | 0.670 | 0.517 | 0.342 | 0.298 | 0.101 | 0.463 | 0.138 |
| Chatglm2-6B | 0.352 | 0.278 | 0.469 | 0.346 | 0.403 | 0.424 | 0.535 | 0.274 | 0.397 | 0.406 | 0.240 | 0.397 | 0.352 | 0.326 | 0.714 | 0.438 | 0.298 | 0.313 | 0.320 | 0.461 | 0.190 |
| Chatglm-6B-Sft | 0.349 | 0.265 | 0.454 | 0.365 | 0.385 | 0.462 | 0.554 | 0.296 | 0.379 | 0.427 | 0.232 | 0.380 | 0.321 | 0.292 | 0.718 | 0.415 | 0.296 | 0.333 | 0.351 | 0.441 | 0.190 |
| Chinese-Llama2-Linly-13B | 0.344 | 0.250 | 0.462 | 0.311 | 0.399 | 0.429 | 0.557 | 0.273 | 0.358 | 0.385 | 0.268 | 0.390 | 0.330 | 0.313 | 0.653 | 0.433 | 0.279 | 0.332 | 0.292 | 0.457 | 0.181 |
| Gpt-3.5-Turbo-Sft | 0.343 | 0.269 | 0.427 | 0.298 | 0.389 | 0.395 | 0.575 | 0.325 | 0.365 | 0.389 | 0.226 | 0.382 | 0.394 | 0.345 | 0.710 | 0.433 | 0.324 | 0.266 | 0.290 | 0.397 | 0.225 |
| Chinese-Alpaca-2-13B | 0.341 | 0.242 | 0.421 | 0.356 | 0.382 | 0.442 | 0.602 | 0.256 | 0.363 | 0.430 | 0.210 | 0.376 | 0.334 | 0.317 | 0.714 | 0.459 | 0.299 | 0.316 | 0.308 | 0.452 | 0.200 |
| Chinese-Alpaca-13B | 0.334 | 0.250 | 0.399 | 0.348 | 0.364 | 0.435 | 0.616 | 0.275 | 0.349 | 0.421 | 0.223 | 0.370 | 0.309 | 0.319 | 0.724 | 0.426 | 0.285 | 0.307 | 0.298 | 0.445 | 0.181 |
| Chinese-Alpaca-7B | 0.334 | 0.216 | 0.412 | 0.378 | 0.381 | 0.425 | 0.576 | 0.265 | 0.359 | 0.393 | 0.243 | 0.383 | 0.326 | 0.295 | 0.710 | 0.409 | 0.301 | 0.327 | 0.325 | 0.405 | 0.186 |
| Chinese-Llama2-Linly-7B | 0.333 | 0.218 | 0.451 | 0.330 | 0.396 | 0.427 | 0.583 | 0.248 | 0.350 | 0.410 | 0.231 | 0.367 | 0.345 | 0.276 | 0.698 | 0.433 | 0.259 | 0.315 | 0.310 | 0.469 | 0.168 |
| Tigerbot-13B-Chat | 0.331 | 0.205 | 0.397 | 0.309 | 0.385 | 0.420 | 0.614 | 0.310 | 0.379 | 0.341 | 0.276 | 0.363 | 0.329 | 0.301 | 0.694 | 0.419 | 0.280 | 0.310 | 0.283 | 0.393 | 0.186 |
| Telechat-7B | 0.329 | 0.267 | 0.338 | 0.321 | 0.420 | 0.404 | 0.420 | 0.272 | 0.265 | 0.327 | 0.320 | 0.388 | 0.355 | 0.244 | 0.672 | 0.344 | 0.334 | 0.335 | 0.299 | 0.364 | 0.184 |
| Ziya-Llama-13B | 0.329 | 0.196 | 0.402 | 0.324 | 0.341 | 0.428 | 0.616 | 0.312 | 0.349 | 0.400 | 0.228 | 0.351 | 0.279 | 0.313 | 0.721 | 0.468 | 0.311 | 0.291 | 0.278 | 0.431 | 0.175 |
| Chinese-Alpaca-33B | 0.326 | 0.234 | 0.370 | 0.372 | 0.364 | 0.429 | 0.614 | 0.246 | 0.318 | 0.377 | 0.221 | 0.368 | 0.300 | 0.314 | 0.713 | 0.428 | 0.288 | 0.303 | 0.295 | 0.401 | 0.199 |
| Tigerbot-7B-Chat | 0.325 | 0.218 | 0.395 | 0.306 | 0.370 | 0.413 | 0.631 | 0.294 | 0.370 | 0.368 | 0.215 | 0.355 | 0.313 | 0.292 | 0.713 | 0.415 | 0.283 | 0.315 | 0.290 | 0.389 | 0.171 |
| Chinese-Alpaca-2-7B | 0.323 | 0.215 | 0.374 | 0.335 | 0.366 | 0.415 | 0.546 | 0.257 | 0.326 | 0.395 | 0.215 | 0.375 | 0.318 | 0.289 | 0.698 | 0.417 | 0.285 | 0.303 | 0.312 | 0.439 | 0.193 |
| Aquilachat-7B | 0.309 | 0.162 | 0.234 | 0.291 | 0.320 | 0.437 | 0.344 | 0.135 | 0.266 | 0.309 | 0.287 | 0.337 | 0.342 | 0.236 | 0.609 | 0.255 | 0.249 | 0.400 | 0.527 | 0.430 | 0.306 |
| Moss-Moon-003-Sft | 0.302 | 0.214 | 0.405 | 0.274 | 0.347 | 0.380 | 0.448 | 0.305 | 0.341 | 0.378 | 0.232 | 0.317 | 0.321 | 0.267 | 0.694 | 0.375 | 0.251 | 0.259 | 0.288 | 0.424 | 0.152 |
| Qwen-7B-Chat | 0.301 | 0.211 | 0.410 | 0.289 | 0.349 | 0.391 | 0.531 | 0.219 | 0.387 | 0.404 | 0.208 | 0.325 | 0.297 | 0.278 | 0.681 | 0.419 | 0.266 | 0.251 | 0.248 | 0.371 | 0.157 |
| Belle-13B-Sft | 0.264 | 0.198 | 0.307 | 0.285 | 0.316 | 0.349 | 0.409 | 0.237 | 0.305 | 0.222 | 0.177 | 0.317 | 0.284 | 0.242 | 0.631 | 0.299 | 0.244 | 0.222 | 0.234 | 0.296 | 0.133 |
| Cpm-Bee-10B | 0.244 | 0.234 | 0.377 | 0.024 | 0.278 | 0.311 | 0.255 | 0.302 | 0.278 | 0.327 | 0.148 | 0.286 | 0.224 | 0.147 | 0.603 | 0.277 | 0.117 | 0.263 | 0.220 | 0.352 | 0.125 |

BibTeX


      @article{li2024cifbench,
        title={CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models}, 
        author={Yizhi LI and Ge Zhang and Xingwei Qu and Jiali Li and Zhaoqun Li and Zekun Wang and Hao Li and Ruibin Yuan and Yinghao Ma and Kai Zhang and Wangchunshu Zhou and Yiming Liang and Lei Zhang and Lei Ma and Jiajun Zhang and Zuowen Li and Stephen W. Huang and Chenghua Lin and Wenhu Chen and Jie Fu},
        year={2024},
        eprint={2402.13109},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
      }