Ling: A MoE LLM Provided and Open-sourced by InclusionAI

🤗 Hugging Face   |   🤖 ModelScope

Introduction

Ling is a MoE LLM provided and open-sourced by InclusionAI. It comes in two sizes: Ling-lite and Ling-plus. Ling-lite has 16.8 billion parameters with 2.75 billion activated parameters, while Ling-plus has 290 billion parameters with 28.8 billion activated parameters. Both models demonstrate impressive performance compared to existing models in the industry.
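
To make the difference between total and activated parameters concrete, the short sketch below computes the share of parameters that is activated for each model. It is plain arithmetic on the sizes stated above, not a measurement; the helper name `activation_ratio` is only illustrative.

```python
# Parameter counts in billions, taken from the introduction above.
MODELS = {
    "Ling-lite": {"total": 16.8, "activated": 2.75},
    "Ling-plus": {"total": 290.0, "activated": 28.8},
}


def activation_ratio(total: float, activated: float) -> float:
    """Return the fraction of the total parameters that is activated."""
    return activated / total


for name, params in MODELS.items():
    ratio = activation_ratio(params["total"], params["activated"])
    print(f"{name}: {ratio:.1%} of parameters activated")
```

Roughly one sixth of Ling-lite's parameters and about one tenth of Ling-plus's are activated, which is what keeps inference cost well below that of a dense model of the same total size.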

Their structure makes it easy to scale the models up or down and adapt them to different tasks, from natural language processing to complex problem solving. Furthermore, the open-source nature of Ling promotes collaboration and innovation within the AI community, fostering a diverse range of use cases and enhancements.

As more developers and researchers engage with the platform, we can expect rapid advancements and improvements, leading to even more sophisticated applications. This collaborative approach accelerates development and ensures that the models remain at the forefront of technology, addressing emerging challenges in various fields.

Update

  • [2025-5-10] Ling-lite-1.5 has been released! It achieves significant progress in reasoning ability compared with the previous Ling-lite.
  • [2025-4-15] Ling-lite is upgraded to Ling-lite-0415. The new model demonstrates notable improvements over its predecessor, Ling-lite-0220, especially on code and math.

Model Downloads

The table below lists the available models and their parameters so you can choose one for your use case. If you are located in mainland China, the models are also available on ModelScope.cn for faster downloads.

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| Ling-lite-base-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace, 🤖 ModelScope |
| Ling-lite-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace, 🤖 ModelScope |
| Ling-plus-base | 290B | 28.8B | 64K | 🤗 HuggingFace, 🤖 ModelScope |
| Ling-plus | 290B | 28.8B | 64K | 🤗 HuggingFace, 🤖 ModelScope |
| Ling-coder-lite-base | 16.8B | 2.75B | 16K | 🤗 HuggingFace, 🤖 ModelScope |
| Ling-coder-lite | 16.8B | 2.75B | 16K | 🤗 HuggingFace, 🤖 ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Evaluation

Ling-lite

Standard Benchmarks

| Benchmark | #shots | Ling-lite-1.5 | Ling-lite | Qwen3-4B-Instruct | Qwen3-8B-Instruct | Moonlight-16B-A3B-Instruct | LLaMA3.1-8B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (EM) | 5 | 74.33 | 71.27 | 70.09 | 75.97 | 70.74 | 68.67 |
| GPQA (Pass@1) | 0 | 36.55 | 29.73 | 40.4 | 47.10 | 19.51 | 27.59 |
| HumanEval (Pass@1) | 0 | 87.27 | 84.38 | 81.94 | 85.29 | 72.94 | 67.23 |
| LiveCodeBench 2408-2502 (Pass@1) | 0 | 22.7 | 18.94 | 21.8 | 26.88 | 14.76 | 18.41 |
| LCBench (Pass@1) | 0 | 60.37 | 46.57 | 48.61 | 60.03 | 28.39 | 23.13 |
| Math (EM) | 0 | 82.62 | 72.80 | 81.46 | 82.70 | 67.1 | 52.42 |
| AIME2024 (Pass@1) | 0 | 21.88 | 10.21 | 20.62 | 26.25 | 6.88 | 7.29 |
| OlympiadBench (Pass@1) | 0 | 52.30 | 36.44 | 54.33 | 56.11 | 32.85 | 17.04 |
| BBH (EM) | 0 | 75.75 | 66.38 | 78.21 | 79.33 | 63.45 | 68.05 |
| IFEval (Prompt Strict) | 0 | 77.70 | 77.99 | 81.06 | 83.55 | 49.01 | 73.01 |
| BFCL_live | 0 | 72.15 | 67.93 | 65.35 | 69.83 | 47.14 | 49.98 |

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. Ling-lite-1.5 has improved long text generation capability and performs well across most context window lengths up to 128K.

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-lite-1.5"

# Load the model and tokenizer; dtype and device placement are selected automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

# Render the conversation into a single prompt string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope.

Deployment

vLLM

vLLM supports both offline batched inference and online inference through an OpenAI-compatible API service.

Environment Preparation

Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:

git clone -b v0.7.3 https://github.com/vllm-project/vllm.git
cd vllm
git apply Ling/inference/vllm/bailing_moe.patch
pip install -e .

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-lite-1.5")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-lite", dtype='bfloat16')

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

# Render the conversation into a single prompt string using the chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
# Each request returns a list of candidate completions; print the first one.
print(outputs[0].outputs[0].text)

Online Inference:

vllm serve inclusionAI/Ling-lite \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --use-v2-block-manager \
    --gpu-memory-utilization 0.90
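
Once the service is running, it can be queried like any OpenAI-compatible endpoint. The sketch below uses only the Python standard library and assumes the server listens locally on vLLM's default port 8000; the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the server was started locally on vLLM's default port.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt: str, model: str = "inclusionAI/Ling-lite",
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the running service):
#   print(chat("Give me a short introduction to large language models."))
```

The `model` field should match the name passed to `vllm serve`; adjust `BASE_URL` if the service runs on another host or port.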

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service.

For detailed guidance, please refer to the vLLM instructions.
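
The first step can also be scripted instead of editing config.json by hand. The following is a minimal sketch under the assumption that config.json is a plain JSON object; the function name `add_yarn_rope_scaling` is hypothetical, and the default values are the ones from the example above. The returned product of `factor` and `original_max_position_embeddings` is only a candidate value for --max-model-len, not a documented limit.

```python
import json
from pathlib import Path


def add_yarn_rope_scaling(config_path, factor=4.0,
                          original_max_position_embeddings=32768):
    """Add a YaRN rope_scaling field to config.json and return the scaled length."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Same structure as the example rope_scaling field shown above.
    config["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original_max_position_embeddings,
        "type": "yarn",
    }
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return int(factor * original_max_position_embeddings)


# Usage (the path is a placeholder for your local model directory):
#   max_len = add_yarn_rope_scaling("path/to/model/config.json")
#   then pass the chosen length via --max-model-len when starting vLLM
```

With the example values, 4.0 × 32768 gives 131072 positions. Back up config.json before running a script like this, since it rewrites the file in place.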

MindIE

This section outlines the main steps for running a Ling MoE model on the specified hardware with the MindIE inference framework.

Configuration preparation

Create a model directory on the host for the downloads, for example /root/models. This directory is mounted into the Docker container later.

Download the MindIE-related configuration from GitHub:

cd /root/models
git clone git@github.com:inclusionAI/Ling.git

Machine network environment check

# Check the physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the link status
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# Check whether the detected IP address is correctly configured
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# Check that the NPU's underlying TLS verification behavior is consistent; all values are recommended to be 0
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
# Set the NPU's underlying TLS verification to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done

Pull the image

Go to Ascend Community/Development Resources and pull the MindIE image.

Image version: 1.0.0-800I-A2-py311-openeuler24.03-lts

The versions of each component are as follows:

| Component | Version |
| --- | --- |
| MindIE | 1.0.0 |
| CANN | 8.0.0 |
| PTA | 6.0.0.beta1 |
| HDK | 24.1.0 |

Container startup and configuration changes

Start the container

Execute the following startup command (reference):

# Replace container_name with your container name, and change the image tag to match the name of the image you loaded.
docker run -itd --privileged --name=container_name --net=host \
    --shm-size 500g \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device /dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /root/models:/home/HwHiAiUser/Ascend \
    mindie:1.0.0-XXX-800I-A2-arm64-py3.11 \
    bash

Download the model

Here we use ModelScope to download the model. Install ModelScope first:

pip install modelscope

Download the model:

# The models take a long time to download, so run the downloads in the background.
# stdout is redirected to the log file first, then stderr is pointed at the same file.
nohup modelscope download --model inclusionAI/Ling-plus --local_dir /home/HwHiAiUser/Ascend/Ling_plus > /tmp/ling_plus.log 2>&1 &
nohup modelscope download --model inclusionAI/Ling-plus-base --local_dir /home/HwHiAiUser/Ascend/Ling_plus_base > /tmp/ling_plus_base.log 2>&1 &
nohup modelscope download --model inclusionAI/Ling-lite --local_dir /home/HwHiAiUser/Ascend/Ling_lite > /tmp/ling_lite.log 2>&1 &
nohup modelscope download --model inclusionAI/Ling-lite-base --local_dir /home/HwHiAiUser/Ascend/Ling_lite_base > /tmp/ling_lite_base.log 2>&1 &

After the download completes, change the file permissions; otherwise an error is reported when MindIE-Service starts:

chmod -R 750 *.json *.py

Model weight format conversion

This section applies only to the Ling Lite models; it can be skipped for Ling Plus.

MindIE supports weights in safetensors format. If the downloaded weights are not in safetensors format, they need to be converted. Taking Ling Lite as an example, the conversion commands are as follows:

# Convert Ling lite
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor.py
cd /home/HwHiAiUser/Ascend/Ling_lite
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_safetensor/

# Convert Ling lite base
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor_base.py
cd /home/HwHiAiUser/Ascend/Ling_lite_base
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_base_safetensor/

After conversion, the loading path for the Ling Lite model becomes '/home/HwHiAiUser/Ascend/Ling_lite_safetensor', and the loading path for the Ling Lite Base model becomes '/home/HwHiAiUser/Ascend/Ling_lite_base_safetensor'.
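
Before running the conversion, it can help to check whether a download actually needs it. The sketch below is a hedged illustration: it assumes the weight files sit at the top level of the model directory with a `.bin` or `.safetensors` extension, and the function name `needs_conversion` is hypothetical.

```python
from pathlib import Path


def needs_conversion(model_dir) -> bool:
    """Return True if the directory has *.bin weights but no *.safetensors files."""
    model_dir = Path(model_dir)
    has_safetensors = any(model_dir.glob("*.safetensors"))
    has_bin = any(model_dir.glob("*.bin"))
    return has_bin and not has_safetensors


# Usage:
#   if needs_conversion("/home/HwHiAiUser/Ascend/Ling_lite"):
#       print("Run the conversion script before loading the model in MindIE.")
```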

Change the model configuration

MindIE cannot load the default model configuration file (config.json) directly, so it needs to be replaced:

# Adapt to mindie's Ling lite model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json

# Adapt to mindie's Ling lite base model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_base_config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json

# Adapt to mindie's Ling plus model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus/config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus/config.json

# Adapt to mindie's Ling plus base model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus_base/config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_base_config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus_base/config.json
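
Each of the four groups above repeats the same three steps: back up config.json, copy the MindIE-specific configuration over it, and set the permissions to 750. As an illustration only, the pattern could be wrapped in a small Python helper; `replace_config` is a hypothetical name, not a script shipped with the repository.

```python
import os
import shutil
from pathlib import Path


def replace_config(model_dir, new_config):
    """Back up config.json as config.json.bak, install new_config, set mode 750."""
    target = Path(model_dir) / "config.json"
    backup = target.with_name("config.json.bak")
    shutil.copyfile(target, backup)      # cp config.json config.json.bak
    shutil.copyfile(new_config, target)  # cp <MindIE config> config.json
    os.chmod(target, 0o750)              # chmod 750 config.json
    return backup


# Usage, mirroring the Ling lite commands above:
#   replace_config(
#       "/home/HwHiAiUser/Ascend/Ling_lite_safetensor",
#       "/home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_chat_config.json",
#   )
```

Like the shell commands, the helper overwrites an existing config.json.bak, so running it twice replaces the original backup.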

Execute the shell script that adapts MindIE to the Ling model:

bash /home/HwHiAiUser/Ascend/Ling/inference/mindie/patch_atb_llm.sh

Single-machine service-based inference (Ling lite)

Set the underlying environment variables:

source /usr/local/Ascend/atb-models/set_env.sh

Set the MindIE configuration that matches the model type:

# Ling Lite
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

# Ling Lite base
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.base.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

Start the MindIE service:

chmod 640 /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
cd $MIES_INSTALL_PATH
nohup ./bin/mindieservice_daemon > /tmp/service.log 2>&1 &

If /tmp/service.log contains the line "Daemon start success!", MindIE-Service has started successfully.
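
Because the daemon starts in the background, the success message may take a while to appear. The sketch below polls the log file until the message shows up or a timeout expires. The function name and the timeout and interval defaults are assumptions chosen for illustration; only the log path and the message come from the steps above.

```python
import time
from pathlib import Path

SUCCESS_MARKER = "Daemon start success!"


def wait_for_daemon(log_path="/tmp/service.log", timeout=300.0, interval=2.0) -> bool:
    """Poll the log until the success marker appears or the timeout elapses."""
    deadline = time.monotonic() + timeout
    path = Path(log_path)
    while True:
        # The log may not exist yet right after the daemon is launched.
        if path.exists() and SUCCESS_MARKER in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage:
#   if not wait_for_daemon():
#       print("MindIE-Service did not report success; inspect /tmp/service.log")
```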

Send a test request to verify that the service responds correctly:

# Chat model
wget -O- --post-data="{\"messages\":[{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Who are you?\"}], \"stream\": false, \"max_tokens\":100, \"model\": \"bailing_moe\", \"temperature\":0}" \
    --header='Content-Type:application/json' \
    'http://127.0.0.1:1025/v1/chat/completions'

# Base model
wget -O- --post-data='{"inputs":"My name is Olivier and I","stream":false,"parameters":{"temperature":1,"max_new_tokens":100,"do_sample":false}}' \
    --header='Content-Type:application/json' \
    'http://127.0.0.1:1025/infer'
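
The same two requests can be issued from Python, which avoids escaping the JSON body by hand. This is a minimal standard-library sketch: the payload fields, port, and endpoints are copied from the wget commands above, while the helper names (`chat_payload`, `base_payload`, `post_json`) are hypothetical.

```python
import json
import urllib.request

SERVICE_URL = "http://127.0.0.1:1025"


def chat_payload(user_text: str, max_tokens: int = 100) -> dict:
    """Request body for the chat model (/v1/chat/completions)."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
        "max_tokens": max_tokens,
        "model": "bailing_moe",
        "temperature": 0,
    }


def base_payload(inputs: str, max_new_tokens: int = 100) -> dict:
    """Request body for the base model (/infer)."""
    return {
        "inputs": inputs,
        "stream": False,
        "parameters": {"temperature": 1, "max_new_tokens": max_new_tokens, "do_sample": False},
    }


def post_json(endpoint: str, payload: dict) -> dict:
    """POST a JSON payload to the service and return the decoded JSON response."""
    request = urllib.request.Request(
        SERVICE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage (requires the running MindIE service):
#   print(post_json("/v1/chat/completions", chat_payload("Who are you?")))
#   print(post_json("/infer", base_payload("My name is Olivier and I")))
```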

Multi-machine service-based inference (Ling plus)

All of the following commands need to be executed simultaneously on all machines.

To enable multi-machine service-based inference, you need to configure a multi-machine ranktable file.

  • Get the