GLM Chat Completion API Application and Usage

GLM (General Language Model) is a series of large language models from Zhipu AI (Z.ai) with strong understanding and generation capabilities in both Chinese and English. The models perform well on Chinese-language tasks, code generation, reasoning, and multi-turn dialogue. Newer models such as GLM-5.1, GLM-4.7, and GLM-4.6 are further optimized for long context, tool invocation, and code tasks, and suit scenarios such as intelligent Q&A, content creation, code assistance, and customer service bots.

This document describes how to use the GLM Chat Completion API, which lets you call the GLM series models through a unified, OpenAI-compatible interface.

## Application Process

To use the GLM Chat Completion API, first visit the GLM Chat Completion API page and click the "Acquire" button to obtain the credentials needed for requests.

If you are not logged in or registered, you will be redirected to the login page to register or log in, and then returned to the current page automatically.

A free quota is granted on your first application, so you can try the API at no cost.

## Basic Usage

The request address for the GLM Chat Completion API is https://api.acedata.cloud/glm/chat/completions. It uses Bearer Token authentication, and the request body is compatible with the OpenAI Chat Completions protocol.

When using this interface for the first time, you need to fill in at least three fields:

authorization: Select the Bearer Token directly from the dropdown list.
model: The GLM model to call. Currently supported models:
- glm-5.1: The latest flagship model with the strongest overall capabilities.
- glm-4.7: Excellent performance in reasoning, tool invocation, and code tasks.
- glm-4.6: General dialogue model, balancing effect and cost.
- glm-4.5-air: Lightweight version, faster response, lower price, suitable for high concurrency scenarios.
- glm-3-turbo: Classic dialogue model, suitable for general text generation tasks.
messages: An array of prompt messages. Each message contains role and content, where role is user, assistant, or system (tool calling additionally uses a tool role for function results).

Common optional parameters:

max_tokens: Limits the maximum number of tokens in a single reply.
temperature: Sampling randomness, from 0 to 2. Higher values produce more varied output.
top_p: Nucleus sampling parameter, controlling the cumulative probability threshold of candidate tokens.
n: How many candidate replies to generate at once.
stream: Whether to enable streaming response, default is false.
stop: Custom stop sequence.
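
As a rough illustration of how these optional parameters sit alongside the required fields, the sketch below builds a request body. The helper name `build_payload` and the specific values are assumptions chosen for illustration, not recommended settings.

```python
def build_payload(prompt, model="glm-4.5-air", **options):
    """Build a request body. `options` holds the optional parameters
    listed above: max_tokens, temperature, top_p, n, stream, stop."""
    allowed = {"max_tokens", "temperature", "top_p", "n", "stream", "stop"}
    unknown = set(options) - allowed
    if unknown:
        # Catch typos locally instead of sending them to the API.
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    payload.update(options)
    return payload


payload = build_payload(
    "Summarize GLM in one sentence.",
    max_tokens=256,    # cap the length of the reply
    temperature=0.7,   # illustrative value inside the documented 0 to 2 range
    stop=["\n\n"],     # stop generating at the first blank line
)
```

The resulting `payload` can be passed as the `json` argument of `requests.post`, in the same way as the example below.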

Here is a simple Python calling example:

import requests

url = "https://api.acedata.cloud/glm/chat/completions"

headers = {
    "accept": "application/json",
    "authorization": "Bearer {token}",
    "content-type": "application/json"
}

payload = {
    "model": "glm-4.5-air",
    "messages": [
        {"role": "user", "content": "hello"}
    ]
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)

The call returns a result like the following:

{
  "id": "msg_202604262252030313862701a04e33",
  "model": "glm-4.5-air",
  "object": "chat.completion",
  "created": 1777215124,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! 👋 How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 23,
    "total_tokens": 33
  }
}

The main fields of the returned result are explained as follows:

id: The unique ID of this dialogue task.
created: The creation time of this dialogue task (Unix timestamp, in seconds).
model: The name of the GLM model actually called.
choices: The list of replies generated by the model. choices[i].message.content is the specific text of the model's reply, and finish_reason indicates the reason for ending (stop, length, tool_calls, content_filter, etc.).
usage: Token usage statistics for this request, including prompt_tokens, completion_tokens, and total_tokens.
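
To make the field descriptions above concrete, here is a minimal sketch that pulls the reply text, the finish reason, and the token total out of a parsed response. The helper name `extract_reply` is hypothetical, and the sample dictionary is an abbreviated copy of the response shown earlier.

```python
def extract_reply(result):
    """Return (text, finish_reason, total_tokens) for the first choice."""
    choice = result["choices"][0]
    usage = result.get("usage") or {}
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        usage.get("total_tokens"),
    )


# Abbreviated copy of the sample response from this section.
result = {
    "id": "msg_202604262252030313862701a04e33",
    "model": "glm-4.5-air",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 23, "total_tokens": 33},
}

text, reason, total = extract_reply(result)
if reason == "length":
    # The reply hit the token limit before finishing.
    print("Reply was truncated; consider raising max_tokens.")
print(text, total)
```

With a live call, `result` would be `response.json()` from the request above.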

## Streaming Response

This interface supports streaming responses (Server-Sent Events). This is especially useful for web integration, where it lets a page display the reply incrementally as it is generated.

To enable streaming, set the stream parameter in the request body to true.

Python example:

import requests

url = "https://api.acedata.cloud/glm/chat/completions"

headers = {
    "accept": "application/json",
    "authorization": "Bearer {token}",
    "content-type": "application/json"
}

payload = {
    "model": "glm-4.5-air",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": True
}

response = requests.post(url, json=payload, headers=headers, stream=True)
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))

The output looks like this (excerpt):

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [{"delta": {"content": "", "role": "assistant"}, "finish_reason": null, "index": 0}], "usage": null}

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [{"delta": {"content": "Hello! How can I"}, "finish_reason": null, "index": 0}], "usage": null}

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [{"delta": {"content": "assist you"}, "finish_reason": null, "index": 0}], "usage": null}

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [{"delta": {"content": "?"}, "finish_reason": null, "index": 0}], "usage": null}

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": null}

data: {"id": "msg_2026042622521271f765bbc3734ce1", "object": "chat.completion.chunk", "created": 1777215133, "model": "glm-4.5-air", "choices": [], "usage": {"prompt_tokens": 1420, "completion_tokens": 18, "total_tokens": 1438}}

data: [DONE]

The response consists of many data lines, each carrying one incremental chunk. choices[i].delta.content is the new text segment in the current chunk, and concatenating these segments in order yields the complete reply. The chunk with finish_reason set marks the end of the generated text, the following chunk with an empty choices array and a non-null usage summarizes the token usage for the request, and a data line whose content is [DONE] signals the end of the stream.
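
The concatenation described above can be sketched as follows. The function name `collect_stream` and the shortened sample lines are assumptions for illustration; the parsing relies only on the chunk shape shown in the excerpt.

```python
import json


def collect_stream(lines):
    """Concatenate delta.content from SSE lines; return (text, usage)."""
    parts, usage = [], None
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]  # only the summary chunk has non-null usage
    return "".join(parts), usage


# Shortened sample lines in the same shape as the excerpt above.
sample = [
    'data: {"choices": [{"delta": {"content": "", "role": "assistant"}, "finish_reason": null, "index": 0}], "usage": null}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}, "finish_reason": null, "index": 0}], "usage": null}',
    'data: {"choices": [{"delta": {"content": ", world"}, "finish_reason": null, "index": 0}], "usage": null}',
    'data: {"choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": null}',
    'data: {"choices": [], "usage": {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}}',
    "data: [DONE]",
]

text, usage = collect_stream(sample)
print(text, usage)
```

With the `requests` example above, `response.iter_lines()` can be passed directly as `lines`, since the function accepts both bytes and strings.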

JavaScript (Node.js) example:

const options = {
  method: "POST",
  headers: {
    accept: "application/json",
    authorization: "Bearer {token}",
    "content-type": "application/json"
  },
  body: JSON.stringify({
    model: "glm-4.5-air",
    messages: [{ role: "user", content: "hi" }],
    stream: true
  })
};

const response = await fetch("https://api.acedata.cloud/glm/chat/completions", options);
const reader = response.body.getReader();
const decoder = new TextDecoder("utf-8");
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value));
}

Java example code:

JSONObject jsonObject = new JSONObject();
jsonObject.put("model", "glm-4.5-air");
jsonObject.put("messages", new JSONArray().put(new JSONObject().put("role", "user").put("content", "hi")));
jsonObject.put("stream", true);
MediaType mediaType = MediaType.parse("application/json; charset=utf-8");
RequestBody body = RequestBody.create(jsonObject.toString(), mediaType);
Request request = new Request.Builder()
  .url("https://api.acedata.cloud/glm/chat/completions")
  .post(body)
  .addHeader("accept", "application/json")
  .addHeader("authorization", "Bearer {token}")
  .addHeader("content-type", "application/json")
  .build();

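OkHttpClient client = new OkHttpClient();
// Read the SSE stream line by line; body().string() would buffer the
// whole response and only print it after the stream has finished.
try (Response response = client.newCall(request).execute()) {
  okio.BufferedSource source = response.body().source();
  while (!source.exhausted()) {
    String line = source.readUtf8Line();
    if (line != null && !line.isEmpty()) {
      System.out.println(line);
    }
  }
}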

Other languages can be rewritten similarly; the principle is the same.

## Multi-turn Dialogue

To implement multi-turn dialogue, place the conversation history into the messages array in chronological order, alternating user and assistant messages.

Python example:

import requests

url = "https://api.acedata.cloud/glm/chat/completions"

headers = {
    "accept": "application/json",
    "authorization": "Bearer {token}",
    "content-type": "application/json"
}

payload = {
    "model": "glm-4.5-air",
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I assist you today?"},
        {"role": "user", "content": "What did I say just now?"}
    ]
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)

Sending the full conversation history this way gives the model the context it needs, and the request returns the following response:

{
  "id": "msg_20260426225208b95324e9945a48d3",
  "model": "glm-4.5-air",
  "object": "chat.completion",
  "created": 1777215128,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "You said: **\"Hello\"** 😊\n\nLet me know if you need anything else!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 37,
    "total_tokens": 85
  }
}

As you can see, choices has the same structure as in basic usage. The model replies based on the complete dialogue history, which is what enables multi-turn contextual interaction.
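
Because the API is stateless, the client has to keep the history and resend it on every turn. A minimal sketch of that bookkeeping follows; `chat_turn` and `fake_send` are hypothetical names, and `fake_send` stands in for the HTTP request so the example runs without a token.

```python
def chat_turn(history, user_text, send):
    """Append a user message, call `send(messages)`, and record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_send(messages):
    # Stand-in for the HTTP call. A real version would POST `messages`
    # in the payload and return choices[0].message.content.
    return f"(reply to: {messages[-1]['content']})"


history = []
chat_turn(history, "Hello", fake_send)
chat_turn(history, "What did I say just now?", fake_send)
print([m["role"] for m in history])
```

Keeping the history in one list preserves the alternating user and assistant order automatically. Long conversations grow prompt_tokens on every turn, so older messages may need trimming.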

## System Prompt

To constrain the model's role, style, or behavior, add a message whose role is system at the beginning of messages:

payload = {
    "model": "glm-4.7",
    "messages": [
        {"role": "system", "content": "You are a senior Chinese writing assistant, please respond in a concise and professional tone."},
        {"role": "user", "content": "Please introduce the GLM model in three sentences."}
    ]
}

## Tool Calling

The GLM models support OpenAI-compatible function calling. Declare the callable functions through the tools parameter, and the model returns structured function call information in choices[i].message.tool_calls when it decides a call is needed.

payload = {
    "model": "glm-4.7",
    "messages": [
        {"role": "user", "content": "What is the weather like in Beijing today?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Query the weather for a specified city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                }
            }
        }
    ]
}

If the model decides to call a tool, finish_reason in the returned result is tool_calls, and message.tool_calls contains the function name together with its arguments as a JSON string. Execute that function yourself, then send the result back to the model as a message whose role is tool to complete the tool calling loop.
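
The client side of that loop might look like the sketch below. The `tool_calls` entry shape (`id`, `function.name`, `function.arguments`) and the `tool_call_id` field on the reply message follow the OpenAI function calling format, which this API is described as compatible with; they are assumptions here, since this document does not show a tool call response. `get_weather` is a hypothetical local function returning fixed data.

```python
import json


def get_weather(city):
    # Hypothetical local implementation; a real one would query a weather service.
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool and return the `tool` messages to send back."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        func = call["function"]
        args = json.loads(func["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[func["name"]](**args)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # assumed OpenAI-style linkage to the call
            "content": json.dumps(result, ensure_ascii=False),
        })
    return tool_messages


# Assumed shape of the assistant message when finish_reason is "tool_calls".
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Beijing\"}"},
    }],
}

tool_messages = run_tool_calls(assistant_message)
print(tool_messages)
```

The follow-up request would then send the original messages, followed by `assistant_message` and `tool_messages`, so the model can phrase a final answer from the tool result.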

## Model Selection Recommendations

| Model           | Applicable Scenarios                          |
| --------------- | --------------------------------------------- |
| `glm-5.1`      | The strongest overall capability, recommended for complex reasoning and long document analysis |
| `glm-4.7`      | Tool invocation, code generation, Agent orchestration, and other tasks |
| `glm-4.6`      | A balanced choice for general conversation and content creation |
| `glm-4.5-air`  | Lightweight, low latency, suitable for high-concurrency customer service and Q&A scenarios |
| `glm-3-turbo`  | General text generation tasks, cost-sensitive scenarios |

## Error Handling

If a call fails, the API returns an HTTP status code together with an error code and message. For example:

- `400 token_mismatched`: Missing or invalid request parameters.
- `400 api_not_implemented`: Used unsupported parameters or models.
- `401 invalid_token`: Unauthorized, Bearer Token is missing or invalid.
- `429 too_many_requests`: Triggered rate limit, please try again later.
- `500 api_error`: Internal server error or upstream temporarily unavailable.
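
For the transient errors in this list (429 and 500), one possible client-side strategy is to retry with exponential backoff and then fall back to another model. The sketch below is an illustration under assumptions: `call_with_retry` and the `send(model)` callback returning `(status_code, body)` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import random
import time

RETRYABLE = {429, 500}


def call_with_retry(send, models, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each model in order, retrying retryable statuses with backoff.

    `send(model)` is a caller-supplied function returning (status_code, body).
    """
    last = None
    for model in models:
        for attempt in range(max_attempts):
            status, body = send(model)
            last = (status, body)
            if status == 200:
                return model, body
            if status not in RETRYABLE:
                # 400 and 401 will not succeed on retry, so fail fast.
                raise RuntimeError(f"non-retryable error {status}: {body}")
            if attempt < max_attempts - 1:
                # Delays of 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError(f"all models failed, last response: {last}")


calls = []


def fake_send(model):
    # Simulates glm-5.1 being unavailable while glm-4.7 succeeds.
    calls.append(model)
    if model == "glm-5.1":
        return 500, {"error": {"code": "api_error"}}
    return 200, {"ok": True}


model, body = call_with_retry(fake_send, ["glm-5.1", "glm-4.7"], sleep=lambda s: None)
print(model, body)
```

In real use, `send` would wrap the `requests.post` call and return `(response.status_code, response.json())`.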

### Error Response Example

```json
{
  "trace_id": "69ea9bcf-c5da-41a3-be97-c80912a08523",
  "error": {
    "code": "api_error",
    "message": "Service is temporarily unavailable, please retry later."
  }
}
```

When api_error is returned with the message "Service is temporarily unavailable, please retry later.", it usually means the upstream GLM service is temporarily unavailable. Retry with exponential backoff, or switch to another available GLM model (for example, temporarily switch from glm-5.1 to glm-4.7 or glm-4.5-air).

## Conclusion

This document has shown how to use the GLM Chat Completion API to call Zhipu AI's GLM series models, covering basic calls, streaming responses, multi-turn dialogue, system prompts, and tool calling. We hope it helps you integrate and use the API. If you have any questions, please feel free to contact our technical support team.