Chat & Generate

Learn how to chat with an LLM in Xinference.

Introduction

Models with the chat or generate ability are usually called large language models (LLMs) or text generation models. They are designed to respond to an input, commonly called a "prompt", by producing text output. In general, you can steer these models toward a task with specific instructions or by providing concrete examples.

Models with the generate ability are typically pre-trained large language models. Models with the chat ability, on the other hand, are LLMs that have been fine-tuned and aligned specifically for dialogue scenarios. In most cases, models whose names end in "chat" (for example llama-2-chat, qwen-chat) have the chat ability.

The Chat API and the Generate API offer two different ways to interact with LLMs:

The Chat API (similar to OpenAI's Chat Completion API) supports multi-turn conversations.
The Generate API (similar to OpenAI's Completions API) generates text from a text prompt.

Model ability	API endpoint	OpenAI-compatible endpoint
chat	Chat API	/v1/chat/completions
generate	Generate API	/v1/completions
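
The mapping in the table can be sketched as a small lookup. This is only an illustration: `ENDPOINTS` and `endpoint_url` are hypothetical names, not part of Xinference, and the host and port in the usage line are placeholders.

```python
# Mapping taken from the table above: model ability -> OpenAI-compatible path.
ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "generate": "/v1/completions",
}


def endpoint_url(base_url: str, ability: str) -> str:
    """Return the full URL for a model ability ("chat" or "generate")."""
    try:
        path = ENDPOINTS[ability]
    except KeyError:
        raise ValueError(f"unknown ability: {ability!r}") from None
    # Strip a trailing slash so the result never contains "//v1".
    return base_url.rstrip("/") + path


# Hypothetical host and port, for illustration only.
print(endpoint_url("http://127.0.0.1:9997", "chat"))
```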

Supported models

You can check the abilities of all built-in LLMs in Xinference.

Chat model

Chat API

Try the Chat API with cURL, the OpenAI client, or the Xinference Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "content": "What is the largest animal?",
            "role": "user",
        }
    ],
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the largest animal?"},
]
model.chat(
    messages,
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
)

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
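
To carry on a multi-turn conversation, the assistant's reply from a response like the one above is appended to the message list before the next user turn. The sketch below shows that bookkeeping on plain dictionaries. `reply_text` and `next_turn` are hypothetical helper names, not part of Xinference or the OpenAI client, and the response is a trimmed stand-in for real server output.

```python
def reply_text(response: dict) -> str:
    """Return the assistant message content of the first choice."""
    return response["choices"][0]["message"]["content"]


def next_turn(messages: list, response: dict, user_input: str) -> list:
    """Return a new message list extended with the reply and a follow-up question."""
    return messages + [
        {"role": "assistant", "content": reply_text(response)},
        {"role": "user", "content": user_input},
    ]


# A trimmed response with the same shape as the output shown above.
response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The blue whale."},
        }
    ],
}
history = [{"role": "user", "content": "What is the largest animal?"}]
history = next_turn(history, response, "How long can it grow?")
print([m["role"] for m in history])
```

The extended `history` list is what would be sent as `messages` in the next request.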

You can find more Chat API examples in the tutorial notebook.

Gradio Chat

An example of using the Xinference Chat API with the Python client.

https://github.com/xorbitsai/inference/blob/main/examples/gradio_chatinterface.py

Hybrid thinking models

Some LLMs are marked as hybrid and can be configured to use thinking mode.

Added in version v1.17.0: The request-level enable_thinking switch is supported starting from v1.17.0.

Xinference exposes a single request-level enable_thinking switch that applies across model chat templates, even where the template's own variable name differs (for example, Qwen templates use enable_thinking, while some DeepSeek templates use thinking).

Usage examples:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {"role": "user", "content": "What is the largest animal?"}
    ],
    "enable_thinking": false
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {"role": "user", "content": "What is the largest animal?"}
    ],
    extra_body={"enable_thinking": False}
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    enable_thinking=False,
)

model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    generate_config={"chat_template_kwargs": {"enable_thinking": False}},
)
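
In the cURL example, enable_thinking travels as a top-level field of the JSON request body. A minimal sketch of building that body with the standard library follows. `chat_request_body` is a hypothetical helper, and the idea that omitting the field leaves the model's default behavior in place is an assumption, not something this page states.

```python
import json
from typing import Optional


def chat_request_body(model_uid: str, question: str,
                      enable_thinking: Optional[bool] = None) -> str:
    """Serialize a chat completion request, optionally setting enable_thinking."""
    body = {
        "model": model_uid,
        "messages": [{"role": "user", "content": question}],
    }
    # Assumption: leaving the key out lets the server use the model's default.
    if enable_thinking is not None:
        body["enable_thinking"] = enable_thinking
    return json.dumps(body)


print(chat_request_body("<MODEL_UID>", "What is the largest animal?", enable_thinking=False))
```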

Generate model

Generate API

The Generate API mirrors OpenAI's Completions API.

The main difference between the Generate API and the Chat API is the form of the input. The Chat API takes a list of messages as input, while the Generate API takes a free-form text string called a prompt.
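
The two input shapes can be compared side by side. This is a sketch only: `chat_input` and `generate_input` are hypothetical helpers that build just the input portion of a request body, using the field names from the examples on this page.

```python
def chat_input(question: str, system: str = "You are a helpful assistant.") -> dict:
    """Chat API input: a list of role-tagged messages."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ]
    }


def generate_input(question: str) -> dict:
    """Generate API input: one free-form prompt string."""
    return {"prompt": question}


print(chat_input("What is the largest animal?"))
print(generate_input("What is the largest animal?"))
```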

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "prompt": "What is the largest animal?",
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(api_key="cannot be empty", base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1")
# The Generate API maps to the Completions endpoint, which takes a prompt string.
client.completions.create(
    model="<MODEL_UID>",
    prompt="What is the largest animal?",
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
print(model.generate(
    prompt="What is the largest animal?",
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
))

{
  "id": "cmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "text_completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "text": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life.",
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
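
The two sample outputs on this page carry the generated text in different places: chat responses under `message.content`, generate responses under `text`. A small sketch that reads either shape by inspecting the `object` field follows. `output_text` is a hypothetical helper, and the sample dictionaries are trimmed stand-ins for real responses.

```python
def output_text(response: dict) -> str:
    """Return the generated text from a chat or text completion response."""
    choice = response["choices"][0]
    kind = response.get("object")
    if kind == "chat.completion":
        return choice["message"]["content"]
    if kind == "text_completion":
        return choice["text"]
    raise ValueError(f"unexpected object type: {kind!r}")


chat_response = {
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "The blue whale."}}],
}
generate_response = {
    "object": "text_completion",
    "choices": [{"index": 0, "text": "The blue whale."}],
}
print(output_text(chat_response))
print(output_text(generate_response))
```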

FAQ

Does Xinference provide ways to integrate LLMs with LangChain or LlamaIndex?

Yes, you can refer to the relevant sections in their respective official Xinference documentation. Here are the links: