How to stream runnables
This guide assumes familiarity with the following concepts:
Streaming is critical in making applications based on LLMs feel responsive to end-users.
Important LangChain primitives like chat models, output parsers, prompts, retrievers, and agents implement the LangChain Runnable Interface.
This interface provides two general approaches to stream content:
- sync stream and async astream: a default implementation of streaming that streams the final output from the chain.
- async astream_events and async astream_log: these provide a way to stream both intermediate steps and final output from the chain.
Let's take a look at both approaches, and try to understand how to use them.
For a higher-level overview of streaming techniques in LangChain, see this section of the conceptual guide.
Using Stream
All Runnable objects implement a sync method called stream and an async variant called astream.
These methods are designed to stream the final output in chunks, yielding each chunk as soon as it is available.
Streaming is only possible if all steps in the program know how to process an input stream; i.e., process an input chunk one at a time, and yield a corresponding output chunk.
The complexity of this processing can vary, from straightforward tasks like emitting tokens produced by an LLM, to more challenging ones like streaming parts of JSON results before the entire JSON is complete.
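As a rough illustration of what "processing an input stream" means, the plain-Python sketch below chains generator steps. It does not use LangChain; `fake_token_source`, `upper_case`, and `join_all` are hypothetical names chosen for this example. The first two handle one chunk at a time, while `join_all` has to consume everything before it can emit anything.

```python
from typing import Iterable, Iterator


def fake_token_source() -> Iterator[str]:
    """Hypothetical stand-in for a model that emits tokens one by one."""
    for token in ["The", " sky", " is", " blue", "."]:
        yield token


def upper_case(chunks: Iterable[str]) -> Iterator[str]:
    """A streaming step: transforms each chunk as soon as it arrives."""
    for chunk in chunks:
        yield chunk.upper()


def join_all(chunks: Iterable[str]) -> Iterator[str]:
    """A non-streaming step: must see every chunk before emitting anything."""
    yield "".join(chunks)


streamed = list(upper_case(fake_token_source()))
blocked = list(join_all(upper_case(fake_token_source())))
print(streamed)  # one output chunk per input chunk
print(blocked)   # a single chunk, produced only at the very end
```

Adding `join_all` collapses the output to a single chunk, which is the behaviour a consumer sees when one step in a pipeline cannot stream.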
The best place to start exploring streaming is with the single most important component in LLM apps: the LLMs themselves!
LLMs and Chat Models
Large language models and their chat variants are the primary bottleneck in LLM-based apps.
Large language models can take several seconds to generate a complete response to a query. This is far slower than the ~200-300 ms threshold at which an application feels responsive to an end user.
The key strategy to make the application feel more responsive is to show intermediate progress; viz., to stream the output from the model token by token.
We will show examples of streaming using a chat model. Choose one from the options below:
- OpenAI
- Anthropic
- Azure
- Google
- Cohere
- NVIDIA
- FireworksAI
- Groq
- MistralAI
- TogetherAI
pip install -qU langchain-openai
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini")
pip install -qU langchain-anthropic
import getpass
import os
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()
from langchain_anthropic import ChatAnthropic
model = ChatAnthropic(model="claude-3-5-sonnet-20240620")
pip install -qU langchain-openai
import getpass
import os
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import AzureChatOpenAI
model = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
pip install -qU langchain-google-vertexai
import getpass
import os
os.environ["GOOGLE_API_KEY"] = getpass.getpass()
from langchain_google_vertexai import ChatVertexAI
model = ChatVertexAI(model="gemini-1.5-flash")
pip install -qU langchain-cohere
import getpass
import os
os.environ["COHERE_API_KEY"] = getpass.getpass()
from langchain_cohere import ChatCohere
model = ChatCohere(model="command-r-plus")
pip install -qU langchain-nvidia-ai-endpoints
import getpass
import os
os.environ["NVIDIA_API_KEY"] = getpass.getpass()
from langchain_nvidia_ai_endpoints import ChatNVIDIA
model = ChatNVIDIA(model="meta/llama3-70b-instruct")
pip install -qU langchain-fireworks
import getpass
import os
os.environ["FIREWORKS_API_KEY"] = getpass.getpass()
from langchain_fireworks import ChatFireworks
model = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-70b-instruct")
pip install -qU langchain-groq
import getpass
import os
os.environ["GROQ_API_KEY"] = getpass.getpass()
from langchain_groq import ChatGroq
model = ChatGroq(model="llama3-8b-8192")
pip install -qU langchain-mistralai
import getpass
import os
os.environ["MISTRAL_API_KEY"] = getpass.getpass()
from langchain_mistralai import ChatMistralAI
model = ChatMistralAI(model="mistral-large-latest")
pip install -qU langchain-openai
import getpass
import os
os.environ["TOGETHER_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
model = ChatOpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
Let's start with the sync stream API:
chunks = []
for chunk in model.stream("what color is the sky?"):
    chunks.append(chunk)
    print(chunk.content, end="|", flush=True)
The| sky| appears| blue| during| the| day|.|
Alternatively, if you're working in an async environment, you may consider using the async astream API:
chunks = []
async for chunk in model.astream("what color is the sky?"):
    chunks.append(chunk)
    print(chunk.content, end="|", flush=True)
The| sky| appears| blue| during| the| day|.|
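The `async for` loop above runs as written in a notebook, where top-level `await` is available. In a regular script it has to live inside a coroutine. The sketch below shows that pattern. `fake_astream` is a hypothetical stand-in for `model.astream`, used so the example runs without an API key, and it yields plain strings rather than message chunks.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_astream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for model.astream(prompt); yields text chunks."""
    for token in ["The", " sky", " appears", " blue", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def collect(prompt: str) -> List[str]:
    """Consume the async stream inside a coroutine and keep every chunk."""
    collected = []
    async for chunk in fake_astream(prompt):
        collected.append(chunk)
        print(chunk, end="|", flush=True)
    print()
    return collected


# asyncio.run starts an event loop, which a plain script does not have yet.
collected_chunks = asyncio.run(collect("what color is the sky?"))
```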
Let's inspect one of the chunks:
chunks[0]
AIMessageChunk(content='The', id='run-b36bea64-5511-4d7a-b6a3-a07b3db0c8e7')
We got back something called an AIMessageChunk. This chunk represents a part of an AIMessage.
Message chunks are additive by design -- one can simply add them up to get the state of the response so far!
chunks[0] + chunks[1] + chunks[2] + chunks[3] + chunks[4]
AIMessageChunk(content='The sky appears blue during', id='run-b36bea64-5511-4d7a-b6a3-a07b3db0c8e7')
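Adding chunks by hand gets tedious for longer responses. Because chunks support `+`, the same fold can be written with `functools.reduce`. The sketch below uses a hypothetical `TextChunk` dataclass as a stand-in for `AIMessageChunk` so that it runs without a model. With real chunks, `reduce(operator.add, chunks)` should behave in the same way.

```python
import operator
from dataclasses import dataclass
from functools import reduce


@dataclass
class TextChunk:
    """Hypothetical stand-in for AIMessageChunk: only carries text content."""

    content: str

    def __add__(self, other: "TextChunk") -> "TextChunk":
        # Concatenating content mirrors how message chunks accumulate.
        return TextChunk(content=self.content + other.content)


text_chunks = [TextChunk(t) for t in ["The", " sky", " appears", " blue", " during"]]

# Fold the chunks left to right into the response received so far.
so_far = reduce(operator.add, text_chunks)
print(so_far.content)
```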
Chains
Virtually all LLM applications involve more steps than just a call to a language model.
Let's build a simple chain using LangChain Expression Language (LCEL) that combines a prompt, model and a parser and verify that streaming works.
We will use StrOutputParser to parse the output from the model. This is a simple parser that extracts the content field from an AIMessageChunk, giving us the token returned by the model.
LCEL is a declarative way to specify a "program" by chaining together different LangChain primitives. Chains created using LCEL benefit from an automatic implementation of stream and astream, allowing streaming of the final output. In fact, chains created with LCEL implement the entire standard Runnable interface.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
parser = StrOutputParser()
chain = prompt | model | parser
async for chunk in chain.astream({"topic": "parrot"}):
    print(chunk, end="|", flush=True)
Here|'s| a| joke| about| a| par|rot|:|
A man| goes| to| a| pet| shop| to| buy| a| par|rot|.| The| shop| owner| shows| him| two| stunning| pa|rr|ots| with| beautiful| pl|um|age|.|
"|There|'s| a| talking| par|rot| an|d a| non|-|talking| par|rot|,"| the| owner| says|.| "|The| talking| par|rot| costs| $|100|,| an|d the| non|-|talking| par|rot| is| $|20|."|
The| man| says|,| "|I|'ll| take| the| non|-|talking| par|rot| at| $|20|."|
He| pays| an|d leaves| with| the| par|rot|.| As| he|'s| walking| down| the| street|,| the| par|rot| looks| up| at| him| an|d says|,| "|You| know|,| you| really| are| a| stupi|d man|!"|
The| man| is| stun|ne|d an|d looks| at| the| par|rot| in| dis|bel|ief|.| The| par|rot| continues|,| "|Yes|,| you| got| r|ippe|d off| big| time|!| I| can| talk| just| as| well| as| that| other| par|rot|,| an|d you| only| pai|d $|20| |for| me|!"|
Note that we're getting streaming output even though we're using parser at the end of the chain above. The parser operates on each streaming chunk individually. Many of the LCEL primitives also support this kind of transform-style passthrough streaming, which can be very convenient when constructing apps.
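To make the pass-through idea concrete, here is a toy model of `|` composition in plain Python. It is not how LCEL is implemented. `Step`, `fake_model`, and `content_parser` are hypothetical names for this sketch. Each step wraps a function from an iterator of chunks to an iterator of chunks, and `|` nests them so that chunks flow through one at a time.

```python
from typing import Callable, Iterable, Iterator


class Step:
    """Toy pipeline step wrapping an iterator-to-iterator function."""

    def __init__(self, transform: Callable[[Iterable], Iterator]):
        self.transform = transform

    def __or__(self, other: "Step") -> "Step":
        # Compose lazily: `other` pulls chunks from `self` on demand.
        return Step(lambda chunks: other.transform(self.transform(chunks)))

    def stream(self, chunks: Iterable) -> Iterator:
        return self.transform(chunks)


def fake_model(prompts: Iterable[str]) -> Iterator[dict]:
    """Hypothetical model: emits dict chunks shaped like {'content': ...}."""
    for prompt in prompts:
        for word in prompt.split():
            yield {"content": word}


def content_parser(chunks: Iterable[dict]) -> Iterator[str]:
    """Extracts the text from each chunk without waiting for the rest."""
    for chunk in chunks:
        yield chunk["content"]


toy_chain = Step(fake_model) | Step(content_parser)
tokens = list(toy_chain.stream(["tell me a joke"]))
print(tokens)
```

Because every step is a generator, the parser sees the first chunk as soon as the model produces it.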
Custom functions can be designed to return generators, which are able to operate on streams.
Certain runnables, like prompt templates and chat models, cannot process individual chunks and instead aggregate all previous steps. Such runnables can interrupt the streaming process.
The LangChain Expression Language allows you to separate the construction of a chain from the mode in which it is used (e.g., sync/async, batch/streaming etc.). If this is not relevant to what you're building, you can also rely on a standard imperative programming approach by calling invoke, batch or stream on each component individually, assigning the results to variables and then using them downstream as you see fit.
Working with Input Streams
What if you wanted to stream JSON from the output as it was being generated?
If you were to rely on json.loads to parse the partial JSON, the parsing would fail as the partial JSON wouldn't be valid JSON.
You'd likely be at a complete loss as to what to do and claim that it wasn't possible to stream JSON.
Well, it turns out there is a way to do it -- the parser needs to operate on the input stream, and attempt to "auto-complete" the partial JSON into a valid state.
Let's see such a parser in action to understand what this means.
from langchain_core.output_parsers import JsonOutputParser
chain = (
    model | JsonOutputParser()
)  # Due to a bug in older versions of Langchain, JsonOutputParser did not stream results from some models
async for text in chain.astream(
    "output a list of the countries france, spain and japan and their populations in JSON format. "
    'Use a dict with an outer key of "countries" which contains a list of countries. '
    "Each country should have the key `name` and `population`"
):
    print(text, flush=True)
{}
{'countries': []}
{'countries': [{}]}
{'countries': [{'name': ''}]}
{'countries': [{'name': 'France'}]}
{'countries': [{'name': 'France', 'population': 67}]}
{'countries': [{'name': 'France', 'population': 67413}]}
{'countries': [{'name': 'France', 'population': 67413000}]}
{'countries': [{'name': 'France', 'population': 67413000}, {}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': ''}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain'}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {'name': ''}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {'name': 'Japan'}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {'name': 'Japan', 'population': 125}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {'name': 'Japan', 'population': 125584}]}
{'countries': [{'name': 'France', 'population': 67413000}, {'name': 'Spain', 'population': 47351567}, {'name': 'Japan', 'population': 125584000}]}
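The "auto-complete" idea can be sketched in plain Python. This is a simplified illustration, not the algorithm JsonOutputParser uses. `complete_partial_json` and `stream_json` are hypothetical helpers that close any open string, array, or object and then try `json.loads`, skipping prefixes that still cannot be parsed.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def complete_partial_json(text: str) -> Optional[Any]:
    """Best effort: close open strings/brackets, then parse. None on failure."""
    closers = []  # closing characters still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = text + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. a key without a value yet


def stream_json(tokens: Iterable[str]) -> Iterator[Any]:
    """Yield each new parseable state as tokens accumulate."""
    buffer, last = "", None
    for token in tokens:
        buffer += token
        parsed = complete_partial_json(buffer)
        if parsed is not None and parsed != last:
            last = parsed
            yield parsed


# Hypothetical token boundaries, chosen to split a string and a number.
states = list(stream_json(['{"name"', ': "Fr', 'ance"', ', "population": 67', "413000}"]))
for state in states:
    print(state)
```

Note that a split number such as `67` is reported as a valid intermediate value, which matches the growing population figures in the output above.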
Now, let's break streaming. We'll use the previous example and append an extraction function at the end that extracts the country names from the finalized JSON.
Any steps in the chain that operate on finalized inputs rather than on input streams can break streaming functionality via stream or astream.
Later, we will discuss the astream_events API which streams results from intermediate steps. This API will stream results from intermediate steps even if the chain contains steps that only operate on finalized inputs.
from langchain_core.output_parsers import (
    JsonOutputParser,
)
# A function that operates on finalized inputs
# rather than on an input_stream
def _extract_country_names(inputs):
    """A function that does not operate on input streams and breaks streaming."""
    if not isinstance(inputs, dict):
        return ""
    if "countries" not in inputs:
        return ""
    countries = inputs["countries"]
    if not isinstance(countries, list):
        return ""
    country_names = [
        country.get("name") for country in countries if isinstance(country, dict)
    ]
    return country_names
chain = model | JsonOutputParser() | _extract_country_names
async for text in chain.astream(
    "output a list of the countries france, spain and japan and their populations in JSON format. "
    'Use a dict with an outer key of "countries" which contains a list of countries. '
    "Each country should have the key `name` and `population`"
):
    print(text, end="|", flush=True)
['France', 'Spain', 'Japan']|
Generator Functions
Let's fix the streaming using a generator function that can operate on the input stream.
A generator function (a function that uses yield) allows writing code that operates on input streams.
from langchain_core.output_parsers import JsonOutputParser
async def _extract_country_names_streaming(input_stream):
    """A function that operates on input streams."""
    country_names_so_far = set()
    async for input in input_stream:
        if not isinstance(input, dict):
            continue
        if "countries" not in input:
            continue
        countries = input["countries"]
        if not isinstance(countries, list):
            continue
        for country in countries:
            name = country.get("name")
            if not name:
                continue
            if name not in country_names_so_far:
                yield name
                country_names_so_far.add(name)
chain = model | JsonOutputParser() | _extract_country_names_streaming
async for text in chain.astream(
    "output a list of the countries france, spain and japan and their populations in JSON format. "
    'Use a dict with an outer key of "countries" which contains a list of countries. '
    "Each country should have the key `name` and `population`",
):
    print(text, end="|", flush=True)
France|Spain|Japan|
Because the code above is relying on JSON auto-completion, you may see partial names of countries (e.g., Sp and Spain), which is not what one would want for an extraction result!
We're focusing on streaming concepts, not necessarily the results of the chains.
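The partial-name caveat can be reproduced without calling a model. The sketch below feeds hand-written parser snapshots (hypothetical data in which a token boundary splits "Spain") into a sync version of the extraction logic. It then shows one possible mitigation: emitting a name only once a later key is present. That mitigation assumes the model writes `name` before `population`, which is an assumption of this example and not something the parser guarantees.

```python
from typing import Iterable, Iterator


def extract_names(snapshots: Iterable[dict]) -> Iterator[str]:
    """Sync version of the extraction above: emits any name not seen before."""
    seen = set()
    for snapshot in snapshots:
        for country in snapshot.get("countries", []):
            name = country.get("name")
            if name and name not in seen:
                seen.add(name)
                yield name


def extract_complete_names(snapshots: Iterable[dict]) -> Iterator[str]:
    """Waits until 'population' appears, so the name can no longer grow."""
    seen = set()
    for snapshot in snapshots:
        for country in snapshot.get("countries", []):
            name = country.get("name")
            if name and "population" in country and name not in seen:
                seen.add(name)
                yield name


# Hypothetical snapshots; the second one contains the partial name "Sp".
france = {"name": "France", "population": 67413000}
snapshots = [
    {"countries": [{"name": "France", "population": 67}]},
    {"countries": [france, {"name": "Sp"}]},
    {"countries": [france, {"name": "Spain"}]},
    {"countries": [france, {"name": "Spain", "population": 47}]},
]

naive = list(extract_names(snapshots))
careful = list(extract_complete_names(snapshots))
print(naive)
print(careful)
```

The trade-off is latency: the careful version reports each country slightly later, once the next key has started streaming.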
Non-streaming components
Some built-in components like Retrievers do not offer any streaming. What happens if we try to stream them? 🤨
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
vectorstore = FAISS.from_texts(
    ["harrison worked at kensho", "harrison likes spicy food"],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()
chunks = [chunk for chunk in retriever.stream("where did harrison work?")]
chunks
[[Document(page_content='harrison worked at kensho'),
Document(page_content='harrison likes spicy food')]]
Stream just yielded the final result from that component.
This is OK 🥹! Not all components have to implement streaming -- in some cases streaming is either unnecessary, difficult or just doesn't make sense.
An LCEL chain constructed using non-streaming components will still be able to stream in a lot of cases, with streaming of partial output starting after the last non-streaming step in the chain.
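The ordering described here, where a blocking step finishes first and token-level output starts afterwards, can be sketched in plain Python. `fake_retriever`, `fake_model`, and `pipeline` are hypothetical stand-ins and do not call LangChain. The `log` list records the order in which things happen.

```python
from typing import Iterator, List

log: List[str] = []


def fake_retriever(query: str) -> List[str]:
    """Hypothetical non-streaming step: returns all documents at once."""
    log.append("retriever finished")
    return ["harrison worked at kensho", "harrison likes spicy food"]


def fake_model(context: List[str]) -> Iterator[str]:
    """Hypothetical streaming step: emits one token at a time."""
    for token in ["Harrison", " worked", " at", " Kensho", "."]:
        log.append(f"token {token!r}")
        yield token


def pipeline(query: str) -> Iterator[str]:
    docs = fake_retriever(query)  # blocks until retrieval is complete
    yield from fake_model(docs)   # streams from here on


answer = "".join(pipeline("where did harrison work?"))
print(log)
print(answer)
```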
retrieval_chain = (
    {
        "context": retriever.with_config(run_name="Docs"),
        "question": RunnablePassthrough(),
    }
    | prompt
    | model
    | StrOutputParser()
)
for chunk in retrieval_chain.stream(
    "Where did harrison work? " "Write 3 made up sentences about this place."
):
    print(chunk, end="|", flush=True)
Base|d on| the| given| context|,| Harrison| worke|d at| K|ens|ho|.|
Here| are| |3| |made| up| sentences| about| this| place|:|
1|.| K|ens|ho| was| a| cutting|-|edge| technology| company| known| for| its| innovative| solutions| in| artificial| intelligence| an|d data| analytics|.|
2|.| The| modern| office| space| at| K|ens|ho| feature|d open| floor| plans|,| collaborative| work|sp|aces|,| an|d a| vib|rant| atmosphere| that| fos|tere|d creativity| an|d team|work|.|
3|.| With| its| prime| location| in| the| heart| of| the| city|,| K|ens|ho| attracte|d top| talent| from| aroun|d the| worl|d,| creating| a| diverse| an|d dynamic| work| environment|.|
Now that we've seen how stream and astream work, let's venture into the world of streaming events. 🏎️
Using Stream Events
Event Streaming is a beta API. This API may change a bit based on feedback.
This guide demonstrates the V2 API and requires langchain-core >= 0.2. For the V1 API compatible with older versions of LangChain, see here.
import langchain_core
langchain_core.__version__
For the astream_events API to work properly:
- Use async throughout the code to the extent possible (e.g., async tools etc)
- Propagate callbacks if defining custom functions / runnables
- Whenever using runnables without LCEL, make sure to call .astream() on LLMs rather than .ainvoke to force the LLM to stream tokens.
- Let us know if anything doesn't work as expected! :)
Event Reference
Below is a reference table that shows some events that might be emitted by the various Runnable objects.
When streaming is implemented properly, the inputs to a runnable will not be known until after the input stream has been entirely consumed. This means that inputs will often be included only for end events rather than for start events.
event | name | chunk | input | output |
---|---|---|---|---|
on_chat_model_start | [model name] | | {"messages": [[SystemMessage, HumanMessage]]} | |
on_chat_model_stream | [model name] | AIMessageChunk(content="hello") | | |
on_chat_model_end | [model name] | | {"messages": [[SystemMessage, HumanMessage]]} | AIMessageChunk(content="hello world") |
on_llm_start | [model name] | | {'input': 'hello'} | |
on_llm_stream | [model name] | 'Hello' | | |
on_llm_end | [model name] | | | 'Hello human!' |
on_chain_start | format_docs | | | |
on_chain_stream | format_docs | "hello world!, goodbye world!" | | |
on_chain_end | format_docs | | [Document(...)] | "hello world!, goodbye world!" |
on_tool_start | some_tool | | {"x": 1, "y": "2"} | |
on_tool_end | some_tool | | | {"x": 1, "y": "2"} |
on_retriever_start | [retriever name] | | {"query": "hello"} | |
on_retriever_end | [retriever name] | | {"query": "hello"} | [Document(...), ..] |
on_prompt_start | [template_name] | | {"question": "hello"} | |
on_prompt_end | [template_name] | | {"question": "hello"} | ChatPromptValue(messages: [SystemMessage, ...]) |
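A consumer of these events typically filters on the event type and reads the matching entry from the event's data. The sketch below shows that pattern against a hypothetical `fake_event_stream`, a stand-in for `chain.astream_events(..., version="v2")`. The dicts here only mimic the `event`, `name`, and `data` fields. Real events carry additional fields, and the chunk is typically a message chunk object, not a plain string.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_event_stream() -> AsyncIterator[Dict]:
    """Hypothetical stand-in for an astream_events call; yields event dicts."""
    events = [
        {"event": "on_chat_model_start", "name": "model", "data": {"input": "hello"}},
        {"event": "on_chat_model_stream", "name": "model", "data": {"chunk": "Hel"}},
        {"event": "on_chat_model_stream", "name": "model", "data": {"chunk": "lo"}},
        {"event": "on_chat_model_end", "name": "model", "data": {"output": "Hello"}},
    ]
    for event in events:
        yield event


async def collect_stream_chunks() -> List[str]:
    """Keep only the chunks carried by chat model stream events."""
    collected = []
    async for event in fake_event_stream():
        # Only *_stream events carry a "chunk" entry in their data.
        if event["event"] == "on_chat_model_stream":
            collected.append(event["data"]["chunk"])
    return collected


event_chunks = asyncio.run(collect_stream_chunks())
print(event_chunks)
```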