If you have ever worried about AI chatbots failing to deliver proper text two or three hours into a conversation, you can breathe a sigh of relief: researchers have found an effective way to keep chatbots talking for hours on end without breaking down.
One of the most common problems with AI chatbots is how they degrade after holding a conversation for a long time. Sluggish or repetitive replies are a deal breaker for many people who want a more engaging interaction, and slowdowns and crashes are among the issues that have made these services fall flat.
Time itself is the weakness: after a while, the quality of the messages deteriorates, which can end in a hard crash. Keeping a conversation going for hours without the chatbot slowing down or shutting down once seemed impossible, but not any longer.
A team of researchers from MIT has come up with a solution to this performance problem. Their fix resolves the long-running issues by allowing chatbots to hold a conversation non-stop without any sign of a crash or a hint of a slowdown.
The upgrade makes AI chatbots stronger and more reliable for people who want engaging conversations with their chatbot apps without having to hit reboot. Chatbots can now carry a conversation for hours without serving up dull replies or crashing outright.
The method consists of a tweak to the key-value (KV) cache, which acts as the model's conversation memory; many large language models carry such a cache inside.
Before the tweak, once the cache held more data than its capacity, the earliest entries were immediately discarded to make room for newer ones. This constant removing and adding of data caused the model's performance to deteriorate, or the model to crash altogether.
The problem lies in those first few entries, which have to be evicted so newer ones can register: the KV cache has limited storage, and reaching that limit is what triggers the failures.
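As a rough sketch of that failure mode (the names and the tiny capacity here are illustrative, not the researchers' implementation): a bounded cache that evicts its oldest entries once full inevitably discards the conversation's first tokens.

```python
from collections import deque

# Illustrative bounded KV cache: each entry stands in for one token's
# cached key/value data. Capacity is tiny here for demonstration.
def run_fifo_cache(tokens, capacity=4):
    cache = deque()
    for tok in tokens:
        if len(cache) >= capacity:
            cache.popleft()  # evict the oldest entry to make room
        cache.append(tok)
    return list(cache)

# After eight tokens, the first four are gone, including the very
# first tokens the model turns out to depend on.
print(run_fifo_cache(["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]))
# -> ['t5', 't6', 't7', 't8']
```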
The tweak that solved the problem keeps those first few data points in memory. Retaining the earliest entries lets the chatbot carry on the conversation without crashing or losing performance.
The tweak is known as StreamingLLM, and it extends operation well beyond the old approach, allowing a conversation to stretch past 4 million words. It also runs more than 22 times faster than an alternative method that avoids crashing by recomputing parts of the past conversation.
Here is a short version of what happens behind the scenes of any AI chatbot you see popping up on your feed. Large language models encode data into tokens, and a component known as the attention mechanism uses those tokens to generate new text.
The chatbot reads the user's text, generates new text, and stores the recent tokens in a memory called the KV cache so it can use that data later on.
As the cache grows, the attention map (which records the relationship between every pair of stored tokens) grows as well. This inflation slows the model down and eventually ends in a hard crash.
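To make that growth concrete (a back-of-the-envelope sketch, not the actual model's memory accounting): the attention map holds one score for every pair of cached tokens, so its size grows with the square of the cache length.

```python
# One attention score per pair of cached tokens means the map has
# cache_len ** 2 entries: doubling the conversation quadruples the map.
def attention_map_entries(cache_len: int) -> int:
    return cache_len * cache_len

for n in (1_000, 10_000, 100_000):
    print(n, attention_map_entries(n))
#   1,000 tokens ->          1,000,000 scores
#  10,000 tokens ->        100,000,000 scores
# 100,000 tokens ->     10,000,000,000 scores
```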
To work around the problem, researchers tried a sliding cache, which bumps out the oldest tokens to make room for newer ones. But erasing the old tokens only traded one problem for another: as soon as the earliest tokens were evicted, the quality of the newly generated words dropped sharply.
That left the researchers with a question: why is the first token so important to generating new ones?
The answer is an attention sink attached to the first token: the model dumps spare attention onto that token, so it must remain visible to the rest of the tokens in the cache for the model to work. The solution was to keep this attention sink in the cache while the rest of it slides.
Care was also taken to ensure each token's positional encoding reflects its slot in the cache rather than its place in the full text. For optimal performance, StreamingLLM adds four attention-sink tokens at the start of the sliding cache.
The outcome was a success: with the sinks retained, the model stays in a stable performance zone. A model trained with a dedicated sink token from the start can even remain stable with only one attention sink in the cache rather than four.
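The eviction policy can be sketched like this (a toy illustration only: the four-sink count mirrors the article, but the function name and tiny window size are my own, and real implementations operate on key/value tensors rather than token strings). The cache pins the first few "sink" tokens, slides a window over the recent ones, and numbers positions by slot inside the cache rather than by place in the original text:

```python
def streaming_cache(tokens, num_sinks=4, window=6):
    """Keep the first `num_sinks` tokens plus the most recent `window` tokens."""
    if len(tokens) <= num_sinks + window:
        kept = list(tokens)
    else:
        kept = tokens[:num_sinks] + tokens[-window:]
    # Positional encoding is assigned by slot inside the cache,
    # not by the token's position in the full conversation.
    return [(pos, tok) for pos, tok in enumerate(kept)]

tokens = [f"t{i}" for i in range(1, 16)]  # 15 tokens of conversation
print(streaming_cache(tokens))
# sink tokens t1..t4 survive; recent tokens t10..t15 follow at cache slots 4..9
```

However long the conversation runs, the cache never exceeds `num_sinks + window` entries, so the attention map stays a fixed size while the sinks the model depends on are never evicted.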
Yang You, a presidential young professor of computer science at the National University of Singapore, praised StreamingLLM:
“The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance, even when processing texts up to 4 million tokens in length,” he said.
He continued: “This capability is not just impressive; it’s transformative, enabling StreamingLLM to be applied across a wide array of AI applications. The performance and versatility of StreamingLLM mark it as a highly promising technology, poised to revolutionize how we approach AI-driven generation applications.”
One limitation still revolves around the cache: the model cannot recall words that are no longer stored inside it. Newer methods, currently under investigation, promise to retrieve evicted or discarded tokens from earlier in the conversation.
A model that can remember previous conversations is therefore something we might see in the future.
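In spirit, such a retrieval method might look like this (purely hypothetical: the article does not describe an implementation, and every name below is invented for illustration). Instead of destroying evicted tokens, the cache archives them so a later lookup can bring relevant ones back:

```python
# Hypothetical sketch only, not the researchers' method: archive evicted
# tokens so they can be searched later, rather than discarding them.
class ArchivingCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.live = []     # tokens still held in the KV cache
        self.archive = []  # tokens evicted from the cache

    def add(self, token: str) -> None:
        if len(self.live) >= self.capacity:
            self.archive.append(self.live.pop(0))  # archive, don't destroy
        self.live.append(token)

    def recall(self, query: str) -> list:
        # Toy "retrieval": substring match over the archive; a real system
        # would use something like embedding similarity search.
        return [t for t in self.archive if query in t]

cache = ArchivingCache(capacity=3)
for t in ["hello", "world", "foo", "bar"]:
    cache.add(t)
print(cache.recall("hell"))  # the evicted "hello" is still findable
```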
For now, StreamingLLM will let future chatbots hold long conversations without choking on the accumulated data, and users will not have to resort to rebooting the entire app to keep a long conversation going.
StreamingLLM also lets an app keep handling tasks such as copywriting, editing, and code generation over much longer sessions. Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student who wrote a paper on StreamingLLM, describes the tweak this way:
“Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” said Xiao.
Source: MIT News
Stay updated with the world of tech, subscribe to Tecxology.