Nous Research, a private applied research group known for publishing open source work in the LLM domain, has released a lightweight vision and language model called Nous Hermes 2 Vision.
Available through Hugging Face, the open source model is based on the company’s previous OpenHermes-2.5-Mistral-7B model and adds vision capabilities, including the ability to analyze images and extract text information from visual content.
However, shortly after launch, users discovered that the model was not performing as expected. The resulting technical glitches led the project to be renamed Hermes 2 Vision Alpha. The company is expected to follow up with a more stable version that provides similar benefits with fewer bugs.
Nous Hermes 2 Vision Alpha
The Nous vision model, named after Hermes, the Greek messenger of the gods, is designed to be a system that navigates “the complex intricacies of human discourse with heavenly finesse.” It leverages image data provided by a user and combines that visual information with its learnings to provide detailed responses in natural language.
For example, a user could submit an image and have the model detail different aspects of what it contains. The co-founder of Nous, who goes by Teknium on X, shared a test screenshot in which the LLM analyzed a photo of a burger and determined whether eating it would be unhealthy, and why.
While ChatGPT, based on GPT-4V, also lets users prompt with images, Nous’ open source offering differentiates itself with two key improvements.
First, unlike traditional approaches that rely on substantial 3B vision encoders, Nous Hermes 2 Vision leverages SigLIP-400M. This not only streamlines the architecture of the model, making it lighter than its counterparts, but also helps improve performance on vision and language tasks.
Second, it has been trained on a custom dataset enriched with function calls. This allows users to supply the model with a function schema and receive structured calls in response, letting it interact with external tools.
“This distinctive addition transforms Nous-Hermes-2-Vision into a vision-language action model. Developers now have a versatile tool at their disposal, ready to create a host of ingenious automations,” the company writes on the model’s Hugging Face page.
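Nous has not published a fixed schema format in this announcement, but the workflow it describes can be sketched roughly as follows: the user supplies a function schema, the model replies with a structured call, and the application parses and executes it. The schema, the `get_nutrition_info` function and the simulated model reply below are hypothetical illustrations, not Nous’ actual interface.

```python
import json

# Hypothetical function schema a user might give the model
# (illustrative only -- not Nous' documented format).
schema = {
    "name": "get_nutrition_info",
    "description": "Look up rough nutrition facts for a food item seen in an image",
    "parameters": {
        "food_item": {"type": "string"},
    },
}

def get_nutrition_info(food_item: str) -> dict:
    """Stand-in implementation of the tool the model would call."""
    facts = {"cheeseburger": {"calories": 550, "healthy": False}}
    return facts.get(food_item.lower(), {})

# Simulated model output: a structured call matching the schema above.
model_reply = '{"name": "get_nutrition_info", "arguments": {"food_item": "cheeseburger"}}'

call = json.loads(model_reply)
result = None
if call["name"] == schema["name"]:
    result = get_nutrition_info(**call["arguments"])

print(result)  # {'calories': 550, 'healthy': False}
```

The key point is that the model’s reply is machine-parseable rather than free-form prose, which is what makes the “vision-language action” automations the company describes possible.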
Other datasets used to train the model were LVIS-INSTRUCT4V, ShareGPT4V, and OpenHermes-2.5 conversations.
Despite these differentiators, problems persist at this stage
While Nous’ vision-language model is available for research and development, its initial use has shown that it is far from perfect.
Shortly after the launch, the co-founder published a post acknowledging that something was wrong with the model: it was hallucinating heavily, spamming EOS tokens and exhibiting other glitches. The model was subsequently relabeled an alpha version.
“I see people talk about ‘hallucinations’ and yes, it’s pretty bad. I also knew this because the LLM it is based on is an uncensored model. I will make an updated version of this at the end of the month to resolve these issues,” Quan Nguyen, the researcher leading AI efforts at Nous, wrote on X.
Questions submitted by VentureBeat regarding the issues remained unanswered at the time of writing.
That said, Nguyen noted in another post that the function calling capability still works well if the user defines a good schema. He also said he will launch a dedicated model for function calling if user feedback is good enough.
So far, Nous Research has released 41 open source models with different architectures and capabilities as part of its Hermes, YaRN, Capybara, Puffin and Obsidian series.