Kinara Ara-2 Processor Hits 12 Tokens Per Second Running 7 Billion Parameter LLMs

Generative AI capabilities of this leading-edge AI processor are demonstrated in new video available on YouTube

Santa Clara, CA — August 8, 2024 — Kinara™, Inc., today announced that its low-power, low-cost AI processor, the Kinara Ara-2, has met the heavy demands of accurately and efficiently running Generative AI applications such as Large Language Models (LLMs) at the edge. Specifically, the company is demonstrating the Qwen1.5-7B model running on a single Ara-2 processor at 12 output tokens per second. This capability, shown in the new online video entitled ‘Kinara Ara-2 Masters Local LLM Chatbot’, is an important accomplishment because running LLMs, and Generative AI in general, at the edge ensures data privacy and reduces latency by removing the need for Internet access. Furthermore, with Generative AI processing at the edge, the user pays only a one-time cost for the integrated hardware in their personal computer and avoids ongoing cloud usage costs. Generative AI processing at the edge also expands the functionality of PCs, enabling users to perform document summarization, transcription, translation, and other time-saving tasks.
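
The 12 tokens-per-second figure translates directly into user-visible response times. A back-of-the-envelope sketch (the 200-token reply length is an illustrative assumption, not a figure from the release):

```python
# Rough chat latency implied by a given decode rate.
# 12 tokens/s is the demonstrated Ara-2 rate; the reply length is assumed.

def response_time_s(tokens: int, tokens_per_s: float) -> float:
    """Seconds to generate a reply of the given length at a steady decode rate."""
    return tokens / tokens_per_s

per_token_ms = 1000 / 12            # ~83 ms per output token at 12 tokens/s
reply_s = response_time_s(200, 12)  # a 200-token reply takes ~16.7 s
print(round(per_token_ms, 1), round(reply_s, 1))
```

At the 15 tokens/s target mentioned below, the same 200-token reply would drop to roughly 13 seconds.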

Qwen, available as open source under the Apache 2.0 license and backed by Alibaba Cloud (Tongyi Qianwen), is comparable to LLaMA 2 and comprises a series of models across diverse sizes (e.g., 0.5B, 4B, 7B, 14B, 32B, 72B) and capabilities including chat, language understanding, reasoning, math, and coding. From a Natural Language Processing (NLP) perspective, Qwen can be used to process the commands a user issues in day-to-day operations on their computer. And unlike the voice command processing typically available in cars, Qwen and other Generative AI chat models are multilingual, accurate, and not restricted to specific text sequences.

Beyond generating responses to simple and complex text prompts at 12 tokens per second, effectively running Qwen1.5-7B or any other LLM at the edge requires the Kinara Ara-2 to support three high-level features: 1) the ability to aggressively quantize LLMs and other generative AI workloads while still delivering near floating-point accuracy; 2) extreme flexibility and the capability to run all LLM operators end-to-end without relying on the host (this includes all model layers and activation functions); and 3) sufficient memory size and bandwidth to effectively handle these extremely large neural networks.
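
To illustrate the first requirement, the sketch below shows the simplest form of the idea: symmetric per-tensor int8 weight quantization. This is purely illustrative and is not Kinara's actual quantization scheme, which the release describes only at a high level:

```python
# Minimal sketch of symmetric int8 quantization (illustrative only;
# not Kinara's actual scheme). Floats are mapped to [-127, 127] with a
# single scale, and the reconstruction error stays within half a step.

def quantize_int8(weights):
    """Map float weights to int8 codes with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(max_err, 6))
```

Production quantizers refine this with per-channel scales, asymmetric ranges, and calibration data, which is how "near floating-point accuracy" is typically preserved at 8 bits and below.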

“Running any LLM on a low-power edge AI processor is quite a feat but hitting 12 output tokens per second on a 7B parameter LLM is a major accomplishment,” said Wajahat Qadeer, Kinara’s chief architect. “However, the best is yet to come, as we are on target to hit 15 output tokens per second by applying advanced software techniques while leaving the model itself unmodified.”

As existing and new LLMs become available on Hugging Face and elsewhere, Kinara can quickly bring them up by leveraging its innovative software and architectural flexibility, executing these models with floating-point accuracy while offering the low power dissipation of an integer processor. And beyond Generative AI applications, Ara-2 is well suited to handling 16-32+ video streams fed into edge servers for high-end object detection, recognition, and tracking, using its advanced compute engines to process higher-resolution images quickly and with high accuracy. Ara-2 is available as a stand-alone device, a USB module, an M.2 module, and a PCIe card featuring multiple Ara-2 processors.

Interested parties are invited to contact Kinara directly to see for themselves the Qwen1.5-7B and other LLM applications running on Ara-2.

***Ends***

About Kinara

Kinara provides the world’s most power- and price-efficient Edge AI inference platform supported by comprehensive AI software development tools. Enabling Generative AI and smart applications across retail, medical, industry 4.0, automotive, and smart cities, Kinara’s AI processors, modules, and software can be found at the heart of the AI industry’s most exciting and influential innovations. Kinara envisions a world of exceptional customer experiences, better manufacturing efficiency, and greater safety for all. Learn more at https://kinara.ai/ 

 

Napier Partnership:

Nesbert Musuwo, Account Manager, Napier B2B

Email Address: Nesbert@Napierb2b.com

 

All registered trademarks and other trademarks belong to their respective owners.

Kinara Edge AI processor tackles the monstrous compute demands of Generative AI and transformer-based models

New Ara-2 second-generation processor, built around the same flexible and efficient architecture as Ara-1, boasts huge increase in performance and a significant boost in performance/Watt and performance/$

 

Los Altos, CA — December 12, 2023 — Kinara™, Inc., today launched the Kinara Ara-2 Edge AI processor, powering edge servers and laptops with high-performance, cost-effective, and energy-efficient inference to run applications such as video analytics, Large Language Models (LLMs), Latent Diffusion Models (LDMs), and other Generative AI models. The Ara-2 is also ideal for edge applications running both traditional AI models and state-of-the-art models with transformer-based architectures. With a substantially enhanced feature set and a 5-8x improvement in performance over the first-generation Ara-1 processor, Kinara’s Ara-2 combines real-time responsiveness with high throughput, merging its proven latency-optimized design with well-balanced on-chip memories and high off-chip bandwidth to execute very large models with extremely low latency.

LLMs and Generative AI in general have become incredibly popular, but most of the associated applications run on GPUs in data centers and are burdened with high latency, high cost, and questionable privacy. To overcome these limitations and put the compute literally in the hands of the user, Kinara’s Ara-2 simplifies the transition to the edge with support for Generative AI models with tens of billions of parameters. Furthermore, to seamlessly facilitate the migration away from expensive GPUs for a wide variety of AI models, the compute engines in Ara-2 and the associated software development kit (SDK) are specifically designed to support high-accuracy quantization, a dynamically moderated host runtime, and direct FP32 support.

“With Ara-2 added to our family of processors, we can better provide customers with performance and cost options to meet their requirements. For example, Ara-1 is the right solution for smart cameras as well as edge AI appliances with 2-8 video streams, whereas Ara-2 is strongly suited for handling 16-32+ video streams fed into edge servers, as well as Generative AI workloads on laptops, and even high-end cameras,” said Ravi Annavajjhala, Kinara’s CEO. “The Ara-2 enables better object detection, recognition, and tracking by using its advanced compute engines to process higher-resolution images more quickly and with significantly higher accuracy. And as an example of its capabilities for processing Generative AI models, Ara-2 can hit 10 seconds per image for Stable Diffusion and tens of tokens/sec for LLaMA-7B.”

In October, Ampere welcomed Kinara into the AI Platform Alliance, whose primary goals are to reduce system complexity, promote better collaboration and openness in AI solutions, and ultimately deliver better total performance and greater power and cost efficiency than GPUs. Ampere’s Chief Evangelist Sean Varley said, “The performance and feature set of Kinara’s Ara-2 is a step in the right direction to help us bring better AI alternatives to the industry than the GPU-based status quo.”

The Ara-2 also offers secure boot, encrypted memory access, and a secure host interface to enable enterprise AI deployments with even greater security. Kinara also supports Ara-2 with a comprehensive SDK that includes a model compiler and compute-unit scheduler, flexible quantization options that include the integrated Kinara quantizer as well as support for pre-quantized PyTorch and TFLite models, a load balancer for multi-chip systems, and a dynamically moderated host runtime.

Ara-2 is available as a stand-alone device, a USB module, an M.2 module, and a PCIe card featuring multiple Ara-2 processors. Kinara will show a live demo with Ara-2 at CES. Contact Kinara to set up your appointment in our hospitality suite at the Venetian Hotel on January 9-11, 2024.

***Ends***

About Kinara

Kinara provides the world’s most power- and price-efficient Edge AI inference platform supported by comprehensive AI software development tools. Enabling smart applications across retail, medical, industry 4.0, automotive, and smart cities, Kinara’s AI processors, modules, and software can be found at the heart of the AI industry’s most exciting and influential innovations. Kinara envisions a world of exceptional customer experiences, better manufacturing efficiency, and greater safety for all. Learn more at https://kinara.ai/ 

 

Kinara Contact 

Napier Partnership:

Nesbert Musuwo, Account Manager, Napier B2B

Email Address: Nesbert@Napierb2b.com

All registered trademarks and other trademarks belong to their respective owners.