AI — weekly megathread!

News provided by aibrews.com

  1. Google introduced Gemini, a family of multimodal models built from the ground up to reason seamlessly across text, images, video, audio, and code. It comes in Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device, memory-constrained use cases [Details | Technical Report].
  2. With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding). Gemini Pro is available in Bard (English, in 170 countries). Gemini Ultra will come to Bard early next year. Pixel 8 Pro will be able to run Gemini Nano.
  3. Controversy arose regarding Google’s demo video (below), as many viewers took it to be ‘fake’ [Article on TechCrunch]. Google shared a link in the video description to their blog post titled ‘How it’s Made: Interacting with Gemini through multimodal prompting’ [Link].
  4. Meta AI announced Purple Llama — an umbrella project that, over time, will bring together tools and evaluations to help the community build responsibly with open generative AI models [Details].
    1. The initial release includes CyberSec Eval, a set of cybersecurity safety evaluation benchmarks for LLMs; and Llama Guard, a safety classifier for input/output filtering that is optimized for ease of deployment.
    2. Components within the Purple Llama project will be licensed permissively, enabling both research and commercial usage.
  5. Nexusflow released NexusRaven V2, an open-source 13B function calling LLM that surpasses GPT-4 by up to 7% in function calling success rates. NexusRaven V2 was instruction-tuned from Meta’s CodeLlama-13B, without using proprietary LLM generated data. It is commercially permissive for both community developers and enterprises [Details].
  6. Meta introduced Audiobox, a new foundation research model for audio generation. Audiobox can generate voices and sound effects using a combination of voice inputs and natural language text prompts. Audiobox is the first model to enable dual input (voice prompts and text description prompts) for freeform voice restyling. Users can combine an audio voice input with a text style prompt to synthesize speech of that voice in any environment (e.g., “in a cathedral”) or any emotion (e.g., “speaks sadly and slowly”) [Details].
  7. Playground released Playground v2, a new open-source diffusion-based text-to-image generative model with commercial use permitted. Early benchmarks show that images from Playground v2 are preferred 2.5x more often than those from Stable Diffusion XL [Details].
  8. Stability AI released StableLM Zephyr 3B, a new 3-billion-parameter chat model preference-tuned for instruction following and Q&A-type tasks. The model extends the pre-existing StableLM 3B-4E1T model and is inspired by HuggingFace’s Zephyr 7B model [Details].
  9. Apple machine learning research released MLX, an open-source PyTorch-style machine learning framework designed specifically for Apple silicon (see the short usage sketch after this list) [Details | Examples].
  10. Google presented AlphaCode 2, a competitive coding model finetuned from Gemini, which excels at solving competitive programming problems that go beyond coding to involve complex math and theoretical computer science [Details].
  11. Alibaba Cloud released Qwen-72B (trained on 3T tokens with a 32K context window) and Qwen-1.8B (which can process 2K-length text with only 3GB of GPU memory), each in Base, Chat, and Quantized versions [Details].
  12. Microsoft Research introduced LLMLingua, a prompt-compression method that identifies and removes unimportant tokens from prompts. Although the token-level compressed prompts can be hard for humans to read, they remain highly effective for LLMs. LLMLingua has been integrated into LlamaIndex (see the usage sketch after this list) [Details].
  13. Scale AI introduced its Automotive Foundation Model, AFM-1, a SOTA language-grounded perception model for autonomous vehicles [Details].
  14. Microsoft launched Seeing AI, a free app for low-vision and blind users, on Android after launching earlier on iOS, with updated features and new languages [Details].
  15. Anthropic released a new dataset for measuring discrimination across 70 different potential applications of language models, including loan applications, visa approvals, and security clearances [Paper | Hugging Face].
  16. IBM and Meta launched the AI Alliance, an international community of 50+ leading organizations across industry, academia and research to collaborate for the advancement of open, safe, responsible AI [Details].
  17. Researchers from ByteDance released MagicAnimate, a diffusion-based framework for human image animation that significantly improves upon existing methods. You can try the demo here [Details].
  18. Institute for Intelligent Computing, Alibaba Group introduced Animate Anyone, a method of transforming character images into animated videos controlled by desired pose sequences [Details].
  19. Microsoft Research announced MatterGen, a generative model that enables broad property-guided materials design by directly generating novel materials with desired properties, similar to how DALL·E 3 tackles image generation [Details].
  20. Meta is testing more than 20 new ways generative AI can improve users’ experiences across Facebook, Instagram, Messenger, and WhatsApp. Imagine, Meta’s text-to-image generation tool powered by its Emu model, has now been released as a stand-alone web app [Details].
  21. Runway is partnering with Getty Images to launch a new video model, Runway Getty Images Model (RGM), which enterprise customers can fine-tune using their own proprietary datasets [Details].
  22. Meta announced Ego-Exo4D: a foundational dataset and benchmark suite focused on skilled human activities to support research on video learning and multimodal perception. It's the largest ever public dataset of its kind [Details].
  23. X begins rolling out Grok, its ‘rebellious’ chatbot, to subscribers [Details].
  24. OpenAI delays launch of custom GPT store to early 2024 [Details].
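
For the MLX item above, here is a minimal sketch of the array API. It assumes the mlx Python package is installed on an Apple silicon Mac and only illustrates the NumPy/PyTorch-style interface and MLX’s lazy evaluation, not a real model; see the official examples repo for full models.

```python
# Minimal MLX sketch (assumes `pip install mlx` on an Apple silicon Mac).
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])   # arrays look and behave much like NumPy arrays
b = mx.ones(3)
c = a * b + 2.0                 # operations are recorded lazily
mx.eval(c)                      # computation runs only when evaluation is forced
print(c)
```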
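For the LLMLingua item above, a hedged usage sketch. It assumes the llmlingua Python package and the PromptCompressor.compress_prompt interface shown in the project README; argument names and defaults may differ between releases, and long_prompt here is just a placeholder for real content.

```python
# Hedged LLMLingua sketch: compress a long prompt before sending it to an LLM.
from llmlingua import PromptCompressor

long_prompt = "..."  # placeholder for a long retrieved context or chat history

compressor = PromptCompressor()          # loads a small default compressor model
result = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question using the context.",
    question="What are the key findings?",
    target_token=300,                    # rough budget for the compressed prompt
)
print(result["compressed_prompt"])       # shorter prompt to pass to the target LLM
```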

🔦 Weekly Spotlight

  1. 17 Predictions for 2024: From RAG to Riches to Beatlemania and National Treasures [Link].
  2. Self-Operating Computer Framework: A framework to enable multimodal models to operate a computer. Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective (an illustrative sketch of this loop follows below) [Link].
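
An illustrative sketch of the observe-decide-act loop described above. This is not the project’s actual API: capture_screenshot, ask_multimodal_model, and perform are hypothetical stubs standing in for real screen capture, a multimodal model call, and OS-level mouse/keyboard dispatch.

```python
# Hypothetical sketch of a screen -> model -> action loop; all helpers are stubs.
import time
from typing import Dict

def capture_screenshot() -> bytes:
    """Stub: grab the current screen contents (e.g. via a screenshot library)."""
    return b""

def ask_multimodal_model(objective: str, screenshot: bytes) -> Dict:
    """Stub: send the objective and screenshot to a multimodal model and parse
    its reply into a structured action, e.g. {"type": "click", "x": 120, "y": 340}."""
    return {"type": "done"}

def perform(action: Dict) -> None:
    """Stub: dispatch the chosen mouse/keyboard action to the operating system."""

def run_agent(objective: str, max_steps: int = 20) -> None:
    # Observe -> decide -> act until the model reports the objective is complete.
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        action = ask_multimodal_model(objective, screenshot)
        if action.get("type") == "done":
            break
        perform(action)
        time.sleep(1)  # let the UI settle before the next observation

run_agent("Open a browser and search for the weather in Paris")
```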