- Meta and UC Berkeley introduced Audio2Photoreal, a framework for generating full-bodied photorealistic avatars with gestures driven by the audio of a dyadic conversation [Details | GitHub].
- MyShell, along with researchers from MIT and Tsinghua University, introduced OpenVoice, an open-source voice cloning approach that is nearly instantaneous and provides granular control of voice style, including emotion, accent, rhythm, pauses, and intonation, using just a small audio clip [Details | Hugging Face].
- Suno and Nvidia presented Parakeet, a family of open-source speech recognition models that top the Open ASR Leaderboard. Parakeet models effectively prevent the generation of hallucinated transcripts and are robust to noisy audio. Available for commercial use under CC BY 4.0 [Details | Hugging Face].
- Researchers from Stanford University introduced Mobile ALOHA, an open-source robot hardware platform that can autonomously complete complex mobile manipulation tasks requiring whole-body control, such as cooking and serving shrimp, calling and taking an elevator, and storing a 3 lb pot in a two-door cabinet, with just 50 demonstrations [Details].
- Allen Institute for AI released Unified-IO 2 (open-source), the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action. The model is pre-trained from scratch on an extensive variety of multimodal data: 1 billion image-text pairs, 1 trillion text tokens, 180 million video clips, 130 million interleaved image-and-text documents, 3 million 3D assets, and 1 million agent trajectories [Details].
- Alibaba Research introduced DreamTalk, a diffusion-based audio-driven expressive talking head generation framework that can produce high-quality talking head videos across diverse speaking styles [Details | GitHub].
- OpenAI’s app store for GPTs will launch next week [Details].
- GitHub Copilot Chat, powered by GPT-4, is now generally available for both Visual Studio Code and Visual Studio, and is included in all GitHub Copilot plans alongside the original GitHub Copilot [Details].
- Microsoft Research presented a simple new method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps [Paper | Hugging Face].
- Google DeepMind introduced AutoRT, SARA-RT and RT-Trajectory to improve real-world robot data collection, speed, and generalization [Details].
- Salesforce Research presented MoonShot, a new video generation model that conditions simultaneously on multimodal inputs of image and text, demonstrating significant improvement on visual quality and temporal consistency compared to existing models. The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing. Models will be made public here [Details].
- Leonardo AI released Leonardo Motion for generating videos from images. Available to all users, paid and free [Link].
- JPMorgan AI Research presented DocLLM, a layout-aware generative language model for multimodal document understanding. It incorporates spatial layout information through the bounding-box coordinates of text tokens, typically obtained via optical character recognition (OCR), and does not rely on any vision encoder component [Details].
- Alibaba Research introduced Make-A-Character (Mach), a framework to create lifelike 3D avatars from text descriptions. Make-A-Character supports both English and Chinese prompts [Details | Hugging Face].
- Sony, Canon and Nikon are set to combat deepfakes with digital signature tech in future cameras [Details].
- Meta AI introduced Fairy, a versatile and efficient video-to-video synthesis framework that generates high-quality videos with remarkable speed. Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44× [Details].
- Apple quietly released an open source multimodal LLM, called Ferret, in October 2023 [Details].
- Australian researchers introduced a non-invasive AI system, called DeWave, that can turn silent thoughts into text while only requiring users to wear a snug-fitting cap [Details].
- Pika Labs text-to-video AI platform Pika 1.0 is now available to all and accessible via the web [Link].
- The New York Times sued OpenAI and Microsoft for copyright infringement [Details].
Source: AI Brews newsletter; you can subscribe here. It's free to join, sent only once a week with bite-sized news, learning resources and selected tools. Thank you!