DiffusionGemma at 1,000 tokens/sec on H100, Gemini business tools, and new agent consoles
TL;DR
Google and NVIDIA pushed faster local text generation, while a new video model appeared on Replicate and new agent tools launched on Product Hunt.
What shipped
On 10 June several vendors released production-ready AI components. NVIDIA optimized a new Google model for local GPUs and extended confidential computing to Apple workloads. Builders also gained fresh agent consoles and a reasoning video model.
Vendor launches
NVIDIA led with two technical releases around local inference and confidential compute while Google added business features to Gemini and expanded Chrome AI capabilities. The items focus on measurable speed gains and enterprise privacy rather than awareness campaigns.
- NVIDIA robotaxi safety: NVIDIA published guidance on embedding safety directly into robotaxi operating systems instead of adding it later, aimed at fleets already operating in multiple cities.
- NVIDIA DiffusionGemma optimization: NVIDIA tuned Google DeepMind DiffusionGemma to run on RTX GPUs and DGX systems, delivering parallel text blocks instead of token-by-token output for lower latency on single-user tasks.
- Gemini Chrome expansion: Google rolled Gemini AI features into Chrome for users across Latin America, Africa, and the Middle East, extending the same capabilities already available in other regions.
- Gemini business tools: Google added new Gemini app features that let entrepreneurs draft content, analyze data, and automate routine tasks without switching between multiple apps.
- NVIDIA Apple Private Cloud Compute: NVIDIA Confidential Computing now powers server-side inference for Apple Foundation Models on Google Cloud, extending Apple Private Cloud Compute beyond Apple data centers.
Hugging Face trending
Two Gemma variants and a new audio model from Google appeared in the trending list, alongside a research paper on multimodal training choices. The models support direct download and fine-tuning through the Hub.
- diffusiongemma-26B-A4B-it: Google released diffusiongemma-26B-A4B-it, an image-text-to-text model that generates text blocks in parallel and is available for inference on Hugging Face.
- Huihui-gemma-4-12B-it-abliterated: Huihui-ai published an abliterated Gemma variant that supports any-to-any tasks and runs via the transformers library on the Hub.
- magenta-realtime-2: Google placed magenta-realtime-2 on the Hub, a text-to-audio model built with its Magenta library for real-time audio generation.
- Multimodal learning phase diagram: A new paper examines when cross-modal alignment or prediction works best, giving practitioners a decision framework for multimodal projects.
Replicate new models
ray-3.2: Luma released ray-3.2 on Replicate, a video model that creates 5- or 10-second cinematic clips from text or images with native HDR and EXR output.
Product Hunt picks
Three tools launched, covering a local agent trust console, serverless agent hosting, and live speech translation.
- Timmy-TUI: Timmy-TUI offers a local-first console for managing agent trust and workspaces without sending data to remote servers.
- AGNT.Hub: AGNT.Hub lets users run always-on agents without provisioning or maintaining servers.
- Gemini 3.5 Live Translate: Google introduced Gemini 3.5 Live Translate, an audio model for real-time speech-to-speech translation.
Industry news
Spending data and model releases highlighted the cost of staying competitive while OpenAI signaled a possible 2027 IPO timeline. DiffusionGemma received independent coverage for its speed-quality trade-off.
- AI spending levels: Ramp data shows the heaviest AI users now spend about $7,500 per employee each month on tools and infrastructure.
- DiffusionGemma speed: Google DiffusionGemma reaches roughly 1,000 tokens per second on one H100, four times faster than standard autoregressive models, at the expense of output quality.
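To make the throughput figure concrete, here is a back-of-the-envelope sketch in Python. It uses only the two numbers quoted above (roughly 1,000 tokens per second and a four-times speedup). The 500-token response length is an illustrative assumption, and the implied autoregressive baseline is derived from those two numbers, not a reported measurement.

```python
# Back-of-the-envelope generation time from the figures quoted above.
DIFFUSION_TOKENS_PER_SEC = 1_000  # reported: roughly 1,000 tokens/sec on one H100
SPEEDUP = 4                       # reported: four times faster than autoregressive


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit num_tokens at a steady throughput (ignores prompt processing)."""
    return num_tokens / tokens_per_sec


# Implied autoregressive baseline (derived, not a reported number).
baseline_tokens_per_sec = DIFFUSION_TOKENS_PER_SEC / SPEEDUP

RESPONSE_TOKENS = 500  # assumed response length, for illustration only
diffusion_s = generation_seconds(RESPONSE_TOKENS, DIFFUSION_TOKENS_PER_SEC)
baseline_s = generation_seconds(RESPONSE_TOKENS, baseline_tokens_per_sec)

print(f"baseline:  {baseline_tokens_per_sec:.0f} tok/s, {baseline_s:.1f} s")
print(f"diffusion: {DIFFUSION_TOKENS_PER_SEC} tok/s, {diffusion_s:.1f} s")
```

Under these assumptions the gap is about 2.0 seconds versus 0.5 seconds per response. The calculation says nothing about the quality trade-off the coverage mentions, which has to be judged separately.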
Other
LangChain and Databricks released infrastructure updates for agent traces and custom model serving while HeyGen published a practical video tutorial.
- SmithDB full-text search: LangChain added inverted-index search over agent traces stored in object storage, achieving 400 ms median latency on nested JSON.
- LangChain headless agents: LangChain introduced client-side tool execution so agents can access browser APIs and local state without server round-trips.
- HeyGen AI video tutorial: HeyGen released a step-by-step guide for producing AI videos in 2026 using current tools and workflows.
- Databricks adaptive serving: Databricks launched a serving platform that automatically adjusts inference settings to match custom model requirements.
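For readers unfamiliar with the inverted-index idea behind the SmithDB item, here is a toy in-memory sketch in Python. It is not SmithDB's design: the function names, the trace shapes, and the AND-only query semantics are all assumptions made for illustration, and a real system built on object storage has to handle persistence, sharding, and ranking that this ignores.

```python
import re
from collections import defaultdict


def flatten(value, prefix=""):
    """Yield (path, text) pairs for every leaf in a nested JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{i}]")
    else:
        yield prefix, str(value)


def build_index(traces):
    """Map each lowercase token to the set of trace ids that contain it."""
    index = defaultdict(set)
    for trace_id, trace in traces.items():
        for _path, text in flatten(trace):
            for token in re.findall(r"\w+", text.lower()):
                index[token].add(trace_id)
    return index


def search(index, query):
    """Return ids of traces containing every token in the query (AND semantics)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical traces, invented for illustration.
traces = {
    "t1": {"steps": [{"tool": "search", "output": "rate limit exceeded"}]},
    "t2": {"steps": [{"tool": "browser", "output": "page loaded"}]},
}
index = build_index(traces)
print(search(index, "rate limit"))  # {'t1'}
```

The point of the structure is that a query touches only the posting sets for its tokens instead of scanning every trace, which is what makes full-text search over large trace stores practical.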
What this means for you
For Vibe Builders: You can now test DiffusionGemma locally on RTX hardware for fast parallel text output. AGNT.Hub lets you run always-on agents without managing servers, and Timmy-TUI gives you a local-first console for agent trust and workspaces. Gemini business features let you automate drafting and data tasks inside one app. The new Replicate video model gives quick access to cinematic clips with HDR export.
For Non-techies: Gemini tools added in the app and Chrome now handle everyday business writing and translation for users in more regions. Live Translate and the HeyGen tutorial make it simpler to create or localize short videos without hiring specialists. Ramp's spending data shows the heaviest AI users already pay about $7,500 per employee each month for these capabilities.
For Developers: NVIDIA's DiffusionGemma optimizations target RTX GPUs and DGX systems, and the reported figure of roughly 1,000 tokens per second on one H100 gives you a concrete number to test against. The Apple Private Cloud Compute integration shows NVIDIA Confidential Computing used for server-side inference. LangChain updates for client-side execution and SmithDB search let you move agent state and traces closer to the user. Benchmark DiffusionGemma against your current autoregressive stack, and check output quality, before adopting.
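One way to run that comparison is a small wall-clock harness like the Python sketch below. Everything in it is a stand-in: `current_stack` and `candidate_stack` are hypothetical placeholder functions that only sleep, and you would replace them with real calls into your existing stack and a local DiffusionGemma setup.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    generate: Callable[[str], str], prompts: List[str], runs: int = 3
) -> Dict[str, float]:
    """Time generate(prompt) end to end and summarize wall-clock latency in seconds."""
    samples = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            generate(prompt)
            samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "max_s": max(samples),
        "calls": len(samples),
    }


# Placeholder backends (hypothetical): swap in calls to your real stacks.
def current_stack(prompt: str) -> str:
    time.sleep(0.02)
    return prompt.upper()


def candidate_stack(prompt: str) -> str:
    time.sleep(0.005)
    return prompt.upper()


prompts = ["Summarize this ticket.", "Draft a release note."]
for name, backend in [("current", current_stack), ("candidate", candidate_stack)]:
    stats = measure_latency(backend, prompts)
    print(f"{name}: median {stats['median_s'] * 1000:.1f} ms over {stats['calls']} calls")
```

Using your own prompts matters more than the harness itself, since latency depends heavily on prompt and response length. Latency alone also does not capture the quality trade-off, so compare the outputs side by side as well.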
What to watch next
Track Hugging Face downloads for diffusiongemma-26B-A4B-it and any follow-up quality fixes from Google. Monitor Replicate run counts for ray-3.2 and new Product Hunt agent consoles for adoption signals. Check next earnings calls for updated AI infrastructure spend figures.
Harsh’s take
The day split between headline speed claims and quiet infrastructure releases. DiffusionGemma trades quality for tokens per second while most new agent tools still require manual trust and sandbox setup. Heavy monthly spend numbers from Ramp suggest the real constraint is integration cost rather than model access. Builders should run a single local DiffusionGemma workload on their own RTX card this week and measure end-to-end latency against their current stack before adding another hosted service.
by Harsh Desai
Sources
Vendor launches
- For Robotaxis, Safety Must Be Built In, Not Bolted On
- NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
- Google for Brazil 2026: Helping people make the most of AI
- We’re expanding Gemini in Chrome to users in Latin America, Africa, the Middle East and more.
- Save time and grow your business with new Gemini tools
- The Future Report: UK Teen Research Launch
- Guiding the AI generation: Why safeguarding and digital literacy must go hand-in-hand
- The Future Report: Why young people must help shape the future of AI
- Helping students and parents prepare for the final exams period
- NVIDIA Confidential Computing to Help Expand Apple’s Private Cloud Compute
Hugging Face trending
- diffusiongemma-26B-A4B-it by google trends on HuggingFace
- Huihui-gemma-4-12B-it-abliterated by huihui-ai trends on HuggingFace
- magenta-realtime-2 by google trends on HuggingFace
- When to Align, When to Predict: A Phase Diagram for Multimodal Learning
Replicate new models
Product Hunt picks
Industry news
- ‘AI-pilled’ firms spend $7,500 per employee each month on AI
- OpenAI's IPO slips as Altman tells staff to expect a public offering "within the next year"
- Google's new open model DiffusionGemma generates text from noise instead of word by word
- Fresh off bond sale, Amazon borrows $17.5B from banks as AI spending continues
Other
- Full Text Search in SmithDB: Designing an Inverted Index for Object Storage
- The Missing Link Between Agents and Applications
- EPICS in IEEE’s Awards Honor Outstanding Students and Faculty
- How to make AI videos in 2026 (a step-by-step tutorial), published June 10th, 2026
- AI Serving Platform That Adapts to Your Model