云上岭南 Lingnan on the Cloud
Tsinghua unveils Vidu, China's first Sora-level text-to-video large model, to rival OpenAI
Source: Yangcheng Evening News, Lingnan on the Cloud | Published: 2024-04-29 00:04

At the 2024 ZGC Forum on April 27, Tsinghua University and Shengshu Technology unveiled Vidu, China's first video large model to offer long duration, high consistency, and high dynamism. Built on the team's original U-ViT architecture, which combines Diffusion and Transformer, Vidu can generate high-definition videos up to 16 seconds long at resolutions up to 1080p.

It is reported that Vidu can not only simulate the real physical world but also shows rich imagination, featuring multi-camera generation and high spatiotemporal consistency. It is the first video large model worldwide to achieve a major breakthrough since the release of Sora, with performance benchmarked against top international standards, and it continues to improve through rapid iteration.

At the forum, Zhu Jun, a professor at Tsinghua University and chief scientist of Shengshu Technology, explained that Vidu, like Sora, can generate high-quality videos up to 16 seconds long directly from text descriptions. Beyond the breakthrough in duration, Vidu also shows significant improvements in video quality, particularly in simulating the real physical world, multi-camera perspectives, high spatiotemporal consistency, and understanding of Chinese elements.

According to Zhu Jun, Vidu's rapid breakthrough stems from the team's long-term work and multiple original results in Bayesian machine learning and multimodal large models. Its core technology, the U-ViT architecture, was proposed by the team in September 2022 and developed entirely in-house. It predates DiT, the architecture adopted by Sora, making it the world's first architecture to integrate Diffusion and Transformer.

After Sora's release in February 2024, the team drew on its in-depth understanding of the U-ViT architecture and its accumulated engineering and data experience to advance key technologies in long-video representation and processing. Within just two months, it developed and launched the Vidu video large model, significantly improving video coherence and dynamism.

"The name Vidu not only sounds like 'Video' but also carries the meaning of 'We do,'" said Professor Zhu Jun. He added that a breakthrough in such a model is a multidimensional, cross-domain process that requires the deep integration of technology and industrial applications. He also expressed his hope to strengthen cooperation with upstream and downstream companies in the industry chain, as well as research institutions, to jointly advance video large models.

After Vidu's release, Professor Zhu Jun also posted on WeChat Moments: "Vidu, we do, we did, we do together! Thanks to my colleagues' day-and-night perseverance, the architecture born in our lab has blossomed and borne fruit."

Behind Vidu is Shengshu Technology, a star startup originating from Tsinghua University.

According to publicly available information, Shengshu Technology was founded in March 2023, with core members from Tsinghua University's Institute for Artificial Intelligence. It is dedicated to independently developing world-leading, controllable, general-purpose multimodal large models. The CEO, Tang Jiayu, holds bachelor's and master's degrees from Tsinghua University's Department of Computer Science and Technology, while the Chief Scientist, Zhu Jun, is Deputy Dean of Tsinghua's Institute for Artificial Intelligence. The CTO, Bao Fan, is a doctoral student in Tsinghua's Department of Computer Science and Technology and a member of Professor Zhu Jun's research group, with a long-term focus on diffusion model research.

Source: Yangcheng Evening News

对标OpenAI!清华团队国产原创Sora级视频大模型Vidu发布

在2024中关村论坛年会未来人工智能先锋论坛上,清华大学联合生数科技27日正式发布中国首个长时长、高一致性、高动态性视频大模型——Vidu。该模型采用团队原创的Diffusion与Transformer融合的架构U-ViT,支持一键生成长达16秒、分辨率高达1080P的高清视频内容。

据介绍,Vidu不仅能够模拟真实物理世界,还拥有丰富想象力,具备多镜头生成、时空一致性高等特点。Vidu是自Sora发布之后全球率先取得重大突破的视频大模型,性能全面对标国际顶尖水平,并在加速迭代提升中。

在当天的论坛上,清华大学教授、生数科技首席科学家朱军表示,与Sora一致,Vidu能够根据提供的文本描述直接生成长达16秒的高质量视频。除了在时长方面的突破外,Vidu在视频效果方面实现显著提升,主要体现在模拟真实物理世界、多镜头语言、时空一致性高、理解中国元素等方面。

朱军表示,Vidu的快速突破源自团队在贝叶斯机器学习和多模态大模型的长期积累和多项原创性成果。其核心技术U-ViT架构由团队于2022年9月提出,早于Sora采用的DiT架构,是全球首个Diffusion与Transformer融合的架构,完全由团队自主研发。

自今年2月Sora发布推出后,团队基于对U-ViT架构的深入理解以及长期积累的工程与数据经验,在短短两个月进一步突破长视频表示与处理关键技术,研发推出Vidu视频大模型,显著提升视频的连贯性与动态性。

“Vidu的命名不仅谐音‘Video’,也蕴含‘We do’的寓意。”朱军表示,模型的突破是一个多维度、跨领域的综合性过程,需要技术与产业应用的深度融合,希望与产业链上下游企业、研究机构加强合作,共同推动视频大模型进展。

在Vidu发布后,朱军也在微信朋友圈发声表示:“Vidu, we do, we did, we do together!感谢小伙伴们日以继夜的坚持,在实验室架构上开花结果。”

Vidu的背后,是一家来自清华的明星创业公司生数科技。

公开资料显示,生数科技成立于2023年3月,核心成员来自清华大学人工智能研究院,致力于自主研发世界领先的可控多模态通用大模型。公司的CEO本硕就读于清华大学计算机系的唐家渝,首席科学家由清华人工智能研究院副院长朱军担任,CTO鲍凡则是清华大学计算机系博士生、朱军教授的课题组成员,长期关注扩散模型领域研究。

来源|羊城晚报·羊城派综合中国新闻网、证券时报
译|洪婷