MakeAIGuide
Advanced 60 min read Updated Jan 5, 2026

Build Multimodal Video Scripts with Make.com

Transform text, audio & video into viral scripts using Google Gemini, Replicate & DeepSeek R1. Automate multimodal content creation for social media.

Overview

This tutorial builds an automated workflow for multimodal AI content creation.

Whether your material is text, audio, or video, the workflow transforms it into a stylized script in five steps:

  1. Material Input - Input text or audio/video links in Notion
  2. Smart Recognition - Auto-detect material type and select processing pipeline
  3. Content Extraction - Gemini analyzes video / Replicate transcribes audio
  4. Segment Processing - DataStore intelligently segments long texts
  5. Style Generation - DeepSeek R1 generates social media-optimized scripts
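
The type-detection step can be pictured as a simple lookup from the Notion "Material Type" value to a processing branch. The sketch below is only an illustration in Python: the pipeline names are hypothetical labels, and in Make itself this routing would typically be done with a router and filters rather than code.

```python
from typing import Dict

# Hypothetical branch names; the article only states that the material type
# is detected and a matching processing pipeline is selected.
PIPELINES: Dict[str, str] = {
    "Text": "pass_through",
    "Audio": "replicate_transcription",
    "Video": "gemini_video_analysis",
}


def select_pipeline(material_type: str) -> str:
    """Return the branch name for a Notion 'Material Type' value."""
    key = material_type.strip().capitalize()
    if key not in PIPELINES:
        raise ValueError(f"Unsupported material type: {material_type!r}")
    return PIPELINES[key]
```

Raising on unknown values, instead of silently defaulting to text, keeps a mistyped select option from producing a script out of a raw URL.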

Figure: Workflow supporting text, audio, and video inputs


Core Decision Factors

When choosing multimodal content generation solutions, consider:

  • Multimodal Support - Whether it can handle text, audio, video, and other materials
  • Content Quality - Logical coherence and stylization level of generated content
  • Long Text Capability - Ability to process thousands of words
  • Cost-effectiveness - Balance between API costs and output value
  • Ease of Use - Complexity of workflow setup and daily operations

Technical Specifications

| Specification | Value | Notes |
| --- | --- | --- |
| Core Platform | Make.com | Workflow orchestration |
| Database | Notion | Material management & result storage |
| Video Analysis | Google Gemini Pro | Flash model deep analysis |
| Material Processing | OpenAI GPT-4o Mini | Initial extraction & processing |
| Data Storage | Make DataStore | Long text segmentation storage |
| Script Generation | Volcano Engine DeepSeek R1 | Stylized writing |
| Audio Transcription | Replicate | Fast long audio transcription |
| Generation Cost | ~$1/million tokens | Can generate 300-400k words |
| Video Analysis Time | 150-200 seconds | Wait after video upload |
| Script Duration | 280-300 words/minute | 3000 words supports 10-minute video |
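
The duration and cost figures above reduce to simple arithmetic. The sketch below works them through in Python. The default of 3 tokens per word is an assumption derived from the stated figures (1 million tokens yielding roughly 300-400k words), not a measured tokenizer ratio.

```python
def estimate_duration_minutes(word_count: int, words_per_minute: int = 300) -> float:
    """Spoken length of a script at the given speaking rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


def estimate_cost_usd(
    word_count: int,
    tokens_per_word: float = 3.0,          # assumption, see lead-in
    usd_per_million_tokens: float = 1.0,   # "~$1/million tokens" from the table
) -> float:
    """Rough generation cost for a script of the given length."""
    return word_count * tokens_per_word / 1_000_000 * usd_per_million_tokens


print(estimate_duration_minutes(3000))                  # 10.0
print(round(estimate_duration_minutes(3000, 280), 1))   # 10.7
print(round(estimate_cost_usd(3000), 4))                # 0.009
```

Under these assumptions a 3000-word script runs about 10 minutes and costs under one cent to generate.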

Prerequisites

Before starting, ensure you have:

  • Make.com account (free registration)
  • Notion account and database
  • Google Gemini API key
  • Volcano Engine API key (ByteDance's cloud platform, used here to access DeepSeek R1)
  • Replicate account (audio transcription)
  • Open-source direct-link cloud storage (for audio/video file storage)

Notion Database Structure

Create a material management database with these fields:

  • Material Type (Select) - Text/Audio/Video
  • Material Content (Text) - Text content or audio/video link
  • Status (Select) - Pending/Started/Completed
  • Writing Style (Text) - Expected script style
  • Additional Requirements (Text) - Other customization needs
  • Generated Result (Text) - AI-generated script
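
For readers who think in code, the field list above can be modeled as a small record type. This is a sketch only: the Python attribute names are illustrative, and the real schema lives in Notion.

```python
from dataclasses import dataclass

MATERIAL_TYPES = ("Text", "Audio", "Video")
STATUSES = ("Pending", "Started", "Completed")


@dataclass
class MaterialRecord:
    """One row of the material management database (illustrative names)."""

    material_type: str                 # Select: Text/Audio/Video
    material_content: str              # Text content or audio/video link
    status: str = "Pending"            # Select: Pending/Started/Completed
    writing_style: str = ""            # Expected script style
    additional_requirements: str = ""  # Other customization needs
    generated_result: str = ""         # Filled in by the workflow

    def __post_init__(self) -> None:
        # Reject values that are not valid options of the two Select fields.
        if self.material_type not in MATERIAL_TYPES:
            raise ValueError(f"Unknown material type: {self.material_type!r}")
        if self.status not in STATUSES:
            raise ValueError(f"Unknown status: {self.status!r}")
```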

Multimodal Processing Architecture

Text Material Processing

Directly pass text content to generation module:

Process:

  1. Fetch text material from Notion
  2. Use GPT-4o Mini for initial processing
  3. Pass to DeepSeek R1 to generate script

Audio Material Processing

Use Replicate for audio transcription:

Configuration Points:

  • Handles long audio (tens of minutes)
  • Excellent Chinese and English recognition
  • More stable than the official OpenAI module

Process:

  1. Get audio direct link URL
  2. Replicate transcribes to text
  3. Pass to generation module
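
Both the audio and video branches start from a direct link, so it can help to check that a URL points at a media file and not at a web page before spending API calls on it. The following is a minimal sketch, assuming the MP3/MP4 formats mentioned in the FAQ; the helper name is hypothetical.

```python
from urllib.parse import urlparse

# Formats named in the article's FAQ; extend as needed.
MEDIA_EXTENSIONS = (".mp3", ".mp4")


def is_direct_media_link(url: str) -> bool:
    """True if the URL is http(s) and its path ends in a known media extension."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Only the path is inspected, so query strings such as ?token=... are ignored.
    return parsed.path.lower().endswith(MEDIA_EXTENSIONS)
```

A check like this catches the common mistake of pasting a watch-page URL where a file URL is expected.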

Video Material Processing

Figure: Multimodal workflow module connections on the Make platform

Use Google Gemini for deep video analysis:

Configuration Points:

  • Upload video file to Gemini
  • Wait 150-200 seconds for analysis
  • Output precise transcript

Process:

  1. Download video and get direct link
  2. Upload to Google Gemini
  3. Deep analysis to extract content
  4. Pass to generation module
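
The fixed 150-200 second wait can also be expressed as polling with a timeout, which returns early when analysis finishes sooner. The sketch below is generic Python, not Make or Gemini code: `is_ready` is a hypothetical callable standing in for whatever status check is available, and the 200-second default mirrors the upper bound quoted above.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 200.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the resource became ready, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a Make scenario the closest equivalent is a Sleep module followed by a status check; the code form simply makes the timeout explicit.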

Long Text Segmentation Processing

This is the core mechanism for working around the 1000-2000 word limit on a single large-model output:

Figure: DataStore data storage and flow diagram

Implementation:

  1. Smart Segmentation - Divide long materials into 500- or 1000-word segments
  2. DataStore Storage - Save generated content as context
  3. Repeater Loop - Generate and accumulate segment by segment
  4. Differentiated Prompts - Use different strategies for first and subsequent segments
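
The segmentation step can be sketched as a fixed-size word split. This is a simplified stand-in for what the article calls smart segmentation: it splits on whitespace, which suits English, and would need a character-based rule for Chinese text. The function name is hypothetical.

```python
from typing import List


def split_into_segments(text: str, max_words: int = 500) -> List[str]:
    """Split text into consecutive segments of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

The number of segments returned is what the Repeater would loop over, one generation call per segment.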

First Segment Prompt:

Based on the following material, generate the opening part of a script.
Requirements: Conversational social media style, capture audience attention...

Subsequent Segment Prompt:

Continue generating script content, maintaining coherence with previous text.
Previously generated content: {{previous_content}}
Current material segment: {{current_segment}}
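
The two prompts and the accumulate-as-you-go loop can be sketched together as follows. This is an illustration, not the Make scenario itself: `generate` is a hypothetical callable standing in for the DeepSeek R1 call, the accumulated string plays the role of the DataStore record, and the first prompt is abbreviated to its opening line.

```python
from typing import Callable, Iterable


def build_prompt(segment: str, previous_content: str = "") -> str:
    """Use the opening prompt for the first segment, the continuation prompt after."""
    if not previous_content:
        return (
            "Based on the following material, generate the opening part of a script.\n"
            f"Material: {segment}"
        )
    return (
        "Continue generating script content, maintaining coherence with previous text.\n"
        f"Previously generated content: {previous_content}\n"
        f"Current material segment: {segment}"
    )


def generate_script(segments: Iterable[str], generate: Callable[[str], str]) -> str:
    """Generate segment by segment, feeding earlier output back in as context."""
    accumulated = ""
    for segment in segments:
        part = generate(build_prompt(segment, accumulated))
        accumulated = f"{accumulated}\n{part}".strip()
    return accumulated
```

Passing the full accumulated text keeps the script coherent but grows the prompt with every segment; for very long materials, passing only the most recent part is a possible trade-off.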

Stylized Writing

Figure: AI-generated script content and segmented layout

Volcano Engine DeepSeek R1’s stylization capabilities:

Features:

  • Supports separating thinking process from content
  • Transforms serious content into conversational expression
  • Adapts to finance, film, parenting, and multiple domains
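
Separating the thinking process from the content matters because only the script should be written back to Notion. How the reasoning is delivered depends on the provider: some APIs return it in a separate response field, while raw DeepSeek R1 output wraps it in `<think>` tags. The sketch below assumes the tagged form, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Assumes reasoning is wrapped in <think>...</think>, as in raw R1 output.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, script); reasoning is empty if no think block is found."""
    match = THINK_PATTERN.search(raw)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    script = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, script
```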

Style Transformation Examples:

  • Economic theory → “Cycle of seasons” metaphors
  • Technical jargon → Vivid analogies and storytelling
  • Formal language → Social media conversational hooks

Figure: News text transformed into a social media style script


Gotchas

Common issues during setup:

  1. Manual Preprocessing - Video downloading and direct-link generation require manual work

  2. Learning Curve - Make workflow setup and logic understanding require time investment

  3. Over-stylization - DeepSeek R1 may add elements not in original text; requires human review

  4. Notion Permission Config - Each new database must be separately authorized for Make access

  5. File Size Limits - Make free tier has small file download limits; large videos need manual upload

  6. Content Expansion Risk - With limited material, AI expansion may introduce non-original elements


Use Cases

  • Content Creators - Short-video and live-stream professionals who need scripts quickly
  • Content Repurposers - Users transforming audio/video materials into text content
  • Style Differentiation Seekers - Creators who want to turn serious content into a conversational style
  • Efficiency Pursuers - Creators willing to invest learning time in order to produce at scale

May Not Suit

  • Users completely unwilling to learn new tools
  • Users with extremely high accuracy requirements who are unwilling to review the output
  • Users resistant to API configuration and third-party tool integration

FAQ

What material types are supported?

Three types are supported: text, audio (MP3), and video (MP4). Materials can come from YouTube, social platforms, or any video-sharing site.

How to handle long text output limits?

The workflow uses Make DataStore and Repeater modules for intelligent segmentation, with different prompts for first and subsequent segments to ensure context coherence.

Is generation cost high?

Volcano Engine DeepSeek R1 costs about $1 per million tokens, which is enough to generate 300,000-400,000 words of scripts. This makes multiple iterations extremely cost-effective.

How long does video analysis take?

Google Gemini video analysis takes approximately 150-200 seconds, depending on video length and complexity.


Next Steps

After mastering basics, you can try:

  • Adding more writing style templates
  • Integrating auto-download tools to reduce manual steps
  • Adding multi-platform one-click distribution
  • Building script quality scoring and filtering mechanisms

Questions? Feel free to leave comments!

About the author

Alex Chen

Automation Expert & Technical Writer

Alex Chen is a certified Make.com expert with 5+ years of experience building enterprise automation solutions. Former software engineer at tech startups, now dedicated to helping businesses leverage AI and no-code tools for efficiency.

Credentials

  • Make.com Certified Partner
  • Google Cloud Certified
  • 500+ Automations Built
  • Former Software Engineer