Build Multimodal Video Scripts with Make.com

Overview

This tutorial builds a Make.com automation that turns multimodal source material into AI-generated video scripts.

Whether your material is text, audio, or video, it can be automatically transformed into stylized scripts:

Material Input - Input text or audio/video links in Notion
Smart Recognition - Auto-detect material type and select processing pipeline
Content Extraction - Gemini analyzes video / Replicate transcribes audio
Segment Processing - DataStore intelligently segments long texts
Style Generation - DeepSeek R1 generates social media-optimized scripts
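
The Smart Recognition step amounts to a lookup from material type to processing pipeline. A minimal sketch of that routing logic follows; the pipeline names are hypothetical labels for illustration, and in Make itself this would be a router with one branch per type rather than Python code.

```python
from typing import Dict

# Hypothetical pipeline labels; they only mirror the steps listed above.
PIPELINES: Dict[str, str] = {
    "Text": "gpt4o_mini_preprocess",
    "Audio": "replicate_transcribe",
    "Video": "gemini_video_analysis",
}


def select_pipeline(material_type: str) -> str:
    """Map a material type (Text/Audio/Video) to a pipeline label."""
    key = material_type.strip().capitalize()
    if key not in PIPELINES:
        raise ValueError(f"Unsupported material type: {material_type!r}")
    return PIPELINES[key]
```

Rejecting unknown types up front keeps a mislabeled Notion row from silently falling into the wrong branch.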

Figure: Multimodal workflow supporting text, audio, and video inputs.

Core Decision Factors

When choosing multimodal content generation solutions, consider:

Multimodal Support - Can it handle text, audio, video, and other materials
Content Quality - Logical coherence and stylization level of generated content
Long Text Capability - Ability to process thousands of words
Cost-effectiveness - Balance between API costs and output value
Ease of Use - Complexity of workflow setup and daily operations

Technical Specifications

Specification	Value	Notes
Core Platform	Make.com	Workflow orchestration
Database	Notion	Material management & result storage
Video Analysis	Google Gemini Pro	Flash model deep analysis
Material Processing	OpenAI GPT-4o Mini	Initial extraction & processing
Data Storage	Make DataStore	Long text segmentation storage
Script Generation	Volcano Engine DeepSeek R1	Stylized writing
Audio Transcription	Replicate	Fast long audio transcription
Generation Cost	~$1/million tokens	Can generate 300-400k words
Video Analysis Time	150-200 seconds	Wait after video upload
Speaking Rate	280-300 words/minute	A 3,000-word script fills roughly a 10-minute video
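
The speaking-rate row is simple division: script length divided by words per minute. A small sketch, assuming the 280-300 words/minute range from the table (the function name is illustrative only):

```python
def estimate_minutes(word_count: int, words_per_minute: int = 300) -> float:
    """Estimate spoken duration of a script in minutes."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


fast = estimate_minutes(3000, 300)  # upper end of the speaking-rate range
slow = estimate_minutes(3000, 280)  # lower end of the speaking-rate range
print(f"3000 words: {fast:.1f} to {slow:.1f} minutes")
```

At 300 words per minute a 3,000-word script runs 10 minutes; at 280 it runs a little longer.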

Prerequisites

Before starting, ensure you have:

Make.com account (free registration)
Notion account and database
Google Gemini API key
Volcano Engine API key (ByteDance DeepSeek R1)
Replicate account (audio transcription)
Open-source direct-link cloud storage (for audio/video file storage)

Notion Database Structure

Create a material management database with these fields:

Material Type (Select) - Text/Audio/Video
Material Content (Text) - Text content or audio/video link
Status (Select) - Pending/Started/Completed
Writing Style (Text) - Expected script style
Additional Requirements (Text) - Other customization needs
Generated Result (Text) - AI-generated script
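
To make the field list concrete, here is one way to model a database row in code. This is only a sketch for reasoning about the schema: the class and attribute names are assumptions derived from the field names above, not part of Notion's or Make's API.

```python
from dataclasses import dataclass
from enum import Enum


class MaterialType(Enum):
    TEXT = "Text"
    AUDIO = "Audio"
    VIDEO = "Video"


class Status(Enum):
    PENDING = "Pending"
    STARTED = "Started"
    COMPLETED = "Completed"


@dataclass
class MaterialRecord:
    """One row of the material management database (illustrative model)."""
    material_type: MaterialType
    material_content: str            # text body, or a direct audio/video link
    status: Status = Status.PENDING
    writing_style: str = ""
    additional_requirements: str = ""
    generated_result: str = ""

    def complete(self, script: str) -> None:
        """Store the generated script and mark the row as finished."""
        self.generated_result = script
        self.status = Status.COMPLETED
```

Keeping Material Type and Status as fixed option sets matches the Select fields and avoids free-text values the workflow cannot route.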

Multimodal Processing Architecture

Text Material Processing

Text content passes directly to the generation module:

Process:

Fetch text material from Notion
Use GPT-4o Mini for initial processing
Pass to DeepSeek R1 to generate script

Audio Material Processing

Use Replicate for audio transcription:

Configuration Points:

Supports long audio (tens of minutes) processing
Excellent Chinese and English recognition
More stable than the official OpenAI module

Process:

Get audio direct link URL
Replicate transcribes to text
Pass to generation module

Video Material Processing

Figure: Complete workflow architecture, showing the multimodal workflow's module connections in the Make platform.

Use Google Gemini for deep video analysis:

Configuration Points:

Upload video file to Gemini
Wait 150-200 seconds for analysis
Output precise transcript

Process:

Download video and get direct link
Upload to Google Gemini
Deep analysis to extract content
Pass to generation module
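
The workflow above waits a fixed 150-200 seconds after upload. An alternative, shown here as a sketch and not as the tutorial's actual setup, is to poll until the file is ready and give up after the same time budget. The `is_ready` callable is a hypothetical stand-in for whatever status check your video service exposes.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 200.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False  # caller decides whether to retry or fail the run
        sleep(interval_s)
```

Polling can finish early for short videos, while the timeout keeps the upper bound of 200 seconds from the specification table.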

Long Text Segmentation Processing

This is the core mechanism for working around the 1000-2000 word limit on a single large-model output:

Figure: DataStore segmentation principle (data storage and flow diagram).

Implementation:

Smart Segmentation - Divide long materials into 500/1000-word segments
DataStore Storage - Save generated content as context
Repeater Loop - Generate and accumulate segment by segment
Differentiated Prompts - Use different strategies for first and subsequent segments
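
The four steps above can be sketched as a split-then-accumulate loop. In this sketch a plain string plays the role of the DataStore, the `for` loop plays the role of the Repeater, and `generate` is a hypothetical callable standing in for the model call. Splitting on whitespace is an assumption that suits English; Chinese text would need character-based splitting.

```python
from typing import Callable, List


def split_into_segments(text: str, words_per_segment: int = 500) -> List[str]:
    """Split text into consecutive segments of at most words_per_segment words."""
    if words_per_segment <= 0:
        raise ValueError("words_per_segment must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def generate_script(
    material: str,
    generate: Callable[[str, str], str],
    words_per_segment: int = 500,
) -> str:
    """Generate a script segment by segment, feeding prior output back as context.

    generate(previous_content, current_segment) returns the next script part.
    """
    accumulated = ""  # stands in for the DataStore record
    for segment in split_into_segments(material, words_per_segment):
        part = generate(accumulated, segment)
        accumulated = f"{accumulated}\n{part}" if accumulated else part
    return accumulated
```

Because `generate` receives an empty string for the first segment, it can tell the opening apart from later segments and pick the matching prompt.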

First Segment Prompt:

Based on the following material, generate the opening part of a script.
Requirements: Conversational social media style, capture audience attention...

Subsequent Segment Prompt:

Continue generating script content, maintaining coherence with previous text.
Previously generated content: {{previous_content}}
Current material segment: {{current_segment}}
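
A sketch of how the two prompts might be selected and filled in. The `{{previous_content}}` and `{{current_segment}}` placeholders come from the prompt above; the `{{requirements}}` placeholder and the "Material:" line in the first template are assumptions added for illustration, since the original first prompt is abbreviated.

```python
import re
from typing import Dict

FIRST_TEMPLATE = (
    "Based on the following material, generate the opening part of a script.\n"
    "Requirements: {{requirements}}\n"  # hypothetical placeholder
    "Material: {{current_segment}}"     # hypothetical line
)

SUBSEQUENT_TEMPLATE = (
    "Continue generating script content, maintaining coherence with previous text.\n"
    "Previously generated content: {{previous_content}}\n"
    "Current material segment: {{current_segment}}"
)


def render(template: str, values: Dict[str, str]) -> str:
    """Replace each {{name}} placeholder with values[name]."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value for placeholder {name!r}")
        return values[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def build_prompt(previous_content: str, current_segment: str, requirements: str) -> str:
    """Use the first-segment prompt when nothing has been generated yet."""
    template = SUBSEQUENT_TEMPLATE if previous_content else FIRST_TEMPLATE
    return render(template, {
        "previous_content": previous_content,
        "current_segment": current_segment,
        "requirements": requirements,
    })
```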

Stylized Writing

Figure: Generated results example showing AI-generated script content and segmented layout.

Volcano Engine DeepSeek R1’s stylization capabilities:

Features:

Supports separating thinking process from content
Transforms serious content into conversational expression
Adapts to finance, film, parenting, and multiple domains
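
On separating the thinking process from the content: some DeepSeek R1 deployments return the reasoning in a separate response field, while raw model output wraps it in `<think>` tags. The sketch below assumes the tagged form; check which form your provider returns before relying on it.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, content) from model output that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(raw))
    content = THINK_BLOCK.sub("", raw).strip()
    return reasoning, content
```

Only the content part should be written back to the Generated Result field; the reasoning is useful mainly for debugging prompts.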

Style Transformation Examples:

Economic theory → “Cycle of seasons” metaphors
Technical jargon → Vivid analogies and storytelling
Formal language → Social media conversational hooks

Figure: Style transformation case, with news text transformed into a social-media-style script.

Gotchas

Common issues during setup:

Manual Preprocessing - Video downloading and direct-link generation require manual work
Learning Curve - Make workflow setup and logic understanding require time investment
Over-stylization - DeepSeek R1 may add elements not in original text; requires human review
Notion Permission Config - Each new database must be separately authorized for Make access
File Size Limits - Make free tier has small file download limits; large videos need manual upload
Content Expansion Risk - With limited material, AI expansion may introduce non-original elements

Use Cases

Recommended Users

Content Creators - Short video, live stream professionals needing efficient scripts
Content Repurposers - Users transforming audio/video materials into text content
Style Differentiation Seekers - Creators wanting to transform serious content into conversational style
Efficiency Pursuers - Users willing to invest learning time up front to produce scripts at scale

May Not Suit

Users completely unwilling to learn new tools
Users with extremely high accuracy requirements unwilling to review
Users resistant to API configuration and third-party tool integration

FAQ

What material types are supported?

The workflow supports three types: text, audio (MP3), and video (MP4). Materials can come from YouTube, social platforms, or any video-sharing site.

How to handle long text output limits?

The workflow uses Make DataStore and Repeater modules for intelligent segmentation, with different prompts for first and subsequent segments to ensure context coherence.

Is generation cost high?

DeepSeek R1 on Volcano Engine costs about $1 per million tokens, which is enough to generate 300,000-400,000 words of script. That makes multiple iterations very cost-effective.
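
As a rough worked example of that figure, the sketch below converts the quoted yield of 300,000-400,000 words per dollar into a cost range for one script. It assumes the quoted yield holds and ignores input and reasoning tokens, so treat the result as a lower bound.

```python
from typing import Tuple


def script_cost_range_usd(
    word_count: int,
    words_per_dollar: Tuple[int, int] = (300_000, 400_000),
) -> Tuple[float, float]:
    """Return (lowest, highest) estimated cost in USD for word_count words."""
    low_yield, high_yield = words_per_dollar
    return word_count / high_yield, word_count / low_yield


low, high = script_cost_range_usd(3000)
print(f"3000-word script: ${low:.4f} to ${high:.4f}")
```

Under these assumptions a 3,000-word script costs on the order of one cent.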

How long does video analysis take?

Google Gemini video analysis takes approximately 150-200 seconds, depending on video length and complexity.

Next Steps

After mastering the basics, you can try:

Adding more writing style templates
Integrating auto-download tools to reduce manual steps
Adding multi-platform one-click distribution
Building script quality scoring and filtering mechanisms

Questions? Feel free to leave comments!

About the author

Alex Chen