Need help with AI image-to-video description generation

You’re already thinking in the right “multi-pass” direction. I’ll add a different angle: treat this as a data & control problem, not just a prompt-engineering one.

1. Stop trusting the model’s eyeballs by default

Where I slightly disagree with @nachtdromer: relying heavily on frame → JSON → LLM inference can still drift. Vision models are noisy. Instead, build redundancy:

  • Run 2 lightweight vision passes per keyframe (for example, CLIP-style tags + a detector/segmenter).
  • Only keep objects / attributes that both passes broadly agree on.
  • Expose uncertainty: let attributes be “strong,” “medium,” “weak” instead of all-or-nothing. Your text model can be told: “only verbalize strong + medium facts.”

That alone cuts hallucinations a lot.
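
As a rough sketch, the agreement check is only a few lines. The two-pass outputs and tier thresholds below are made-up placeholders, not any specific model's API:

def merge_passes(pass_a, pass_b):
    """Keep only labels both vision passes detected, graded by agreement strength."""
    merged = {}
    for label in pass_a.keys() & pass_b.keys():
        score = min(pass_a[label], pass_b[label])  # the weaker pass decides the tier
        if score >= 0.8:
            merged[label] = "strong"
        elif score >= 0.5:
            merged[label] = "medium"
        else:
            merged[label] = "weak"
    return merged

# Hypothetical outputs for one keyframe: CLIP-style tags vs. a detector
clip_tags = {"red car": 0.91, "street": 0.84, "dog": 0.41}
detector = {"red car": 0.88, "street": 0.62, "traffic light": 0.70}
facts = merge_passes(clip_tags, detector)
# -> {"red car": "strong", "street": "medium"}; "dog" and "traffic light" drop out
to_verbalize = [label for label, tier in facts.items() if tier != "weak"]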

2. Force alignment between video dynamics and text

Most descriptions are static even when the clip isn’t. To fix that, structure your input like this before the LLM ever writes prose:

{
  "global_scene": {...},
  "entities": [
    {
      "id": "car_1",
      "type": "vehicle",
      "attributes": { "color": "red" },
      "track": [
        {"t": 0.0, "x": 0.1, "y": 0.5},
        {"t": 1.0, "x": 0.3, "y": 0.5},
        {"t": 2.0, "x": 0.6, "y": 0.5}
      ]
    }
  ],
  "camera": {
    "motion": [
      {"t": 0.0, "zoom": "wide"},
      {"t": 2.0, "zoom": "close"}
    ]
  }
}

Then prompt something like:

Describe each entity using how its position or size changes over time. Always mention at least one motion verb derived from the track (e.g. moves left, approaches, recedes, stays still).

This prevents the model from collapsing your clip into “a static shot of a red car.”
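
You can even pre-compute the motion verb instead of hoping the LLM reads the track correctly. A minimal sketch, with thresholds I picked arbitrarily:

def motion_hint(track):
    """Turn a normalized (x, y) track into a coarse motion verb for the prompt."""
    dx = track[-1]["x"] - track[0]["x"]
    dy = track[-1]["y"] - track[0]["y"]
    if abs(dx) < 0.05 and abs(dy) < 0.05:
        return "stays still"
    if abs(dx) >= abs(dy):
        return "moves right" if dx > 0 else "moves left"
    return "moves down" if dy > 0 else "moves up"

track = [{"t": 0.0, "x": 0.1, "y": 0.5}, {"t": 2.0, "x": 0.6, "y": 0.5}]
print(motion_hint(track))  # -> "moves right"

Feed the computed verb in as part of the entity's data and the model only has to phrase the motion, not infer it.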

3. Use templates like a layout, not like generic boilerplate

To beat generic text, give the model a fixed layout where each sentence has a job. Example:

  1. Sentence 1: Who/what + setting + visual style.
  2. Sentence 2: Main subject’s motion or action across time.
  3. Sentence 3: Camera movement or framing changes.
  4. Optional sentence 4: Lighting/color/mood change.

Then explicitly block fluff:

If a slot has no information (e.g. no camera motion), skip that sentence instead of inventing details.

You end up with consistent, dense descriptions that still read naturally.
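
A quick sketch of enforcing that layout in code rather than trusting the prompt alone (the slot names and contents are illustrative):

def build_description(slots):
    """Assemble sentences in a fixed order; empty slots are skipped, not invented."""
    order = ["subject_setting", "subject_motion", "camera", "lighting_mood"]
    return " ".join(slots[k] for k in order if slots.get(k))

slots = {
    "subject_setting": "A red car sits on a rain-slicked city street at night.",
    "subject_motion": "It accelerates from the left edge toward the center of the frame.",
    "camera": None,  # no camera motion detected, so that sentence is skipped
    "lighting_mood": "Neon signs cast blue and magenta reflections on the asphalt.",
}
print(build_description(slots))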

4. Separate “search text” from “human text”

People often try to make one perfect paragraph that serves:

  • UX reading
  • Search / retrieval
  • Accessibility

That is how you end up with vague soup.

Instead, build two outputs from the same structured data:

A. Retrieval-focused text

  • Single sentence or short paragraph.
  • Overweight nouns and adjectives.
  • Explicit labels like: “medium shot, low angle, cyberpunk city street, neon signage, wet asphalt, rainy night, blue and magenta highlights.”

B. Human-friendly narration

  • Short, polished, ~40–70 words.
  • Allowed to merge details and remove repetition.

You can even build your search index on the retrieval text alone and keep the polished description for front-end display.

This is where a dedicated “AI image to video description generator” pipeline really behaves like a product, not a single LLM call.
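
A sketch of the split, assuming a generic llm(prompt) callable standing in for whatever chat-completion client you actually use:

import json

def describe(structured, llm):
    """Two generations from the same structured facts: one for search, one for humans."""
    facts = json.dumps(structured)
    retrieval_text = llm(
        "From this data, output one dense sentence overweighting nouns, adjectives, "
        "shot type, and style labels. No opinions, no filler.\n" + facts
    )
    narration = llm(
        "From this data, write a polished 40-70 word description. "
        "Merge related details and skip anything not in the data.\n" + facts
    )
    return {"retrieval_text": retrieval_text, "narration": narration}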

Pros of this dual-output approach:

  • Better search quality
  • Cleaner UI copy
  • Easier to debug (you know which layer failed)

Cons:

  • Two generations per clip
  • Slightly higher latency and cost

5. Borrow from captioning datasets, not from marketing copy

To avoid “stunning,” “beautiful,” “cinematic” spam, bias your prompts toward the style of captioning datasets like MS-COCO and its video-captioning counterparts:

  • Short, literal, compositional.
  • Strong on “who does what where,” weak on “this looks amazing.”

Example prompt snippet:

Use simple, factual language similar to image captioning datasets. Focus on objects, positions, and actions. Avoid opinions or value judgments like “beautiful,” “epic,” or “stunning.”

This nudges the model closer to grounded captions and away from trailer text.

6. Make the model choose what to omit

When everything is “describe in detail,” you get either walls of text or vague summaries.

Instead, give it a budget and an explicit decision rule:

You may mention at most 8 distinct visual elements.
First, rank all detected elements by:

  1. How central they are in the frame(s).
  2. How long they are visible.
  3. How much they move or change.

Only describe the top-ranked ones.

By forcing prioritization, you encourage concrete, salient details over listing every background object.
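
A sketch of that ranking with made-up weights and field names (visible_fraction and motion_magnitude would come from your tracker); the point is that the cut to eight happens before the LLM sees anything:

def top_elements(elements, budget=8):
    """Rank detected elements by centrality, screen time, and motion; keep the top N."""
    def salience(e):
        # distance of the element's center from frame center, in normalized coords
        centrality = 1.0 - min(abs(e["x"] - 0.5) + abs(e["y"] - 0.5), 1.0)
        return 0.5 * centrality + 0.3 * e["visible_fraction"] + 0.2 * e["motion_magnitude"]
    return sorted(elements, key=salience, reverse=True)[:budget]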

7. Lightweight self-calibration without a full critique pass

I like @nachtdromer’s self-critique trick, but you can make it cheaper:

  1. Generate the description.

  2. Ask the model in a second, tiny prompt:

    From this description, list the top 5 visual elements that are claimed to be present.

  3. Automatically compare those 5 against your structured data:

    • If any are unsupported, flag and re-run generation with a stricter prompt:

      You previously mentioned unsupported elements: X, Y. Regenerate the description without them and only use entities that appear in the data.

No need for a long analysis step; just a small guardrail loop.
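
The whole loop fits in one small function. A sketch assuming the same llm(prompt) callable and the tiered facts from step 1 (exact string matching here is a simplification; real matching should be fuzzier):

def guarded_describe(facts, llm, max_retries=2):
    """Generate, extract the top claims, regenerate only if a claim is unsupported."""
    supported = {f.lower() for f in facts}
    description = llm("Describe the clip using only these elements: " + ", ".join(facts))
    for _ in range(max_retries):
        claims = llm("From this description, list the top 5 visual elements "
                     "claimed to be present, one per line:\n" + description)
        # naive exact match; swap in fuzzy or embedding matching in practice
        unsupported = [c.strip() for c in claims.splitlines()
                       if c.strip() and c.strip().lower() not in supported]
        if not unsupported:
            return description
        description = llm(
            "You previously mentioned unsupported elements: " + ", ".join(unsupported)
            + ". Regenerate the description without them, using only: " + ", ".join(facts)
        )
    return description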

8. Explicit style channel for AI-generated content

Because your videos come from AI images, you often know style tags up front. Instead of letting the model guess:

  • Maintain a separate style field per clip: ['oil painting', 'isometric', 'pixel art', 'photorealistic', 'anime', 'cyberpunk', ...]

  • Feed it in separately and constrain usage:

    You may only use style words from this list and only if they clearly fit the visuals. Do not invent new style labels.

This keeps your descriptions consistent across a series created with similar prompts, and it is easier to filter or cluster by style later.
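
Validating the constraint after generation is cheap. A sketch assuming the per-clip style list from above (the vocabulary is whatever you track, not a fixed standard):

STYLE_VOCAB = {"oil painting", "isometric", "pixel art",
               "photorealistic", "anime", "cyberpunk"}

def style_violations(description, clip_styles):
    """Flag vocabulary style words used in the text but not tagged on this clip."""
    text = description.lower()
    return [s for s in STYLE_VOCAB - set(clip_styles) if s in text]

# A clip tagged ["cyberpunk"] must not get described as "anime":
print(style_violations("A cyberpunk street rendered in anime style.", ["cyberpunk"]))
# -> ["anime"]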

Pros:

  • High style consistency across a dataset
  • Easy to filter/search by art style

Cons:

  • Requires you to track style metadata or tag it reliably
  • Can feel slightly rigid if the visual result drifts from the original style intent

9. Where this differs from @nachtdromer

  • I lean heavier on:

    • Redundant vision passes and agreement checking.
    • Explicit motion tracks and role-based sentence templates.
    • Strict priority rules for what gets mentioned.
  • I lean lighter on:

    • Long self-critique chains.
    • Very detailed JSON diffing across keyframes.

Instead, I prefer a small set of ranked entities plus motion tracks.

If you share the rough average clip length and whether latency / cost are your main constraints, you can trim this to a very lean 2–3 step pipeline that still gives you accurate, non-generic video descriptions from AI images.