
Why Photographers Already Have the AI-Search Edge (We Just Don't Use It)

Multi-modal pages (text plus images plus embedded video plus structured data on the same URL) get cited by AI search at roughly 156% of the rate of text-only pages. Photographers and video producers ship multi-modal content as a side effect of doing the work. Most service businesses don't, which is why the gap is wide open.

A pattern fell out of the AI Overviews citation data I was reading this spring. Pages that combined text, images, embedded video, and structured data on the same URL were getting selected by AI search at roughly 156% of the rate of text-only pages on the same topic. YouTube was the single most-cited domain in Google AI Overviews. Not Wikipedia, not Forbes. YouTube. The signal AI search rewards in 2026 is multi-modal density on a single page. Most service businesses are not building for it. Photographers and video producers already ship multi-modal content as a side effect of the work. We just do not use it.

What "Multi-Modal" Actually Means on a Page

Multi-modal does not mean you have a blog and also a YouTube channel. It does not mean your homepage has a hero image. It means a single URL contains, simultaneously, several content types that AI models can ingest and cross-reference: written text, embedded photographs with descriptive alt text, at least one embedded video (typically a YouTube embed), and structured data (JSON-LD) describing what the page is about.

The distinction matters because AI search engines are not parsing your site the way Google's classic crawler did in 2018. They read a page as a unit of evidence. A page that says "we do corporate photography in St. Louis," shows real photographs of real corporate work, embeds a video walkthrough of how a shoot day runs, and includes machine-readable schema confirming the business, the service, and the FAQ, is offering the model four independent confirmations of the same claim. A text-only page offers one. The multi-modal page wins citations because it gives the model more to verify against.

Why Photographers and Video Producers Are Already Halfway There

This is the part that should make every marketing director who works with a photographer pay attention. Photographers and video producers do not manufacture multi-modal content as a separate project. We ship it as the deliverable.

When I photograph a corporate team in Kansas City, the output is a folder of high-resolution images. When I shoot a behind-the-scenes day at a corporate interview, the output is video footage that becomes a YouTube upload. Every page I publish about a real shoot can be multi-modal from the first draft because the assets exist before the page does.

Most service businesses do not have this pipeline. A consulting firm writing a thought leadership post on supply chain risk has text. They do not have a five-minute video of a client interview, twelve photographs of a real engagement, and a transcript ready to embed. They could commission those assets, and the smart firms increasingly do, but they are starting from zero. Photographers and video producers are starting from a content surplus. The question is whether we package it correctly. Most of us do not.

The Signals AI Search Actually Looks For

The specific elements that move the needle, based on citation data and the schema patterns showing up on pages getting picked up by AI Overviews and ChatGPT search, are narrower than "have some images." Here is what matters.

VideoObject schema. A YouTube embed without VideoObject JSON-LD is a video the model can see but cannot easily verify. VideoObject schema declares the video's name, description, thumbnail, upload date, duration, and the URL of the embedded video. When you wrap a YouTube embed in this schema, you are handing the model a structured statement of what that video contains.
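A minimal VideoObject block for a YouTube embed looks like this. The names, URLs, and dates below are placeholders, not a real page:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Corporate Photography Shoot Day Walkthrough",
  "description": "Behind-the-scenes walkthrough of a corporate headshot shoot day, from setup to final selects.",
  "thumbnailUrl": "https://example.com/images/shoot-day-thumb.jpg",
  "uploadDate": "2026-01-15",
  "duration": "PT4M30S",
  "embedUrl": "https://www.youtube.com/embed/VIDEO_ID"
}
</script>
```

Note the duration is ISO 8601 format (PT4M30S is four minutes thirty seconds), and embedUrl points at the YouTube embed URL, not the watch page.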

ImageObject schema and rich alt text. Images need alt text that describes what is happening, not what the image is. "Corporate team headshot" is not alt text. "Six executives from a Kansas City wealth management firm photographed against a charcoal backdrop with consistent lighting across the team" is alt text. Pair that with ImageObject schema and the model can reason about the image without seeing it.
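In markup, the pair looks like this. Filenames and the business name are placeholders:

```html
<img src="https://example.com/images/kc-wealth-team.jpg"
     alt="Six executives from a Kansas City wealth management firm photographed
          against a charcoal backdrop with consistent lighting across the team">

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/images/kc-wealth-team.jpg",
  "caption": "Six executives from a Kansas City wealth management firm photographed against a charcoal backdrop",
  "representativeOfPage": true
}
</script>
```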

FAQPage schema. The cheapest, highest-return addition you can make to almost any service page. A short FAQ block, marked up with FAQPage JSON-LD, gives the model literal question-and-answer pairs in machine-readable form. I wrote about this in the FAQ schema piece, and the short version is that FAQ schema is one of the most reliable ways to get cited for a specific question.
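The markup is a single JSON-LD block listing question-and-answer pairs. These two questions are illustrative, not pulled from a real page:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How long does a corporate headshot session take per person?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Plan on five to ten minutes per person once lighting is set, so a team of twenty is usually done in a morning."
      }
    },
    {
      "@type": "Question",
      "name": "Do you photograph on-site or in a studio?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Either. Most corporate teams book on-site sessions so nobody has to leave the office."
      }
    }
  ]
}
</script>
```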

Transcripts and captions on embedded video. A YouTube embed with closed captions and a published transcript does double duty. The video counts as a media type. The transcript counts as searchable text. You get credit for two content modes from one asset.

Page-level multi-modal density. Not "is there an image somewhere on this site," but "does this specific URL contain text, images, video, and structured data, all referring to the same topic." Spreading those modes across separate pages does not produce the same signal. They have to live on the same URL.

Merrill Lynch corporate team portrait used as part of a multi-modal service page combining text, images, video, and structured data

How to Build a Multi-Modal Service Page From Scratch

If you are starting with a blank page, the recipe is not complicated. It is just disciplined.

Lead with an answer-first paragraph that names the service, the audience, and a concrete number or two. The first paragraph is what AI search will most often quote, and a paragraph that opens with a clear answer beats one that opens with throat-clearing. I covered the mechanics in the answer-first content piece, and the principle carries straight into multi-modal pages.

Embed a YouTube video early on the page, ideally close to the fold. The video should be substantively related to the service, not a generic brand reel. For a corporate photography page, that might be a behind-the-scenes walkthrough of an actual shoot day. For a video production page, it might be a tour of the gear and the interview setup. Wrap the embed in VideoObject schema and make sure the YouTube upload has captions enabled.

Include an image grid showing real client work. Three to six photographs, each with descriptive alt text, each with ImageObject schema if you are willing to do the markup. The grid should not be decorative stock. It should be evidence: this is what the deliverable looks like.

Add a short FAQ section at the bottom with three to six honest questions a buyer asks during the discovery call. Mark it up with FAQPage schema. Use the questions you actually get on calls, not marketing-flavored versions.

Layer LocalBusiness, Service, and BreadcrumbList schemas across the rest of the page. They are paragraphs of JSON. They are also the difference between a page that says it offers a service and one that machine-readably declares the service exists, who provides it, and where it sits in your site hierarchy.
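Those three schemas can live in one JSON-LD block using @graph. A sketch with placeholder names and URLs:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Service",
      "serviceType": "Corporate Photography",
      "areaServed": "Kansas City",
      "provider": {
        "@type": "LocalBusiness",
        "name": "Example Photography",
        "url": "https://example.com"
      }
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://example.com/" },
        { "@type": "ListItem", "position": 2, "name": "Services", "item": "https://example.com/services/" },
        { "@type": "ListItem", "position": 3, "name": "Corporate Photography", "item": "https://example.com/services/corporate-photography/" }
      ]
    }
  ]
}
</script>
```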

What Lives on My Service Pages Right Now

This site is not a hypothetical. The corporate photography page renders text describing the service, an image grid of real client work, an embedded YouTube video walkthrough, and a stack of LocalBusiness, Service, FAQPage, and BreadcrumbList JSON-LD. The same template, with different assets, runs on the video production page. Each page is built so that an AI model parsing it has the maximum number of independent confirmations of what the page is about.

I did not build it that way because I was clever. I built it that way because the photographs and videos already existed from the work itself. The asset library was in the folder. The structured data was the part I had to add, and that took an afternoon. This is the spoke under the broader AI search visibility playbook: give the model unambiguous evidence of who you are and what you do.

For Marketing Directors Who Don't Have a Photographer on Staff

If your organization does not currently work with a photographer or video producer, the practical move is not to write twenty new blog posts hoping AI search picks one up. The practical move is to commission a single coordinated photo and video shoot day that produces enough raw material to power four to six multi-modal service pages.

A full day with a coordinated crew (photographer plus video) on your premises, photographing your team and your real work environment while filming an interview or two with the people who lead the work, conservatively produces 80 to 150 finished photographs across team, environmental, and detail categories, plus enough video footage for three or four polished service-page videos and a handful of social cuts. That is enough to rebuild your top service pages as multi-modal pages and still have inventory left for the next quarter.

The arithmetic is not subtle. One day of disruption and one production budget gets you the asset library AI search rewards. If this is the year you are making a serious move on AI search visibility, this is the foundational shoot to commission. The video production service and the corporate photography service can be combined into a single day for exactly this reason.

Northwestern Mutual team composite published alongside an embedded YouTube walkthrough and FAQPage schema on the same service page

The Transcript Hack Most Pages Miss

This is the most valuable move I see pages skip. When you embed a YouTube video on a service page, turn captions on, and publish the transcript on the page itself (inline or in an expandable section), you double the multi-modal signal from one asset.

Here is why. The video counts as a video content mode. The captions, embedded in the YouTube player, are read by AI models as part of the video metadata. The transcript on the page is text content that explicitly describes what the video says, in a form fully searchable and parseable. You are now getting credit, on the same URL, for video plus image (the thumbnail) plus text (the transcript) plus structured data (the VideoObject schema). One asset, four signals. This is the same compounding effect the AI citation playbook keeps coming back to: stack independent confirmations on a single URL until the model has no reason not to cite you.

Most businesses skip the transcript because it feels redundant. It is not redundant to an AI model. It is the bridge that connects the video to the rest of the page's text. A model reading the transcript can verify the video is on-topic, extract specific claims, and cite either the video or the page. Without the transcript, the video is a black box.
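The whole pattern fits in a few lines of markup. Everything below is a placeholder sketch; note that schema.org defines a transcript property directly on VideoObject, so the transcript text can go in the schema as well as on the page:

```html
<!-- YouTube embed; captions are enabled on the upload itself -->
<iframe src="https://www.youtube.com/embed/VIDEO_ID"
        title="Corporate shoot day walkthrough"></iframe>

<!-- Transcript published on the same URL, collapsed by default -->
<details>
  <summary>Read the full video transcript</summary>
  <p>Today we're walking through a corporate shoot day, from load-in
     at 8 a.m. to final selects...</p>
</details>

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Corporate shoot day walkthrough",
  "thumbnailUrl": "https://example.com/images/walkthrough-thumb.jpg",
  "uploadDate": "2026-01-15",
  "embedUrl": "https://www.youtube.com/embed/VIDEO_ID",
  "transcript": "Today we're walking through a corporate shoot day, from load-in at 8 a.m. to final selects..."
}
</script>
```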

Need a single shoot day that fuels 4-6 multi-modal pages?

AI accelerates discoverability. The lens still does the work. The AI-Visual Branding Package combines one coordinated shoot day with the AI search setup and ongoing citation tracking.

See the package

What to Do This Week

A short list, in priority order. Pick whichever ones you can ship before next Friday.

1. Audit your three highest-traffic service pages. For each, list which of the four content modes are present: text, images, embedded video, structured data. Most pages will be missing at least two.

2. For any page missing embedded video, decide whether you have a YouTube asset that fits or need to commission one. A two-minute behind-the-scenes video shot in a single morning often covers it.

3. Add VideoObject and FAQPage schema to any page that already has the underlying content. This is a markup change, not a content change. You can ship it the same week.

4. Rewrite alt text on the images already on the page. Aim for descriptive sentences, not category labels. Costs nothing, improves both image SEO and the multi-modal signal.

5. If your team does not have a photo and video library to draw from, schedule the shoot day. One day produces months of material.
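The audit in step 1 can be scripted. This is a rough sketch using only the Python standard library; the 20-word threshold for "real text" and the specific tag checks are my own heuristics, not anything AI search publishes:

```python
from html.parser import HTMLParser


class ModeAudit(HTMLParser):
    """Checks one page's HTML for the four content modes:
    text, images, embedded video, and structured data."""

    def __init__(self):
        super().__init__()
        self.modes = {"text": False, "images": False,
                      "video": False, "schema": False}
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script":
            self._in_script = True
            # JSON-LD blocks are <script type="application/ld+json">
            if a.get("type") == "application/ld+json":
                self.modes["schema"] = True
        elif tag == "img":
            self.modes["images"] = True
        elif tag == "iframe" and "youtube" in (a.get("src") or ""):
            self.modes["video"] = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Count a run of 20+ words outside <script> as real body text
        if not self._in_script and len(data.split()) >= 20:
            self.modes["text"] = True


def audit(html: str) -> dict:
    parser = ModeAudit()
    parser.feed(html)
    return parser.modes
```

Run audit() on the saved HTML of each service page and list which modes come back False. Most pages will be missing at least two.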

If you want to talk through which pages to rebuild first, get in touch and we'll work through it together.

Topics

multi-modal content AI, video for SEO, multi-modal SEO, image SEO for AI, structured data plus video, YouTube SEO photographer, AI Overviews video, VideoObject schema

Ready to build multi-modal service pages AI search will cite?

We're happy to discuss anything covered in this article, or your specific photography and video needs.

Get a Quote