The Stylist in the Machine: Shipping a Day-1 Fashion Recommender with LLMs

By Madhura Raut on February 19, 2026

Launching a new fashion vertical has always been a challenge in global e-commerce. You may already run a successful marketplace with strong infrastructure, payments, logistics, and even recommendation systems for other verticals. But fashion behaves very differently: customers aren’t just purchasing products; they are buying a whole look. On day one of your fashion vertical’s launch, you face an uncomfortable reality: you have no user data, no clicks, no carts, and no purchase history, all of which form the foundation of a recommender system.

Despite this challenge, you want your product pages to appear intelligent. A customer looking at a white linen dress should immediately see sandals, bags, and jewelry that help complete the look. That understanding does not have to come from user data; it can come from a system that understands which products make sense together.

This article describes how to build such a system from scratch, using large language models (LLMs) as its core while keeping the solution scalable and production-ready.

Cold Start Problem: Launching a Fashion Vertical Without Historical Data

Imagine being an e-commerce giant expanding into fashion for the first time. The catalog is large and visually rich, with each product carrying a title, description, and attribute information, and it spans every type of fashion product: dresses, shoes, bags, accessories, and more. The only missing piece is behavioral data, the very core of most recommender systems; without it, there is nothing for a conventional model to train on.

Despite this, the business expectations are clear. Leadership expects large basket sizes, product teams want a true outfit-level discovery experience, merchandising wants the site to reflect the brand, and customers expect inspiring product pages, not empty or redundant recommendations.

This is where a Frequently Purchased Together (FPT) system traditionally comes in, but the classic definition breaks down immediately in fashion.

Why Cold-Start Recommendations Matter More in Fashion than Anywhere Else

Many teams wait for data before shipping recommendation products. In fashion, that is a mistake. Fashion has high decision friction: customers rarely arrive knowing exactly what they want to buy in isolation. They rely on visual cues, styling recommendations, and contextual guidance. Without them, basket sizes stay small and sessions often end early.

More importantly, a cold start can kick off a self-reinforcing cycle: if you don’t provide recommendations, users don’t explore, and if users don’t explore, you don’t learn. Waiting for data therefore delays learning itself. A day-1 FPT system solves multiple problems at once: it immediately improves conversion and basket size, and it accelerates collection of the quality signals that future models will learn from. It gives you a strong, intelligent baseline.

In most ML literature, FPT is treated through a statistical lens: items that appear in the same order are linked together, co-purchase graphs form over time, and recommendation systems emerge from them. In fashion, “purchased together” really means worn or styled together. Even if two dresses are purchased in a single order, that does not make them good FPT recommendations for each other; a dress is usually styled with a bag, jewelry, and shoes. The relationship is semantic rather than statistical.

This distinction matters because it moves the problem from pattern mining to semantic reasoning. Instead of asking what users did, the system must interpret what makes sense from a styling and functional perspective. This is precisely where LLMs shine.

The Core Idea: Using LLMs as Stylists, Not Predictors

LLMs already encode world knowledge, including a vast amount of fashion knowledge. They understand that linen is summer-appropriate, that raffia bags pair well with resortwear, and that minimalist dresses pair better with delicate jewelry than with statement pieces. Instead of inferring these relationships from sparse data, we can ask an LLM to reason about them directly.

There is an important nuance here: LLMs should not be used naively as online black boxes. A scalable system treats the LLM as a reasoning engine, not a real-time dependency, and as a cold-start intelligence layer, not a replacement for the learned model that will eventually take over. The sections below cover two implementation paths.

Implementation Path 1: Offline LLM-Generated FPT Lists

The most reliable way to have an FPT system on day one is to distinguish between what needs to be deterministic and what benefits from reasoning. This implementation path designs the system in two explicit stages.

  • Stage 1: Deterministic candidate and role selection. This stage answers a critical question: what kinds of items could be bought together with the anchor product? Fashion catalogs usually contain enough structure to answer this deterministically. For example, when the anchor item is a dress, the system can confidently retrieve complementary categories such as shoes, bags, and jewelry. The goal of this stage is simply to reduce the decision space so the LLM has an easier task in stage 2 (a minimal sketch of this stage follows the list).
  • Stage 2: LLM-based selection and ranking. Once the decision space is reduced, the LLM is used offline, once per anchor, to do what it excels at: stylistic reasoning and ranking. At this stage, the LLM decides:
  1. Which items best complete the outfit?
  2. How well does each item fit the anchor item’s style, occasion, and aesthetic?
  3. Which option should be prioritized when tradeoffs exist?
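
As a concrete sketch of stage 1, role selection can be a simple category lookup plus deterministic filters. The role map and catalog fields below (category, season, id) are illustrative assumptions rather than a prescribed schema:

```python
# Stage 1 sketch: deterministic candidate and role selection.
ROLE_MAP = {
    "dress": ["shoes", "bags", "jewelry"],
    "shoes": ["bags", "socks", "jewelry"],
}

def select_candidates(anchor, catalog, per_role=80):
    """Build a small, role-tagged candidate pool for one anchor product."""
    candidates = {}
    for role in ROLE_MAP.get(anchor["category"], []):
        pool = [
            item for item in catalog
            if item["category"] == role
            and item["season"] == anchor["season"]  # cheap deterministic filter
            and item["id"] != anchor["id"]
        ]
        candidates[role] = pool[:per_role]  # keep the stage-2 decision space small
    return candidates
```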

Below is a realistic prompt pattern that can be used in an offline batch job. The prompt should be intentionally explicit, constrained, and well structured.

System Message:

You are a senior fashion stylist and merchandising assistant for an online fashion marketplace. Your goal is to recommend items that a customer would purchase together with the anchor product in order to COMPLETE a cohesive outfit.

You must follow these rules:

  • Recommend complementary items only. Do NOT recommend substitutes or items from the same narrow category as the anchor.
  • Ensure role diversity: do not recommend two items that serve the same function.
  • Match seasonality, occasion, and overall aesthetic.
  • Prefer neutral or harmonious pairings unless the anchor is a statement piece.
  • Output valid JSON only. Do not include explanations outside the specified fields.

User Message:

The user message can contain structured information about the anchor product and the candidate products you wish to rank. With this structure, the output is easy to audit and prompt versioning stays simple.
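
As a rough sketch of the stage-2 batch step, the snippet below builds the structured user message and validates the constrained JSON output. The call_llm function stands in for whatever chat-completion client you use, and the payload fields are assumptions chosen to mirror the system message above:

```python
import json

def build_user_message(anchor, candidates):
    """Serialize the anchor product and the role-tagged candidates into one structured prompt."""
    payload = {
        "anchor": {"title": anchor["title"], "attributes": anchor["attributes"]},
        "candidates": [
            {"id": item["id"], "role": role, "title": item["title"]}
            for role, pool in candidates.items()
            for item in pool
        ],
        "output_schema": {"recommendations": [{"id": "string", "role": "string", "reason": "string"}]},
    }
    return json.dumps(payload)

def rank_for_anchor(anchor, candidates, call_llm, system_message):
    """Run one offline LLM call per anchor and keep only items that exist in the candidate pool."""
    raw = call_llm(system=system_message, user=build_user_message(anchor, candidates))
    parsed = json.loads(raw)  # fails loudly if the model did not return valid JSON
    known_ids = {item["id"] for pool in candidates.values() for item in pool}
    return [rec for rec in parsed["recommendations"] if rec["id"] in known_ids]
```

The ID filter at the end is a cheap guardrail: anything the model invents outside the candidate pool is dropped before it reaches the product page.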

There are some evident pros and cons of this approach:

Pros of this implementation path

  • Provides a cold-start FPT solution without requiring any user data.
  • Keeps runtime latency low, since the heavy lifting is done offline.
  • Easy to debug and QA, thanks to the structured LLM output.

Cons of this implementation path

  • Freshness issues: as inventory and trends change, the system cannot react in real time, since offline batch updates run on their own SLA.
  • Candidate selection criteria: picking the top 80-100 items per category for stage 2 is non-trivial in a large catalog.
  • Prompt sensitivity: small changes in phrasing can lead to significant changes in ranking.

Implementation Path 2: LLM-Generated Styling-Intent Embeddings for Scalable Fashion FPT

The first implementation path works well for launching a new fashion vertical, but it runs into challenges as the catalog grows and product freshness becomes an important dimension. Implementation path 2 addresses this by shifting the LLM’s role from ranker to intent-definition expert, which leads to a more scalable solution. The core idea is simple: instead of storing top FPT items per anchor, we embed the styling intent. The system still solves the complete-the-outfit problem of finding the right sandals, bag, and jewelry, but in a way that scales.

Implementation path 2 asks the LLM a different question. Instead of “which products should be shown from this list?”, the LLM is asked “what should a good complementary item look like?” For each anchor item, the LLM generates short, role-specific descriptions that capture the ideal complementary item. For example, instead of ranking a specific brown leather shoe, the LLM produces a description of the ideal footwear: breathable, in neutral tones, and suited to summer wear. This description is never shown in the customer app; it exists purely as a semantic query.

The LLM-generated descriptions are then embedded with a standard text-embedding model. Each anchor item can yield multiple embeddings, one per role such as shoes or jewelry. Separately, the catalog itself is embedded using product titles, descriptions, and attributes. Retrieval then becomes a standard vector-search problem: matching intent embeddings to product embeddings.

The elegance of this approach lies in its scalability: product embeddings only need to be generated once for fresh catalog items, all the reasoning happens offline, and retrieval remains fast and cheap. The approach also fits fashion intuitively, since concepts like “minimal” and “resort casual” transfer naturally across product categories.

An example of the prompt design is as follows:

You are a fashion stylist.

Given the anchor product below, write a short description of the IDEAL item that would complement it for the specified role. This description will be used for semantic retrieval, not shown to users. Focus on style, material, formality, season, and aesthetic compatibility. Avoid brand names and avoid referencing the anchor product directly.

Anchor product:

White linen midi dress
Style: minimal, coastal
Occasion: daytime, vacation
Season: spring/summer

Role: Shoes

An example output can look like the following:

Breathable flat or low-heel sandals suitable for warm weather, neutral or natural tones, minimal design, casual daytime resort styling.
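
To make the retrieval step concrete, here is a minimal sketch that embeds the catalog once and matches a role-specific intent description against it. The embed(text) function stands in for any text-embedding model returning a NumPy vector, and the product fields are assumptions:

```python
import numpy as np

def build_catalog_index(catalog, embed):
    """Embed every catalog item once from its title and description text."""
    vectors = [embed(f'{item["title"]}. {item["description"]}') for item in catalog]
    matrix = np.vstack(vectors).astype(np.float32)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # unit rows for cosine scores

def retrieve_for_role(intent_description, catalog, index, embed, top_k=10):
    """Match one role-specific styling-intent embedding against the catalog embeddings."""
    query = embed(intent_description)
    query = query / np.linalg.norm(query)
    scores = index @ query            # cosine similarity against every product
    best = np.argsort(-scores)[:top_k]
    return [catalog[i] for i in best]
```

In production the dot product would be replaced by an approximate-nearest-neighbor index, but the contract stays the same: one precomputed intent embedding per anchor-role pair, matched against precomputed product embeddings.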

Pros of this implementation path:

  • Highly scalable: the intent embeddings generalize, and the system scales naturally as the catalog grows.
  • Generating separate intent embeddings per role keeps the output diverse without complex post-processing logic.

Cons of this implementation path:

  • As with implementation path 1, quality depends heavily on the intent prompt design.
  • Debugging is harder, since there are multiple potential failure points: intent generation, product embeddings, and vector search.

Monitoring and Quality Control in an LLM-Driven FPT System

Monitoring an LLM-driven FPT system requires a different mindset than traditional recommendation evaluation. With no historical data, quality has to be judged through consistency and stability. Tracking metrics such as role-distribution shifts and structural drift surfaces issues sooner than user metrics can.

As the system gains traffic, lightweight engagement metrics such as attach rate and add-to-cart frequency become useful feedback loops. Since LLM behavior is sensitive to prompts, maintain prompt versions and review outputs regularly. In practice, combining automated monitoring with human review during season transitions is enough to maintain recommendation quality.
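
As one example of such a consistency check, comparing the role distribution between two offline batch runs can flag structural drift before any user metric moves. The field names below are assumptions:

```python
from collections import Counter

def role_distribution(recommendations):
    """Share of each recommendation role in one batch run's output."""
    counts = Counter(rec["role"] for rec in recommendations)
    total = sum(counts.values())
    return {role: count / total for role, count in counts.items()}

def role_drift(previous_run, current_run, threshold=0.10):
    """Roles whose share moved by more than `threshold` between consecutive runs."""
    prev, curr = role_distribution(previous_run), role_distribution(current_run)
    return {
        role: (prev.get(role, 0.0), curr.get(role, 0.0))
        for role in set(prev) | set(curr)
        if abs(curr.get(role, 0.0) - prev.get(role, 0.0)) > threshold
    }
```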

Closing Thoughts: Cold Start Is No Longer an Excuse

For decades, cold start was treated as an unavoidable problem in recommender and search systems, but LLMs change that assumption. They let teams deploy domain knowledge and styling logic directly, without waiting for data collection. A cold-start FPT system is not just a revenue driver; it also shows customers that the platform understands style and intent. With the right architecture, it can be built in a scalable and controlled way.

About the Author

Madhura Raut is a Principal Data Scientist and tech leader in the San Francisco Bay Area with 9+ years of experience delivering production ML systems, with a focus on time-series forecasting, deep learning, and agentic AI. She is an IOA Fellow, a Senior Member of IEEE and IET, and an active speaker, reviewer, hackathon judge, and mentor, with talks including KDD 2025 and Data Science Salon and features across Entrepreneur, The Economic Times, Mashable, and Vogue. LinkedIn: https://www.linkedin.com/in/madhuraraut/

Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.
