LLM-Graded Labels Outperform 1.5M Purchase Labels in Fashion Search Cross-Encoder Training
Key Takeaways
- LLM-graded labels significantly outperformed 1.5M purchase labels when training a cross-encoder for fashion search reranking
- The critical factor in model improvement was label quality, not data volume: $25 in LLM-graded labels beat free training data at scale
- Cross-encoders, which process a query and document jointly, are slower but far more accurate than bi-encoders, making them ideal for reranking top-K candidate sets
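The bi-encoder/cross-encoder trade-off above can be sketched with toy scorers. These are illustrative stand-ins, not the actual ms-marco-MiniLM-L-6-v2 model: a real bi-encoder embeds query and document separately with a transformer, while a real cross-encoder feeds the concatenated pair through one transformer. The point the sketch preserves is architectural: only the joint scorer can model interactions between query and document tokens, which is why it reranks a small top-K candidate set rather than scoring the whole catalog.

```python
from typing import Callable, List

def bi_encoder_score(query: str, doc: str) -> float:
    # Toy stand-in: each text is "encoded" independently (here, into a
    # bag-of-words set) and only then compared. Real bi-encoders compare
    # separately computed dense embeddings, so document vectors can be
    # precomputed and retrieval is fast.
    q_vec, d_vec = set(query.lower().split()), set(doc.lower().split())
    return len(q_vec & d_vec) / max(len(q_vec), 1)

def cross_encoder_score(query: str, doc: str) -> float:
    # Toy stand-in: the pair is scored jointly, so the scorer can look at
    # interactions between query and document tokens.
    q_tokens = query.lower().split()
    d_tokens = doc.lower().split()
    overlap = sum(1 for t in q_tokens if t in d_tokens)
    # Joint processing lets us reward an exact phrase match, something a
    # bi-encoder's independent embeddings cannot express directly.
    phrase_bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return overlap / max(len(q_tokens), 1) + phrase_bonus

def rerank_top_k(query: str, candidates: List[str], k: int,
                 scorer: Callable[[str, str], float]) -> List[str]:
    # The slow joint scorer is applied only to the top-K candidates from
    # a fast first-stage retriever; that is what makes it affordable.
    return sorted(candidates, key=lambda d: scorer(query, d), reverse=True)[:k]
```

Because `cross_encoder_score` sees both texts at once, it can rank an exact phrase match above a scrambled token match, whereas the bag-of-words bi-encoder scores both identically.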
Summary
Hopit.ai reported a surprising finding in its fashion search optimization work: a cross-encoder model trained on just $25 worth of LLM-graded labels significantly outperformed the same architecture trained on 1.5 million purchase labels. The company was investigating where the actual value lies in fine-tuning reranker models—in the model architecture, training recipe, or data quality—after discovering that an off-the-shelf 2019 cross-encoder (ms-marco-MiniLM-L-6-v2) was responsible for approximately 51% of its end-to-end search pipeline gains.
In their experiments, the team first tried the intuitive approach of fine-tuning the cross-encoder on 253K purchase queries with 1.5M training pairs derived from implicit negatives (products shown but not purchased). This conventional approach produced minimal improvements in ranking metrics. However, when they pivoted to using LLM-graded labels—starting with a $2 pilot that beat the purchase-label version, then scaling to $25—the results dramatically improved. This finding challenges conventional wisdom about data quantity in machine learning and suggests that label quality, as determined by language models, may be a more critical factor than raw training data volume for ranking tasks.
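The implicit-negative construction described above can be sketched as follows. The log schema (`query`, `shown`, `purchased` fields) is an assumption for illustration, not Hopit.ai's actual data model; the sketch shows how purchased items become positives and shown-but-not-purchased items become implicit negatives, and why the latter are noisy: a shopper can skip a perfectly relevant product, so many "negatives" are mislabeled.

```python
from typing import Dict, List, Tuple

def build_pairs_from_purchase_logs(
    sessions: List[Dict],
) -> List[Tuple[str, str, float]]:
    """Derive (query, product, label) training pairs from search-session logs.

    Hypothetical schema: each session records the search query, the product
    IDs shown, and the product IDs purchased. Purchased items get label 1.0;
    shown-but-not-purchased items get label 0.0 (implicit negatives).
    """
    pairs = []
    for session in sessions:
        purchased = set(session["purchased"])
        for product in session["shown"]:
            # Implicit negatives are noisy: "not purchased" does not mean
            # "not relevant", which plausibly explains why this labeling
            # underperformed explicit LLM grading.
            label = 1.0 if product in purchased else 0.0
            pairs.append((session["query"], product, label))
    return pairs
```

Scaling this over 253K queries, each with a handful of shown products, is how a corpus on the order of 1.5M pairs arises essentially for free.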
- Off-the-shelf models trained on unrelated domains (web search) can provide substantial value, but domain-specific fine-tuning on high-quality labels yields marked improvements
- The research shifts focus from model architecture to data strategy as the primary lever for improving ranking performance in specialized domains
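A plausible shape for the LLM-graded alternative is sketched below. The article does not publish Hopit.ai's prompt, grading scale, or sampling strategy, so the rubric, the 0-3 scale, and the keyword-overlap stub standing in for a real LLM API call are all assumptions; only the overall pipeline (grade a small top-K slice per query, normalize grades into training targets) reflects the approach described.

```python
from typing import List, Tuple

def llm_grade(query: str, product_title: str) -> int:
    """Stub for an LLM relevance grader (assumed 0-3 scale).

    A real implementation would send a rubric prompt to an LLM API, e.g.
    "Rate how well this product matches the query from 0 (irrelevant) to
    3 (exact match)", and parse the returned grade. The keyword heuristic
    below only makes the sketch runnable; it is not Hopit.ai's recipe.
    """
    overlap = len(set(query.lower().split()) & set(product_title.lower().split()))
    return min(overlap, 3)

def build_graded_dataset(
    queries: List[str], catalog: List[str], k: int = 3
) -> List[Tuple[str, str, float]]:
    dataset = []
    for query in queries:
        # Grade only a small top-K slice per query: labeling cost scales
        # with (queries x K), which is how a pilot can cost a few dollars.
        for product in catalog[:k]:
            grade = llm_grade(query, product)
            # Normalize the 0-3 grade to a [0, 1] regression target, a
            # common choice for cross-encoder training objectives.
            dataset.append((query, product, grade / 3.0))
    return dataset
```

The resulting triples plug into the same cross-encoder training loop as purchase-derived pairs; only the label source changes, which is what isolates label quality as the variable in the comparison.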
Editorial Opinion
This work challenges a common assumption in machine learning: that more data is always better. Hopit.ai's finding that $25 of carefully curated LLM-graded labels beats millions of implicit purchase labels suggests the field should reconsider how we approach labeling strategies. The shift toward LLM-generated labels as a cost-effective alternative to expensive human annotation or noisy implicit signals could have profound implications for companies building ranking systems, particularly in vertical domains like fashion where domain expertise is valuable but expensive.


