The claim that general image models struggle with garment identity should not rest on intuition. It should be a measurement that we repeat every time a frontier lab ships a new model. Bamboo bench is that measurement. It is a benchmark built from our own garment-identity graph, and it is the instrument that decides whether we train models at all.
The benchmark tests four things. The first is single-image identity: given reference photos of a garment, does a generated image contain that specific garment, with the same color, pattern, construction, and printed marks? The second is multiview consistency: when the same SKU is rendered across a set of views such as full body, close-up, back, and flat lay, is it the same garment in each one rather than a plausible garment in each one? The third is stability under editing, which measures how quickly identity drifts across a sequence of edits. The fourth is attribute fidelity, which checks whether machine-readable attributes such as color, pattern, neckline, sleeve, closure, and label text are preserved.
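One way to make the third test concrete is to express drift as the average change in an identity score per edit. The sketch below assumes a hypothetical per-image identity score between 0 and 1, where 1.0 means the garment matches its reference. The score scale, the function name, and the example numbers are illustrative assumptions, not the benchmark's published method.

```python
from typing import Sequence


def drift_per_edit(scores: Sequence[float]) -> float:
    """Least-squares slope of identity score against edit step.

    scores[0] is the score before any edit and scores[k] is the score
    after the k-th edit. A negative slope means identity is being lost.
    The 0-to-1 score scale is an assumption made for this sketch.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to measure drift")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Illustrative numbers only: identity erodes a little with each edit.
example_scores = [1.0, 0.9, 0.8, 0.7]
slope = drift_per_edit(example_scores)
```

A slope fitted over the whole sequence is less sensitive to one noisy step than the difference between the first and last score, which is why it is used here.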
The benchmark is possible because of the structure the data layer provides. Every job carries the physical SKU it belongs to and the reference images it started from, grouped together so that an output can be compared against the exact garment it was meant to depict. On top of that structure we build a human-verified core: reference-and-output pairs in which trained reviewers have judged whether the output held the garment's identity. Those judgments are the calibration data an identity benchmark needs, and the grouped structure is what makes them possible to collect at scale.
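The grouping described above can be sketched as a small data structure. Every name here (`Job`, its fields, `pairs_by_sku`, the file names) is hypothetical and chosen for illustration. The only point is that each output stays attached to the SKU and reference images it came from.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    # All field names are hypothetical; the real schema is not described here.
    job_id: str
    sku: str
    reference_images: Tuple[str, ...]
    output_image: str


def pairs_by_sku(jobs: List[Job]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (reference, output) pairs under the SKU they depict."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for job in jobs:
        for ref in job.reference_images:
            grouped[job.sku].append((ref, job.output_image))
    return dict(grouped)


# Illustrative records only.
jobs = [
    Job("j1", "SKU-001", ("ref_a.jpg", "ref_b.jpg"), "out_1.jpg"),
    Job("j2", "SKU-001", ("ref_a.jpg",), "out_2.jpg"),
    Job("j3", "SKU-002", ("ref_c.jpg",), "out_3.jpg"),
]
grouped = pairs_by_sku(jobs)
```

Each pair in `grouped` is a candidate for human review, and because pairs are keyed by SKU, outputs for the same garment can also be compared with one another.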
Automated metrics are only useful once they predict those human judgments. We segment the garment region in both the reference and the output, compare embeddings and color within that region, read label and logo text, and extract structured attributes. Each metric is calibrated against the human-verified core, and any metric that does not track human judgment is dropped rather than reported. The result is a single composite score for garment identity, supported by a breakdown of failure types such as color shift, pattern drift, logo corruption, and structural change.
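A minimal sketch of that calibration step follows, under stated assumptions. It treats each human judgment as a yes-or-no verdict, measures how well a metric separates accepted from rejected pairs using AUC (the probability that an accepted pair outscores a rejected one), and drops metrics below a cutoff. The choice of AUC, the 0.8 cutoff, the unweighted-mean composite, and the metric names are all assumptions made for illustration.

```python
from typing import Dict, List


def auc(scores: List[float], accepted: List[bool]) -> float:
    """Probability that an accepted pair outscores a rejected pair.

    Ties count as half a win. A value of 0.5 means the metric carries
    no signal about the human verdict.
    """
    pos = [s for s, ok in zip(scores, accepted) if ok]
    neg = [s for s, ok in zip(scores, accepted) if not ok]
    if not pos or not neg:
        raise ValueError("need both accepted and rejected pairs")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibrate(metrics: Dict[str, List[float]], accepted: List[bool],
              min_auc: float = 0.8) -> Dict[str, float]:
    """Keep only metrics whose AUC meets min_auc (a hypothetical cutoff)."""
    kept: Dict[str, float] = {}
    for name, scores in metrics.items():
        value = auc(scores, accepted)
        if value >= min_auc:
            kept[name] = value
    return kept


def composite(pair_scores: Dict[str, float], kept: Dict[str, float]) -> float:
    """Unweighted mean of the surviving metrics for one pair (an assumption)."""
    if not kept:
        raise ValueError("no metric survived calibration")
    return sum(pair_scores[name] for name in kept) / len(kept)


# Illustrative data: four reviewed pairs and two candidate metrics.
accepted = [True, True, False, False]
metrics = {
    "embedding_similarity": [0.9, 0.8, 0.3, 0.2],
    "color_match": [0.5, 0.4, 0.6, 0.5],
}
kept = calibrate(metrics, accepted)
score = composite({"embedding_similarity": 0.7, "color_match": 0.9}, kept)
```

In this toy data `color_match` does not separate accepted from rejected pairs, so it is excluded from the composite. A real calibration would need far more reviewed pairs than four, and might weight the surviving metrics rather than average them.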
The benchmark carries a decision rule. If the best available general model stays below the human-acceptance threshold across consecutive frontier releases, that is the signal to train. If a frontier model clears the threshold on every part of the benchmark, the case for training is closed and we continue to buy inference. In either outcome, the graph, the benchmark, and the workflow remain ours.
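The rule can be written down as a small function. This is a sketch under assumptions the text leaves open: the threshold value, the number of consecutive releases, and the part names are hypothetical, and a release counts as "below the threshold" here whenever it misses the threshold on at least one part. A third outcome, "keep measuring", covers the case where there is not yet enough history to decide.

```python
from typing import Dict, List

# Hypothetical names for the four parts of the benchmark.
PARTS = ("single_image", "multiview", "edit_stability", "attribute_fidelity")


def clears(release: Dict[str, float], threshold: float) -> bool:
    """True when a release meets the threshold on every part."""
    return all(release[part] >= threshold for part in PARTS)


def decide(history: List[Dict[str, float]], threshold: float,
           consecutive: int) -> str:
    """Apply the decision rule to per-release scores, oldest first.

    Each entry holds the best general model's score on each part for
    one frontier release. threshold and consecutive are assumptions.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if history and clears(history[-1], threshold):
        return "buy inference"
    recent = history[-consecutive:]
    if len(recent) == consecutive and not any(
        clears(r, threshold) for r in recent
    ):
        return "train"
    return "keep measuring"


# Illustrative scores only: strong on single images, weak on sets.
history = [
    {"single_image": 0.92, "multiview": 0.70,
     "edit_stability": 0.75, "attribute_fidelity": 0.88},
    {"single_image": 0.95, "multiview": 0.74,
     "edit_stability": 0.80, "attribute_fidelity": 0.91},
]
decision = decide(history, threshold=0.9, consecutive=2)
```

Requiring every part to clear, rather than an average, means a model that is excellent on single images but weak on sets of views still counts as below the threshold.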
We expect the lasting gap to appear in sets of images rather than in single images. Single-image identity is the area where frontier labs are improving fastest. Holding a garment consistent across an entire set of views is a different problem, because it requires the system to treat several images as one physical object, and that structure is present in our data and absent from data scraped at random.
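The difference between per-image and per-set evaluation can be illustrated with toy numbers. In the sketch below, each view is represented by a small made-up embedding vector. Every view is individually close to the reference, yet two of the views disagree with each other. The vectors, the function names, and the use of the lowest pairwise cosine similarity as the set-level score are assumptions for illustration only.

```python
import math
from itertools import combinations
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def per_view_scores(reference: Vector,
                    views: Dict[str, Vector]) -> Dict[str, float]:
    """Score each view against the reference on its own."""
    return {name: cosine(reference, emb) for name, emb in views.items()}


def set_consistency(views: Dict[str, Vector]) -> float:
    """Lowest pairwise similarity between views of one SKU."""
    if len(views) < 2:
        raise ValueError("need at least two views")
    return min(cosine(a, b) for a, b in combinations(views.values(), 2))


# Toy embeddings, not real model output.
reference = (1.0, 0.0, 0.0)
views = {
    "full_body": (1.0, 0.5, 0.0),
    "back": (1.0, -0.5, 0.0),
    "flat_lay": (1.0, 0.0, 0.5),
}
single = per_view_scores(reference, views)  # each is about 0.89
worst_pair = set_consistency(views)         # about 0.60
```

Here each view scores about 0.89 against the reference, while the full-body and back views score only about 0.60 against each other. A benchmark that looked only at single images would miss that disagreement.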
Bamboo bench is in development. We plan to publish the methodology, and the results of scoring frontier models, as each becomes available.