Using Machine Learning for Visual Searches
Why We Did It
At Squarespace, we have a wide variety of customers with diverse goals and even more diverse websites. Our platform hosts websites for nearly everything, from artisanal hot sauce to Tasmanian markets to Aaron Carter. However, such amazing variety also presents a challenge when developing tools for so many different use cases. How do we become familiar with such a wide range of customers to build them the products they need?
We decided to build an internal search engine to navigate all of our customer sites. We could have used traditional text search, but many Squarespace sites are heavily visual and contain a lot of images and photos. If we were to only search with text, we’d miss crucial visual and stylistic features of many Squarespace sites.
Thus, we went with a visual search engine that could find aesthetically similar sites and even search for specific visual styles.
We broke down the process of building our visual search engine into three steps:
- Get data to search
- Train a machine-learning model to make our data searchable
- Build a search index
The Data
For our dataset, we chose to focus on screen captures of a site’s front page because it’s more likely to have dynamic content than other pages on the site.
To get the screen captures, we turned to our trusty friend ImageMagick. We launched a command-line Chrome instance to visit each site, and ImageMagick took a 1,300-pixel × 900-pixel × 3-color screen capture of that site’s front page. The raw dataset came out to about 250 gigabytes.
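We haven't included our exact capture pipeline here, but the idea looks roughly like the sketch below, which uses headless Chrome's built-in --screenshot flag as a simplified stand-in for the Chrome-plus-ImageMagick setup described above (the URL list and output directory are placeholders):

```python
import subprocess
from pathlib import Path

OUT_DIR = Path("screenshots/raw")   # hypothetical output location
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Hypothetical list of customer-site URLs to capture.
urls = [
    "https://example-one.squarespace.com",
    "https://example-two.squarespace.com",
]

for i, url in enumerate(urls):
    # Headless Chrome renders the front page and writes a 1,300 x 900 PNG.
    subprocess.run(
        [
            "google-chrome",
            "--headless",
            "--window-size=1300,900",
            f"--screenshot={OUT_DIR}/{i}.png",
            url,
        ],
        check=True,
    )
```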
For the first version of our model, we decided to downsize the images to 64 × 64 × 3. The full-size raw images were a bit too large to be fed easily into a GPU. Each raw image is 14.04 megabytes of uncompressed data (1,300 pixels × 900 pixels × 3 colors × 32 bits per floating-point number). The smaller size is only 49.15 kilobytes (64 pixels × 64 pixels × 3 colors × 32 bits), and lets us train our model more quickly using larger mini-batches of data.
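The downsizing step is simple; here's a minimal sketch using Pillow, assuming the raw captures are already saved as PNGs (the directory names are placeholders):

```python
from pathlib import Path
from PIL import Image

RAW_DIR = Path("screenshots/raw")      # hypothetical locations
SMALL_DIR = Path("screenshots/64x64")
SMALL_DIR.mkdir(parents=True, exist_ok=True)

# Shrink each 1,300 x 900 capture down to the 64 x 64 x 3 training size.
for path in RAW_DIR.glob("*.png"):
    img = Image.open(path).convert("RGB")   # keep the 3 color channels
    img.resize((64, 64)).save(SMALL_DIR / path.name)
```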
The Model
Images have many individual features, including color, shapes, and patterns, that we could search heuristically. A machine-learning model, however, can capture all of that information and encode an image into an easy-to-search format, which meant we didn't need to specify any features manually.
A commonly used easy-to-search format is a plain numeric vector (similar to the x- and y-coordinate pairs from high school graphing). In our case, we wanted to force our model to capture more than two dimensions of information, so we used 256-dimensional vectors, which we expected to be large enough to hold all the information of interest.
After that, we needed a model that could take an image and encode it into a 256-dimensional vector. Fortunately, autoencoders, a class of machine-learning algorithms, have this power. For our purposes, the autoencoder would learn how to squash images into vectors from which we could recreate the original image.
We specifically tried out a slightly more advanced autoencoder that has convolutional and variational components inside of it. The variational component serves as a regularizer—a restriction to make sure the model doesn’t focus on unimportant features of an image. The convolutional components help the model recognize patterns in images. Convolutional layers in a neural net are widely used for image processing and general computer-vision tasks.
We built the model in TensorFlow and Keras, and then trained it on an AWS GPU instance. GPUs are very useful for training deep neural nets with convolutional layers, partly because they allow highly parallel computation and partly because of specialized software that speeds up training.
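To make the architecture concrete, here's a minimal Keras sketch of a convolutional autoencoder with a 256-dimensional bottleneck. It's only an illustration, not the exact model we trained: the layer sizes are made up, and the variational component (a sampled latent vector plus a KL-divergence penalty on the loss) is left out to keep the example short.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 256  # the size of the searchable vector

# Encoder: 64 x 64 x 3 screenshot -> 256-dimensional vector.
inputs = layers.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # 32 x 32
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)       # 16 x 16
x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)      # 8 x 8
x = layers.Flatten()(x)
latent = layers.Dense(LATENT_DIM, name="latent")(x)

# Decoder: 256-dimensional vector -> reconstructed 64 x 64 x 3 image.
d = layers.Dense(8 * 8 * 128, activation="relu")(latent)
d = layers.Reshape((8, 8, 128))(d)
d = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(d)
d = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(d)
decoded = layers.Conv2DTranspose(3, 3, strides=2, padding="same",
                                 activation="sigmoid")(d)

autoencoder = Model(inputs, decoded)   # trained to reconstruct its input
encoder = Model(inputs, latent)        # used later to build the search index
autoencoder.compile(optimizer="adam", loss="mse")
```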
We then watched the model train by launching TensorBoard and checking whether the reconstructed images looked similar to the original screenshots.
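Hooking TensorBoard into training takes a single Keras callback. In the sketch below, the log directory and hyperparameters are placeholders, and train_images stands in for the array of downsized screenshots scaled to [0, 1]; comparing reconstructions against the originals requires logging image summaries, which we won't show here.

```python
# train_images: hypothetical NumPy array of 64 x 64 x 3 screenshots scaled to [0, 1].
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/visual-search")

autoencoder.fit(train_images, train_images,   # learn to reconstruct the input
                epochs=50, batch_size=256,
                callbacks=[tensorboard_cb])
```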
Once we had a fully trained model, we encoded all of our images into vectors and constructed a search index.
The Search Index
Once we had vector representations of each image, we were able to treat visual search as a nearest-neighbor problem.
Two commonly used metrics for determining nearest neighbors are cosine (angular) distance and Euclidean (straight-line) distance. We chose cosine distance because it captures the style, or orientation, of a vector rather than its magnitude, which Euclidean distance also takes into account.
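For reference, here's a minimal NumPy sketch of the cosine distance between two encoded vectors:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cos(theta): 0 for identical orientation, up to 2 for opposite."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```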
Cosine distance is straightforward to compute; however, a brute-force search against every site is far too computationally expensive. Ideally, we would compare a query against only its 100 or so likeliest neighbors rather than against the entire dataset. We decided to use a class of algorithms called “approximate nearest neighbors” that explicitly trades accuracy for speed. Plenty of libraries implement different kinds of approximate-nearest-neighbors algorithms, and we particularly liked Spotify’s Annoy library.
Annoy takes the entire dataset of vectors and subdivides it into many small regions. Using Annoy, we could take a query vector, find which region it fell into, and compare it only against the vectors in that region. This dramatically reduced the number of comparisons we had to make, down to a few hundred or a few thousand.
Our visual search results are then the nearest neighbors that come out of Annoy when we send in an encoded image.
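Here's a sketch of how this fits together with Annoy. The names vectors, encoder, and query_image stand in for the encoded screenshots and the trained encoder from the sketches above, and the tree count and file name are placeholders.

```python
from annoy import AnnoyIndex

LATENT_DIM = 256
index = AnnoyIndex(LATENT_DIM, "angular")   # angular distance corresponds to cosine distance

# vectors: one 256-dimensional encoding per screenshot, e.g. encoder.predict(images)
for i, vec in enumerate(vectors):
    index.add_item(i, vec)

index.build(50)          # number of trees: more trees, better accuracy, bigger index
index.save("sites.ann")

# To search: encode a screenshot and ask for its 100 approximate nearest neighbors.
query_vec = encoder.predict(query_image[None, ...])[0]
neighbor_ids = index.get_nns_by_vector(query_vec, 100)
```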
Results
Now, with our search index, we can check out the results and see the quality of our search engine.
Here’s an example of a robotic head we used to search:
The model finds the structure of the images fairly well. The top results all have a black background with a single circle-like structure in the middle.
Here’s another example that shows the model capturing color and structure:
Now the results show a rather open-air landscape style. The model does a good job of capturing styles, and the approximate-nearest-neighbors algorithm also gives us high-quality results despite comparing against only a small fraction of the screenshots.
Here’s a third example which shows that the model learns structure as well:
In general, the model does well in understanding the styles of the screenshots we take.
We don’t have a quantitative measure of the quality of the search results yet, but what we see shows that our visual search engine works quite well for perusing visually similar sites. Using this tool and the collections of sites it provides us, we can promote customers, identify common themes, and discover unique uses of the Squarespace platform. This input helps us plan for new features and products to improve the platform.