Hit hard by the pandemic, the retail sector is on the lookout for innovation. Among the many technologies retailers focus on, artificial intelligence is an undeniable leader. The market for artificial intelligence solutions in retail is projected to reach $23.32 billion by 2027, quite a leap from $5.06 billion in 2021. Within AI, computer vision and image recognition have become notable areas of interest for the retail sector: the global market for retail image recognition software is expected to grow at a CAGR of 22% and reach $3.7 billion by 2025. By bringing image recognition into their technology mix, retailers hope to optimize inventories, simplify checkouts, and improve customer experience. In this blog post, we look at how retail image recognition works, explore its applications for online and brick-and-mortar businesses, and highlight what to keep in mind to implement image recognition for retail hassle-free. Let's start with the essentials.

What is image recognition technology?

With CCTV cameras installed in nearly every store, retailers have gathered massive volumes of visual data. In many cases, sadly, it is destined to remain a mere collection of files.
Image recognition technology teaches a computer to "understand" visual information so that it can be put to use.
For instance, image recognition enables self-checkout systems that can tell whether a product placed in front of an embedded camera is a coffee jar or a soda bottle and accurately identify its stock keeping unit (SKU).

How does retail image recognition work under the hood?

Deep learning based on convolutional neural networks (CNNs) is the prevalent technique for image recognition. A basic CNN used for retail image recognition features two components: an object detector and an object classifier. The detector spots an object in an input image, places it into a bounding box, and crops it out. If an image features several products, the CNN crops each object out of the original image and passes the crops to several parallel branches for processing. The classifier, in turn, recognizes the objects based on the knowledge gained during training on reference images. Here's what the entire process may look like when visualized:
[Figure: Image recognition]
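
The detect, crop, and classify flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a plain nested list of pixel values, and `toy_detector` and `toy_classifier` are hypothetical stand-ins for trained networks, not functions from any real library.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]           # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]   # (left, top, right, bottom); right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the region inside a bounding box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def recognize(image: Image,
              detector: Callable[[Image], List[Box]],
              classifier: Callable[[Image], str]) -> List[Tuple[Box, str]]:
    """Run the detector once, then classify each cropped object on its own."""
    return [(box, classifier(crop(image, box))) for box in detector(image)]


# Hypothetical stand-ins for trained models (assumptions for illustration only).
def toy_detector(image: Image) -> List[Box]:
    # Pretend two products were found: one on the left, one on the right.
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


def toy_classifier(patch: Image) -> str:
    # Pretend bright patches are coffee jars and dark ones are soda bottles.
    mean = sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))
    return "coffee jar" if mean > 127 else "soda bottle"


shelf_image = [
    [200, 210, 30, 20],
    [220, 230, 10, 40],
]
results = recognize(shelf_image, toy_detector, toy_classifier)
for box, label in results:
    print(box, label)
```

In a real system, the two stand-ins would be replaced by a trained detector and a trained classifier, while the surrounding flow would stay roughly the same.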

On a bit more technical side of the matter…

The approach described above forms the basis of many retail image recognition models. Two of the most popular are R-CNN and YOLO. Both are deep learning model families, and both work well for retail product recognition. Let's briefly recap the details of each.

R-CNN

The R-CNN family includes R-CNN, Fast R-CNN, and Faster R-CNN, all designed explicitly for object localization and recognition. The architecture of the original R-CNN model comprises three components:
  • A region proposal module that generates bounding box candidates
  • A feature extractor that identifies features for each candidate
  • A classifier that assigns the extracted features a class label
R-CNN passes every proposed region through the CNN separately, which significantly slows the model down. On average, it takes R-CNN 47 seconds to analyze one image. Therefore, the speedier variations of the model are mainly used today. With Fast R-CNN, an image is fed into the network only once. As a result, the model needs approximately 0.32 seconds to analyze an image, which is about 146 times faster than the original R-CNN. The authors of Faster R-CNN improved the architecture further: Faster R-CNN is roughly ten times faster than Fast R-CNN and about 250 times faster than R-CNN, which makes it a strong choice within the R-CNN family for latency-critical applications.

YOLO

The YOLO family is a bit less accurate than the R-CNN family. Its lower predictive accuracy can be traced back to occasional localization errors. The upside of YOLO is its high processing speed. Operating at 45 FPS in the default version and 155 FPS in the speed-optimized version, YOLO is well suited for real-time image recognition. The approach relies on a single neural network: taking an image as input, it localizes bounding boxes and directly predicts a class label for each of them.
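
To put the speed figures quoted above side by side, here is a small arithmetic sketch. It only reuses the numbers from the text (47 seconds, 0.32 seconds, 45 FPS, 155 FPS); the helper name `frame_budget_ms` is our own and purely illustrative.

```python
# Per-image processing times quoted in the text, in seconds.
R_CNN_SECONDS = 47.0
FAST_R_CNN_SECONDS = 0.32

# How many times faster Fast R-CNN is than the original R-CNN.
speedup = R_CNN_SECONDS / FAST_R_CNN_SECONDS
print(f"Fast R-CNN speedup over R-CNN: {speedup:.1f}x")


def frame_budget_ms(fps: float) -> float:
    """Time available to process a single frame at a given frame rate."""
    return 1000.0 / fps


# Frame rates quoted in the text for the two YOLO variants.
for name, fps in [("YOLO (default)", 45), ("YOLO (speed-optimized)", 155)]:
    print(f"{name}: {frame_budget_ms(fps):.1f} ms per frame")
```

The per-frame budget is what matters for live camera feeds: at 45 FPS a model has a little over 22 ms per frame, whereas a model that needs seconds per image can only work on snapshots.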

Image recognition in retail: essential use cases

Businesses have started leveraging retail software solutions to achieve many goals, from optimizing inventories to delivering a better shopping experience for their customers. Here are the uses of image recognition that are gaining momentum among retailers today.

Product audits

According to a Stanford study, manual audits in retail are time-consuming and inaccurate, with error rates reaching as high as 20%. Image recognition technology helps standardize audits to get consistent and accurate data. The information interpreted by image recognition software can help track sales trends, too. By tapping into data on how well different brands and SKUs are selling, retailers may boost the sales of priority SKUs by placing them closer to the buyer.

Planogram compliance

The way products are merchandised profoundly influences buying decisions. Image recognition helps ensure that the arrangement of goods on the shelf matches the planogram. Object recognition algorithms scan a supermarket shelf, detect the products, and classify them by manufacturer, brand, or SKU. The solution compares the results to a reference planogram and notifies retailers about mismatches, if any.

Detecting empty shelves

According to a study conducted by IHL Group, the worldwide retail industry misses out on $984 billion in sales due to products being out of stock. Image recognition helps retailers avoid losing money and customers. When an SKU is missing from the shelf, image recognition software notifies the staff that it needs to be replenished.

Self-checkout systems and stores

A self-checkout system allows customers to place their purchases in front of a camera, without having to line each item up with a scanner the way barcodes require, and immediately proceed to payment. According to numerous studies, customers find self-checkout options more convenient, fast, and enjoyable. A more advanced take on self-checkout is a cashierless store. In such stores, an image recognition system takes in data from CCTV cameras or cameras embedded in a shopping cart to recognize the purchases and automatically charge the customer. Payment in such cases may be handled via a mobile app, a self-service kiosk, or even by scanning one's palm at a store gate.

Retail AR applications

Product image recognition pairs well with augmented reality, too, enabling real-time marketing and making online shopping more convenient and engaging. The combination of technologies brings all kinds of interactive experiences to life, from visualizing product catalogs (Ikea) to providing additional information on merchandised products (IBM Research) to enticing customers to pop inside a store (IBM Hugo Boss).

Helping visually impaired customers

Packaged products are extremely difficult to tell apart by touch. Image recognition software can help people with visual impairments shop independently by reading the labels and text on the packaging out loud.
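
The planogram compliance and empty-shelf checks described above boil down to comparing what the recognition model sees with what the planogram expects. Below is a minimal sketch of that comparison step only. It assumes the recognition model has already turned a shelf photo into an ordered list of SKUs per shelf, with `None` for a slot where nothing was detected; the shelf names, SKU codes, and the `audit_shelves` function are hypothetical.

```python
from typing import Dict, List, Optional

# Each shelf maps to an ordered list of slots.
# None marks a slot where the recognition model detected no product.
Shelves = Dict[str, List[Optional[str]]]


def audit_shelves(planogram: Shelves, detected: Shelves) -> List[str]:
    """Compare detected SKUs with the reference planogram, slot by slot."""
    issues = []
    for shelf, expected_slots in planogram.items():
        found_slots = detected.get(shelf, [])
        for position, expected in enumerate(expected_slots):
            found = found_slots[position] if position < len(found_slots) else None
            if found is None:
                # Empty slot: the staff should replenish the expected SKU.
                issues.append(f"{shelf}[{position}]: empty, restock {expected}")
            elif found != expected:
                # Wrong product in the slot: planogram mismatch.
                issues.append(f"{shelf}[{position}]: expected {expected}, found {found}")
    return issues


# Hypothetical reference planogram and recognition output.
planogram = {"shelf_1": ["SKU-COFFEE-200G", "SKU-COLA-330ML", "SKU-TEA-50"]}
detected = {"shelf_1": ["SKU-COFFEE-200G", None, "SKU-COLA-330ML"]}

issues = audit_shelves(planogram, detected)
for issue in issues:
    print(issue)
```

A production system would also have to map bounding boxes to shelf slots and tolerate detection noise; this sketch leaves both out.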

A run-through of benefits image recognition drives in retail

Image recognition brings about significant improvements to how retail businesses run, namely:
  • The sales reps get to spend more time on sales instead of manually doing the paperwork
  • Retailers get the chance to maintain visual consistency across multiple stores within a single chain
  • Manufacturers get an opportunity to adjust production volumes based on brand performance and distribute products according to customer demand
  • Retailers prevent overstocking and stock-outs, as well as make sure customers are always served fresh products
  • Retailers sell more effectively due to analytics-driven product placement

Building an image recognition solution for retail: key points to remember

If you have your mind set on implementing an image recognition system for retail, here are vital things to remember.

Custom vs. library-based development

You can either train a product recognition model from scratch or use an already trained deep learning model, like the previously mentioned Fast R-CNN or YOLO. Going the custom route takes more time and effort, but it allows you to create a model that meets your specific needs. Going for a pre-trained deep learning model can cut down development effort, but don't be tricked into thinking it can be deployed right away. Because of the specifics of the data that publicly available models are trained on, they often require additional training on custom datasets.

The requirements for training data

Either way, you have to train the deep learning model to achieve accurate product recognition. When assembling a training dataset, make sure you have enough data entries. Deep learning models require large volumes of annotated data, so it may be challenging to achieve high accuracy if you only have a few examples. Another point to keep in mind is the variability of the training dataset. The number of SKUs in one supermarket can reach thousands, but the public datasets commonly used for training image recognition models fail to represent the variety of products found on supermarket shelves. PASCAL VOC, for example, contains 20 classes of objects, while COCO features 80 object categories. So, be ready to collect additional footage featuring diverse product categories. Adding to the challenge, the object detection datasets behind popular product recognition models feature images taken in conditions far from natural. Hence, for the model to recognize various products in real-life situations, it needs to be trained on footage that accurately represents reality.
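
One simple way to act on the training data advice above is to count annotated examples per SKU before training and flag the ones that fall short. The sketch below assumes annotations are available as a flat list of SKU labels; the SKU names, the `underrepresented_skus` helper, and the threshold of 3 are made up for illustration (a real threshold would be far higher).

```python
from collections import Counter
from typing import Dict, Iterable, List


def underrepresented_skus(catalog: List[str],
                          labels: Iterable[str],
                          min_examples: int) -> Dict[str, int]:
    """Return catalog SKUs with fewer annotated examples than the chosen minimum."""
    counts = Counter(labels)
    # Counter returns 0 for SKUs that never appear in the annotations.
    return {sku: counts[sku] for sku in catalog if counts[sku] < min_examples}


# Hypothetical catalog and annotation labels.
catalog = ["cookie-choc", "cookie-vanilla", "cola-330"]
labels = ["cookie-choc"] * 5 + ["cookie-vanilla"] * 2

gaps = underrepresented_skus(catalog, labels, min_examples=3)
print(gaps)
```

SKUs reported here are the first candidates for collecting additional footage.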
Keeping an eye on interclass variation

Apart from differentiating product classes, a retail image recognition solution should distinguish products from the same category, say, differently flavored cookies of the same brand. The packaging of such products usually features minor differences that are difficult to recognize, even for the human eye. To ensure your deep learning model accurately tells those apart, be ready to invest time in additional data labeling.

Adjusting the deep learning model

Retailers regularly introduce new SKUs to attract customers. The packaging of products on the market changes quite frequently, too. This calls for additional training of the deep learning model powering your retail application so that it accurately recognizes new SKUs.
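
A lightweight way to notice when additional training is due is to compare the classes the model was trained on with the current product catalog. The sketch below is an assumption-laden illustration: `catalog_drift` and the SKU names are hypothetical, and it only catches added or removed SKUs, not packaging changes within an existing SKU.

```python
from typing import Iterable, List, Tuple


def catalog_drift(model_classes: Iterable[str],
                  catalog: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (new SKUs the model has never seen, SKUs no longer in the catalog)."""
    known, current = set(model_classes), set(catalog)
    new_skus = sorted(current - known)
    retired_skus = sorted(known - current)
    return new_skus, retired_skus


# Hypothetical model classes and current catalog.
model_classes = ["cookie-choc", "cookie-vanilla", "cola-330"]
current_catalog = ["cookie-choc", "cookie-vanilla", "cookie-caramel", "cola-500"]

new_skus, retired_skus = catalog_drift(model_classes, current_catalog)
print("Need training data:", new_skus)
print("No longer stocked:", retired_skus)
```

Packaging redesigns would need a different signal, for example a drop in recognition confidence for an SKU that is still in the catalog.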
In the coming years, retailers are expected to leverage image recognition software to the fullest. If you want to implement a retail image recognition solution and are looking for a reliable partner, drop us a line, and we'll help you out.