A leading developer of social networking apps with tens of millions of users
Media & Entertainment
TensorFlow, Python, TensorRT, Ray, FastAPI, microservices


With live streaming going mainstream across social media platforms, live stream content moderation has become a pressing concern. Increasing regulatory pressure has prompted social media organizations to act. Many have decided to reduce their reliance on human moderators, no longer viewed as efficient at scale, and to invest in AI content moderation systems instead.

The rising volume of inappropriate content was also an issue for our client — one of the world's largest developers of social networking apps. Millions of minutes of live streams were broadcast on their apps daily, and more than one-third of the staff was involved in managing online content and digital communities. To ensure compliance with standards and support human moderators, the client wanted to leverage AI algorithms that would automate live stream content policing.

The company was looking for experienced AI/ML and MLOps engineers. They had cooperated with ITRex previously to create functionality for their apps, roll out cloud infrastructure using DevOps practices, and build back-end services. Knowing that ITRex has rich experience in ML development and MLOps, they turned to us once again.

Specifically, our team took on the following challenges:
Develop a computer vision model using ML — in particular, deep learning — for sampling and analyzing live streams and taking action accordingly
Apply MLOps best practices to accelerate the deployment of the ML model to perform inference

Our approach:

ML tool for live stream content moderation

As the first step, our AI/ML engineers developed a powerful image classifier using a convolutional neural network (CNN). The tool was designed to take a screenshot of every broadcast at regular intervals and analyze it for unsafe content, either approving the frame or escalating it to human moderators. The key features of this ML-powered content moderation tool were:
A CNN-based image classifier trained using TensorFlow
Ability to classify images across multiple categories
Sequential image processing
Deployed on a cluster of AWS EC2 instances with auto-scaling
Run on central processing units (CPUs), a cost-effective option that remained sufficient for inference while the client's user base and the volume of user-generated content were relatively small
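To make the first step concrete, below is a minimal sketch of such a CNN-based frame classifier in TensorFlow/Keras. The category names, input size, and network depth are illustrative assumptions, not the client's actual taxonomy or architecture; the production model would be deeper and trained on labeled frames.

```python
# Minimal sketch of a CNN frame classifier in TensorFlow/Keras.
# Category names and input size are illustrative assumptions.
import numpy as np
import tensorflow as tf

CATEGORIES = ["safe", "suggestive", "violent", "explicit"]  # hypothetical labels

def build_classifier(input_shape=(224, 224, 3), num_classes=len(CATEGORIES)):
    """A small CNN; probabilities over the content categories come out of a softmax."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_classifier()
# One screenshot sampled from a broadcast (random pixels stand in for a real frame).
frame = np.random.rand(1, 224, 224, 3).astype("float32")
probs = model.predict(frame, verbose=0)
```

A frame would then be escalated to human moderators whenever the probability of any unsafe category crosses a chosen threshold.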
ML model inference optimization

As the user base of the client's apps grew significantly, CPU performance no longer matched the scale of user-generated broadcasts. The number of live stream samples to analyze rose to a few thousand per second, and the inference speed of the ML model became a bottleneck. It was critical to find a more efficient and cost-effective hardware solution so that the ML model could process images at scale and at speed. Our MLOps engineers suggested moving the model to a graphics processing unit (GPU) to take advantage of its ability to perform massive numbers of processing operations simultaneously. This move delivered substantial efficiency gains, making inference roughly an order of magnitude faster and cheaper. In particular, the following was implemented:
Adoption of MLOps best practices to manage processes across the lifecycle of the ML model
One ML model per GPU, run on an auto-scaled cluster of microservices
Batch processing of images instead of sequential processing
New Python libraries added, including Ray to improve service architecture and deployment and TensorRT to optimize model performance on GPU
Implementation of REST API service with FastAPI
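The switch from sequential to batch processing is the heart of the GPU gain: instead of classifying frames one at a time, incoming frames are buffered and the model is invoked once per batch. The pure-Python sketch below illustrates that batching idea only; in the actual service this role was played by Ray-based serving in front of a TensorRT-optimized model, and the names here are illustrative.

```python
# Sketch of the batching switch: buffer incoming frames and run the
# model on a whole batch instead of frame by frame.
from typing import Callable, List

class FrameBatcher:
    def __init__(self, infer_batch: Callable[[List[bytes]], List[str]],
                 max_batch: int = 32):
        self.infer_batch = infer_batch   # batched model call (e.g. GPU inference)
        self.max_batch = max_batch
        self._buffer: List[bytes] = []
        self.results: List[str] = []

    def submit(self, frame: bytes) -> None:
        """Queue one frame; trigger inference once a full batch has accumulated."""
        self._buffer.append(frame)
        if len(self._buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        """Run inference on whatever is buffered (handles the partial last batch)."""
        if self._buffer:
            self.results.extend(self.infer_batch(self._buffer))
            self._buffer = []

# Stand-in for GPU inference: label every frame "safe".
batcher = FrameBatcher(lambda frames: ["safe"] * len(frames), max_batch=4)
for i in range(10):
    batcher.submit(b"frame-%d" % i)
batcher.flush()  # drain the partial final batch
```

In production, a timeout alongside `max_batch` keeps latency bounded when traffic is light, which is exactly what framework-level batching utilities provide out of the box.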


Efficient real-time moderation of live streaming content using deep learning algorithms to achieve compliance with numerous regulations around broadcasting and the company’s safety policies
ML model optimization:
10x improvement in the ML model's throughput, with processing speed increasing from a few dozen images per second on four or more CPUs to up to 500 images per second on a single GPU
14x reduction in deep learning infrastructure costs after the model's transfer to GPUs, with the number of instances dropping from around 200 to four
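The reported figures can be sanity-checked with back-of-the-envelope arithmetic. The exact "few dozen" CPU baseline is approximate (48 images per second is assumed here for illustration); note that the 14x cost reduction is smaller than the drop in instance count, since a GPU instance costs more per hour than a CPU one.

```python
# Rough sanity check of the reported gains (baseline figures are assumptions).
cpu_imgs_per_s = 48           # "a few dozen" images/s on four CPUs (assumed)
gpu_imgs_per_s = 500          # up to 500 images/s on a single GPU
throughput_gain = gpu_imgs_per_s / cpu_imgs_per_s   # on the order of 10x

instances_before = 200
instances_after = 4
instance_reduction = instances_before / instances_after  # 50x fewer instances
# Cost fell "only" 14x because each GPU instance is pricier per hour.
```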
