
AI-powered corporate contact extraction system for Dun & Bradstreet

Client
Dun & Bradstreet
Industry
Business intelligence & commercial data
Services
End-to-end AI solution development
Tech stack
LLMs, Transformers, PyTorch, spaCy, scikit-learn, FastAPI, PostgreSQL, Kubernetes, Vertex AI, GCP

Challenge

Dun & Bradstreet provides organizations with access to business data and insights across 250 markets worldwide. To maintain its leadership in corporate intelligence, Dun & Bradstreet wanted its data management platform to deliver accurate, complete, and continuously updated contact information for millions of clients.

Dun & Bradstreet sought an AI-powered solution that could:
Continuously monitor relevant corporate websites
Extract leadership and employee information with high accuracy
Stay fully compliant with privacy regulations by parsing only websites that explicitly permit data sharing
Scale to support multiple languages and markets
Dun & Bradstreet engaged ITRex to build an intelligent contact discovery and extraction system that would integrate seamlessly with its platform and keep the data fresh.

Our responsibilities

Our team was tasked with:
Building a robust API to interface between web sources and AI models
Developing multiple machine learning (ML) models from scratch for web page discovery, contact extraction, and filtering
Fine-tuning a lightweight transformer model to achieve near state-of-the-art performance at lower computational cost
Aggregating, cleaning, and annotating a high-quality dataset for model training
Ensuring scalability for multilingual support (English first, with future expansion to European and Asian languages)

Solution

We developed a multi-layered AI-driven contact extraction system with four core models and a supporting back end:
Discovery model: A custom-built statistical ML model that searches for relevant webpages within corporate websites. It assigns a relevance score to each webpage, ensuring only meaningful sources are processed.
Dual LLM field extraction pipeline: Two LLMs extract personal details such as names, job titles, phone numbers, and LinkedIn profiles. For highly relevant, information-intensive sites, we integrated a state-of-the-art proprietary LLM for maximum accuracy. For all other websites, we use a compact LLM (<1B parameters) that our team fine-tuned to achieve near-optimal performance at a lower computational cost.
Classification & filtering model: Another ML model that we built from scratch. It filters out irrelevant contacts (e.g., guest lecturers, contractors, trainers) to ensure only permanent corporate employees are included. The model also assigns a relevance score to each contact, and only entries that score above a configurable threshold become part of the final output.
Back end & infrastructure: We built a secure API to manage the interaction between the web crawler and the AI models. Our team also adapted the client’s web crawler to scan only the corporate websites that permit data sharing and to assist our AI models with data extraction.
To train these models, we aggregated, cleaned, and annotated a high-volume dataset of corporate websites. To accelerate data labeling, we used a proprietary LLM to assist with preliminary annotations, which human experts then validated and improved.
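The way the models described above hand work to one another can be sketched in a few lines of Python. This is an illustrative sketch only, not the production implementation: the names `run_pipeline`, `Page`, and `Contact`, the threshold values, and the idea of routing a page to the larger LLM based on its discovery score are all assumptions made for the example. The models themselves are passed in as plain functions so the control flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Page:
    url: str
    text: str


@dataclass(frozen=True)
class Contact:
    name: str
    title: str
    source_url: str


# Hypothetical values; the case study does not state the real thresholds.
PAGE_RELEVANCE_MIN = 0.5      # below this, the page is not processed at all
HIGH_VALUE_PAGE_MIN = 0.9     # at or above this, use the larger LLM (assumed routing rule)
CONTACT_RELEVANCE_MIN = 0.7   # configurable cut-off for the final output

Extractor = Callable[[Page], List[Contact]]


def run_pipeline(
    pages: Iterable[Page],
    score_page: Callable[[Page], float],        # stands in for the discovery model
    extract_large: Extractor,                   # stands in for the proprietary LLM
    extract_compact: Extractor,                 # stands in for the fine-tuned compact LLM
    score_contact: Callable[[Contact], float],  # stands in for the filtering model
    contact_threshold: float = CONTACT_RELEVANCE_MIN,
) -> List[Contact]:
    """Discover relevant pages, extract contacts, keep those above the threshold."""
    kept: List[Contact] = []
    for page in pages:
        relevance = score_page(page)
        if relevance < PAGE_RELEVANCE_MIN:
            continue  # discovery step: skip pages that are not meaningful sources
        # Extraction step: pick one of the two LLMs for this page.
        extractor = extract_large if relevance >= HIGH_VALUE_PAGE_MIN else extract_compact
        for contact in extractor(page):
            # Filtering step: keep only entries scoring strictly above the threshold.
            if score_contact(contact) > contact_threshold:
                kept.append(contact)
    return kept
```

Keeping the threshold as a parameter mirrors the configurable cut-off described for the filtering model: the same pipeline can be made stricter or more permissive per market without retraining anything.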

Progress

The English-language pipeline is already in production, and we’re currently working on support for European languages.

Impact

So far, the AI-powered corporate contact extraction tool has:
Added over 100 million new, verified contacts to the client’s data platform
Ensured continuous accuracy and freshness of leadership and employee data
Reduced manual data verification efforts
Scaled Dun & Bradstreet’s platform capabilities, reinforcing its competitive edge
