
AI-powered corporate contact extraction system for Dun & Bradstreet

Client

Dun & Bradstreet

Industry

Business intelligence & commercial data

Services

End-to-end AI solution development

Tech stack

LLMs, Transformers, PyTorch, Spacy, Scikit-Learn, FastAPI, PostgreSQL, Kubernetes, Vertex AI, GCP

Challenge

Dun & Bradstreet provides organizations with access to business data and insights across 250 markets worldwide. To maintain its leadership in corporate intelligence, Dun & Bradstreet wanted its data management platform to deliver accurate, complete, and continuously updated contact information for millions of clients.

Dun & Bradstreet sought an AI-powered solution that could:

● Continuously monitor relevant corporate websites

● Extract leadership and employee information with high accuracy

● Stay fully compliant with privacy regulations by parsing only websites that explicitly permit data sharing

● Scale to support multiple languages and markets

Dun & Bradstreet engaged ITRex to build an intelligent contact discovery and extraction system that would seamlessly integrate with their platform and keep the data fresh.

Our responsibilities

Our team was tasked with:

● Building a robust API to interface between web sources and AI models

● Developing multiple machine learning (ML) models from scratch for web page discovery, contact extraction, and filtering

● Fine-tuning a lightweight transformer model to achieve near state-of-the-art performance at lower computational cost

● Aggregating, cleaning, and annotating a high-quality dataset for model training

● Ensuring scalability for multilingual support (English first, with future expansion to European and Asian languages)


Solution

We developed a multi-layered AI-driven contact extraction system with four core models and a supporting back end:

● Discovery model: A custom-built statistical ML model that searches corporate websites for relevant webpages. It assigns each webpage a relevance score so that only meaningful sources are processed.

● Dual LLM field extraction pipeline: Two LLMs extract personal details such as names, job titles, phone numbers, and LinkedIn profiles. For highly relevant, information-intensive sites, we integrated a state-of-the-art proprietary LLM for maximum accuracy. For all other websites, we use a compact LLM (<1B parameters) that our team fine-tuned to achieve near state-of-the-art performance at lower computational cost.

● Classification & filtering model: Another ML model that we built from scratch. It filters out irrelevant contacts (e.g., guest lecturers, contractors, trainers) so that only permanent corporate employees are included. The model also assigns a relevance score to each contact, and only entries that score above a configurable threshold make it into the final output.

● Back end & infrastructure: We built a secure API to manage the interaction between the web crawler and the AI models. Our team also adapted the client’s web crawler to scan only the corporate websites that permit data sharing and to assist our AI models with data extraction.
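
For illustration only, the way the four models hand work to each other could be sketched roughly as follows in Python. This is a simplified sketch, not the production code: the names `Page`, `Contact`, and `run_pipeline`, the callable parameters, and all threshold values are hypothetical, and the real models are replaced by plain functions passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Page:
    url: str
    text: str


@dataclass(frozen=True)
class Contact:
    name: str
    title: str


def run_pipeline(
    pages: Iterable[Page],
    score_page: Callable[[Page], float],             # stands in for the discovery model
    large_extractor: Callable[[Page], List[Contact]],    # stands in for the proprietary LLM
    compact_extractor: Callable[[Page], List[Contact]],  # stands in for the fine-tuned compact LLM
    score_contact: Callable[[Contact], float],       # stands in for the filtering model
    min_page_relevance: float = 0.5,     # assumed value
    large_model_relevance: float = 0.9,  # assumed value
    min_contact_score: float = 0.7,      # assumed value
) -> List[Contact]:
    """Discover relevant pages, extract contacts, and keep only high-scoring ones."""
    kept: List[Contact] = []
    for page in pages:
        relevance = score_page(page)
        if relevance < min_page_relevance:
            continue  # page is not a meaningful source, skip it
        # Route the most relevant pages to the larger model, the rest to the compact one.
        if relevance >= large_model_relevance:
            extractor = large_extractor
        else:
            extractor = compact_extractor
        for contact in extractor(page):
            # Only contacts scoring above the threshold reach the final output.
            if score_contact(contact) > min_contact_score:
                kept.append(contact)
    return kept
```

Passing the models in as callables keeps the routing and thresholding logic separate from any particular model, so the thresholds can be tuned without touching the extraction code.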

To train these models, we aggregated, cleaned, and annotated a high-volume dataset of corporate websites. To accelerate data labeling, we used a proprietary LLM to assist with preliminary annotations, which human experts then validated and improved.
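
The LLM-assisted labeling workflow described above can be pictured with a small Python sketch. It assumes, purely for illustration, that every example carries an LLM-proposed draft label and enters the training set only after a human reviewer has confirmed or corrected it. The names `Annotation`, `review`, and `training_set` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    page_url: str
    draft_label: str                      # preliminary label proposed by the LLM
    reviewed_label: Optional[str] = None  # set only after human validation


def review(annotation: Annotation, human_label: Optional[str] = None) -> Annotation:
    """Record the reviewer's decision: accept the draft or override it."""
    final = human_label if human_label is not None else annotation.draft_label
    return replace(annotation, reviewed_label=final)


def training_set(annotations: Iterable[Annotation]) -> List[Tuple[str, str]]:
    """Return (url, label) pairs for human-validated examples only."""
    return [
        (a.page_url, a.reviewed_label)
        for a in annotations
        if a.reviewed_label is not None
    ]
```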