We developed a multi-layered AI-driven contact extraction system with four core models and a supporting back end:
● Discovery model
This is a custom-built statistical ML model that searches for relevant webpages within corporate websites. It assigns a relevance score to each webpage, ensuring only meaningful sources are processed.
● Dual LLM field extraction pipeline
These LLMs extract personal details such as names, job titles, phone numbers, and LinkedIn profiles. For highly relevant, information-intensive sites, we integrated a state-of-the-art proprietary LLM for maximum accuracy. For all other websites, we use a compact LLM (<1B parameters) that our team fine-tuned to achieve near-optimal performance at a lower computational cost.
● Classification & filtering model
This is another ML model that we built from scratch. It filters out irrelevant contacts (e.g., guest lecturers, contractors, trainers) so that only permanent corporate employees are included. The model also assigns a relevance score to each contact, and only entries that score above a configurable threshold are included in the final output.
● Back end & infrastructure
We built a secure API to manage the interaction between the web crawler and AI models. Our team also adapted the client’s web crawler to scan the corporate websites that permit data sharing and to assist our AI models with data extraction.
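The way these stages hand work to one another can be illustrated with a minimal Python sketch. It is not the production code: the names `Page`, `Contact`, and `run_pipeline`, the three threshold values, and the idea of routing pages to the larger LLM purely by discovery score are all assumptions made for illustration. The models themselves are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Page:
    url: str
    text: str
    relevance: float  # discovery-model score, assumed here to lie in [0, 1]


@dataclass
class Contact:
    name: str
    job_title: str


# Hypothetical signatures: each model is represented by a plain function.
Extractor = Callable[[Page], List[Contact]]
ContactScorer = Callable[[Contact], float]


def run_pipeline(
    pages: Iterable[Page],
    large_extractor: Extractor,
    compact_extractor: Extractor,
    score_contact: ContactScorer,
    page_threshold: float = 0.5,     # assumed value: pages below this are skipped
    premium_threshold: float = 0.9,  # assumed value: pages at or above this use the larger LLM
    contact_threshold: float = 0.7,  # assumed value: contacts must score above this to be kept
) -> List[Contact]:
    """Chain discovery filtering, extractor routing, and contact filtering."""
    kept: List[Contact] = []
    for page in pages:
        # Stage 1: drop pages that the discovery model scored as irrelevant.
        if page.relevance < page_threshold:
            continue
        # Stage 2: route the page to one of the two extraction models.
        if page.relevance >= premium_threshold:
            extractor = large_extractor
        else:
            extractor = compact_extractor
        # Stage 3: keep only contacts whose score exceeds the threshold.
        for contact in extractor(page):
            if score_contact(contact) > contact_threshold:
                kept.append(contact)
    return kept
```

Because every threshold is an ordinary keyword argument, raising `contact_threshold` trades recall for precision without touching any of the models.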
To train these models, we aggregated, cleaned, and annotated a high-volume dataset of corporate websites. To accelerate data labeling, we used a proprietary LLM to assist with preliminary annotations, which human experts then validated and improved.
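The LLM-assisted labeling loop described above could look roughly like the following sketch. The names `Annotation`, `annotate`, and `correction_rate` are hypothetical, and both the LLM and the human reviewer are stood in for by plain functions; the actual tooling and label schema are not described in this write-up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Annotation:
    text: str
    label: str       # final label after human review
    corrected: bool  # True if the reviewer changed the LLM's draft label


def annotate(
    samples: Iterable[str],
    propose_label: Callable[[str], str],      # stand-in for the LLM pre-annotator
    review_label: Callable[[str, str], str],  # stand-in for the human reviewer
) -> List[Annotation]:
    """Draft a label with the LLM, then let a human confirm or overwrite it."""
    results: List[Annotation] = []
    for text in samples:
        draft = propose_label(text)
        final = review_label(text, draft)
        results.append(Annotation(text=text, label=final, corrected=(final != draft)))
    return results


def correction_rate(annotations: List[Annotation]) -> float:
    """Share of samples where the reviewer overrode the draft label."""
    if not annotations:
        return 0.0
    return sum(a.corrected for a in annotations) / len(annotations)
```

Tracking how often reviewers override the draft labels gives a rough signal of how much the preliminary annotations can be trusted on a given batch.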