
What Data Sources Actually Matter for Off-Market Deal Flow?

Jeff Baehr · Dec 2025 · 18 min read

Last updated March 29, 2026

The data sources that matter for off-market deal flow are primary sources: government filings, state regulatory databases, industry association directories, certification bodies, professional organizations, and individual company websites. Commercial databases recycle the same underlying records, meaning every buyer filtering them sees the same companies. Praxis Rock Advisors builds vertical-specific target universes from primary-source data for each engagement, identifying companies that are absent from commercial platforms, misclassified by them, or overlooked because no other buyer is looking where we look.

The Data Problem Behind the Deal Flow Problem

Off-market deal flow is a data problem: when every buyer relies on the same commercial databases, the "off-market" label becomes marketing, not reality.

Every private equity firm wants off-market deal flow. The term appears in virtually every fund's marketing materials, every LP presentation, and every conversation with prospective investors. Yet the industry's actual sourcing practices are remarkably homogeneous. The vast majority of PE firms source deals from the same commercial databases, the same intermediary networks, and the same conference circuits. The result is a market in which "off-market" has become a marketing term rather than a description of reality.

The root cause is data. The data sources that the industry relies on for deal sourcing determine the universe of companies that buyers can see. When every buyer relies on the same data sources, every buyer sees the same universe. This is why commercial databases produce the same opportunities for every subscriber. The companies outside that universe, the genuinely off-market targets, are invisible not because they are hiding but because no one is looking where they are.

Understanding which data sources actually matter for off-market deal flow requires understanding why the data sources the industry currently uses fail to deliver it, what primary-source data is and how it differs from aggregated data, and how to evaluate data quality in the context of deal origination.

Why Commercial Databases Fail to Deliver Off-Market Deal Flow

Commercial databases fail at off-market sourcing because they aggregate the same publicly available data, creating overlapping coverage, visibility bias toward larger companies, and stale records.

Commercial databases are valuable tools for organizing information about private companies. They are not, however, tools for finding off-market deals. The distinction is important and frequently misunderstood.

These platforms are aggregators. They compile information from publicly available sources, including company websites, LinkedIn profiles, news articles, SEC filings, and government records, and organize it into searchable formats. They apply classification algorithms to categorize companies by industry, geography, size, and other attributes. They provide filters that allow users to narrow the universe to companies matching specific criteria.

The value proposition is real: these platforms save significant time compared to conducting the same research manually. But the value proposition is convenience, not exclusivity. Every subscriber to a given platform has access to the same records. Every user can run the same searches. The output of any given search is functionally identical across users, because the underlying data is the same.

This creates three specific problems for buyers seeking off-market deal flow.

Overlap in coverage. The commercial databases draw from substantially overlapping source data. A company that appears in one platform is likely to appear in the others as well, because they all scrape the same web sources and license similar third-party datasets. The marginal value of subscribing to multiple platforms is lower than the platforms' sales teams suggest, because the incremental coverage of each additional subscription is modest.

Bias toward visibility. Commercial databases are built to serve the broadest possible customer base, which means they prioritize companies that are most likely to be searched for. Large companies, companies with active web presences, companies that have received venture or PE investment, and companies in well-defined industry categories are over-represented. Small companies, companies with minimal digital footprints, companies operating under holding structures or DBAs, and companies in niche verticals are systematically under-represented. The databases are optimized for the companies that are easiest to find, which are, by definition, the companies that are least likely to be off-market.

Stale and inaccurate records. Commercial databases are updated on commercial timelines, not regulatory ones. A company's revenue estimate may be based on data that is two or three years old. Its industry classification may reflect a business model it abandoned years ago. Its ownership information may not reflect a recent generational transition. Buyers relying on these records are making outreach decisions based on information that may be materially inaccurate.

None of these problems are the fault of the platforms themselves. They are inherent limitations of the aggregation model. Aggregators can only organize information that already exists in accessible, digital form. They cannot create information about companies that have no web presence, file no public documents, and operate entirely below the threshold of digital visibility.

What Primary-Source Data Means

Primary-source data is information created by the entity it describes or its regulatory body, including state licenses, federal permits, industry certifications, and association directories.

Primary-source data, in the context of deal origination, is information created by the entity it describes or by the regulatory body that oversees it. It is not compiled, interpreted, or filtered by an intermediary. It exists at the point of origin.

The primary sources most relevant to PE deal origination fall into several categories.

State business registrations. Every company that operates as a legal entity in the United States is registered with the secretary of state (or equivalent office) in its state of incorporation and, often, in every state where it conducts business. These registrations include the company's legal name, registered agent, formation date, and status. They do not include revenue or employee count, but they provide a verified record of every legal entity operating in a given state.

State regulatory and licensing databases. Companies operating in regulated industries must obtain licenses or permits from the relevant state agencies. A propane distribution company must hold a propane dealer license from the state fire marshal or equivalent authority. A home health agency must hold a license from the state department of health. An environmental remediation firm must hold permits from the state environmental agency. These databases identify every company authorized to operate in a given industry within a given state, regardless of its size, web presence, or inclusion in any commercial database.

Federal regulatory records. Federal agencies maintain records relevant to specific industries. The Department of Transportation maintains records of hazmat carriers. The Environmental Protection Agency maintains records of facilities with environmental permits. The Centers for Medicare and Medicaid Services maintains records of healthcare providers. These federal sources provide national coverage for industries within their jurisdiction.

Industry association directories. Trade associations and industry groups maintain membership directories that identify companies operating in their respective verticals. The National Propane Gas Association, the National Association for Home Care and Hospice, the Environmental Services Association, and hundreds of similar organizations maintain directories that are updated regularly and reflect current industry participation.

Certification body registries. Companies that hold industry-specific certifications are listed in the registries of the certifying bodies. ISO certifications, industry-specific quality certifications, safety certifications, and professional accreditations all generate registry records that identify certified companies.

Professional organization membership lists. Individual professionals who are members of industry organizations are often listed with their company affiliations. These records can identify companies that are not visible through any other channel, particularly small firms where the owner's professional memberships are the only external indicator of the company's existence.

The critical characteristic of all these sources is that they are comprehensive within their scope. A state licensing database is meant to list every company licensed to operate in that industry in that state. It is not a sample, an estimate, or a curated selection; within its scope, it is the full universe of license holders. This comprehensiveness is what makes primary-source data fundamentally different from commercial databases, which contain a curated and incomplete subset of the actual market.

Examples by Vertical

In fragmented verticals like propane distribution, healthcare services, and environmental services, primary-source data identifies two to five times as many targets as commercial databases.

The value of primary-source data varies by vertical, depending on the regulatory structure of the industry and the characteristics of the companies within it. Several examples illustrate the range.

Propane distribution. The propane distribution industry in the United States is highly fragmented, with an estimated 3,000 to 4,000 independent distributors operating alongside a handful of national platforms. Commercial databases typically identify 800 to 1,200 of these companies, primarily the larger operators with established web presences. State propane licensing databases, DOT hazmat carrier registrations, and state fire marshal permit records collectively identify the full universe of licensed operators. For a PE firm executing a propane roll-up strategy, the difference between 1,000 visible targets and 3,500 actual targets is the difference between competing for the same acquisitions as every other buyer and having access to thousands of targets that no other buyer has identified. Independent sponsors apply primary-source data to exactly this kind of fragmented vertical.

Healthcare services. Healthcare is among the most heavily regulated industries in the United States, which means the primary-source data infrastructure is extensive. State health department facility databases, CMS provider enrollment records, state professional licensing boards, and accreditation body registries collectively identify every licensed healthcare provider in the country. For verticals like home health, behavioral health, or dental practice management, these sources identify two to four times as many potential targets as commercial databases cover. The additional targets tend to be smaller, single-location operators, precisely the kind of add-on acquisitions that are most valuable in a healthcare platform strategy.

Environmental services. Environmental remediation, waste management, and related services are regulated at both the state and federal level. State environmental agency permit databases, EPA facility records, hazardous waste transporter registrations, and industry certification registries identify the full universe of companies authorized to operate in these verticals. Many environmental services companies operate under holding company structures or DBAs that obscure the nature of their business, making them particularly difficult for commercial databases to classify correctly. Primary-source data identifies them by what they are licensed to do, not by what their website says.

Specialty distribution. Distribution businesses across various verticals, from industrial supplies to food service to building materials, are often licensed or registered with industry-specific regulatory bodies. State weights and measures departments, commodity-specific licensing authorities, and industry association directories provide coverage that commercial databases lack, particularly for small, regional distributors that serve niche markets.

Financial services. Registered investment advisors, insurance agencies, mortgage brokers, and other financial services firms are licensed by state and federal regulators. SEC and FINRA databases, state insurance department records, and NMLS (Nationwide Multistate Licensing System) records provide comprehensive coverage of licensed financial services firms. These sources are particularly valuable for identifying small, independent firms that are candidates for roll-up strategies in wealth management, insurance distribution, or mortgage origination.

In each of these verticals, the pattern is the same: primary-source data identifies the complete universe of companies authorized to operate in the vertical, while commercial databases identify a subset that is biased toward larger, more visible operators. The gap between the two represents the actual off-market opportunity.
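The gap described above can be expressed as a simple set difference. The following is a minimal sketch, assuming each source has already been reduced to a set of normalized license identifiers. The function name `coverage_gap`, the license IDs, and both sample sets are hypothetical and stand in for real licensing and commercial records.

```python
def coverage_gap(licensed: set[str], commercial: set[str]) -> dict:
    """Compare a primary-source universe against commercial database coverage."""
    off_market = licensed - commercial   # licensed, but absent from the commercial set
    covered = licensed & commercial      # licensed and also indexed commercially
    return {
        "universe": len(licensed),
        "covered": len(covered),
        "off_market": sorted(off_market),
        "coverage_ratio": len(covered) / len(licensed) if licensed else 0.0,
    }


# Hypothetical license IDs, for illustration only.
licensed_operators = {"LP-001", "LP-002", "LP-003", "LP-004", "LP-005"}
commercial_listing = {"LP-002", "LP-005", "XX-900"}  # XX-900 is not licensed in this vertical

gap = coverage_gap(licensed_operators, commercial_listing)
print(gap["off_market"])
print(gap["coverage_ratio"])
```

In practice the hard part is producing comparable identifiers on both sides; the set arithmetic itself is trivial once records have been matched.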

How to Evaluate Data Quality for Deal Origination

Data quality for deal origination should be evaluated across four dimensions: comprehensiveness, accuracy, granularity, and timeliness, applied source by source rather than by category.

Not all data is equally useful for deal origination. The quality of a data source should be evaluated across four dimensions.

Comprehensiveness. Does the source cover the complete universe of relevant companies, or a subset? State licensing databases are comprehensive within their scope: they contain every licensed operator. Commercial databases are subsets: they contain the companies they have chosen to index. For deal origination, comprehensiveness is the most important quality dimension, because missing a target entirely is worse than having incomplete information about a target that has been identified.

Accuracy. Is the information in the source correct and current? Government regulatory databases are generally accurate because the information is provided by the companies themselves under penalty of law, and is updated on regulatory timelines. Commercial databases are less reliably accurate because they rely on web scraping and third-party data that may be outdated. Accuracy matters most for attributes that drive outreach decisions: is the company still operating, is it in the right vertical, and is it in the right geography?

Granularity. Does the source provide enough detail to evaluate the target against the buyer's criteria? State licensing databases provide verified information about what a company is licensed to do and where it operates, but they typically do not include revenue or employee count. Commercial databases provide estimated revenue and employee count, but these estimates may be inaccurate. The most effective approach combines primary-source data for identification and verification with operational proxies for size estimation: fleet size from DOT records, facility count from permit databases, or provider headcount from CMS enrollment records.

Timeliness. How frequently is the source updated? Regulatory databases are updated on regulatory timelines, which vary by source. Some are updated in real time as filings are processed. Others are updated quarterly or annually. Commercial databases are updated on commercial timelines, which may lag regulatory timelines by months or years. For deal origination, timeliness matters most for identifying trigger events: a new filing, a permit modification, or a change in ownership structure that signals a potential acquisition opportunity.

The evaluation framework should be applied source by source, not category by category. Not all government databases are equally comprehensive or timely. Not all commercial databases are equally inaccurate. The goal is to assemble a portfolio of sources that collectively provides comprehensive, accurate, granular, and timely coverage of the target vertical.
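One way to apply the four dimensions source by source is a simple weighted scorecard. This is a sketch under stated assumptions, not a prescribed method: the weights, the 0.0 to 1.0 scale, the `SourceScore` class, and the two example sources with their scores are all hypothetical. Comprehensiveness carries the largest weight only because the framework above treats it as the most important dimension.

```python
from dataclasses import dataclass

# Assumed weights (they sum to 1.0); the specific numbers are illustrative.
WEIGHTS = {
    "comprehensiveness": 0.40,
    "accuracy": 0.25,
    "granularity": 0.15,
    "timeliness": 0.20,
}


@dataclass
class SourceScore:
    name: str
    comprehensiveness: float  # each dimension is scored 0.0 to 1.0 by the analyst
    accuracy: float
    granularity: float
    timeliness: float

    def weighted(self) -> float:
        """Weighted sum of the four dimension scores."""
        return sum(getattr(self, dim) * weight for dim, weight in WEIGHTS.items())


def rank_sources(sources: list[SourceScore]) -> list[SourceScore]:
    """Order sources from highest to lowest weighted score."""
    return sorted(sources, key=lambda s: s.weighted(), reverse=True)


# Hypothetical scores for two individual sources, not for whole categories.
sources = [
    SourceScore("commercial_db", 0.4, 0.6, 0.8, 0.5),
    SourceScore("state_license_db", 0.9, 0.8, 0.4, 0.7),
]
for source in rank_sources(sources):
    print(f"{source.name}: {source.weighted():.2f}")
```

A scorecard like this is most useful for spotting complementary sources: one that scores low on granularity but high on comprehensiveness pairs naturally with one that has the opposite profile.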

Building a Primary-Source Data Strategy

PE firms can build primary-source data infrastructure internally, partner with a specialist, or adopt a hybrid approach based on their resources, deal volume, and technical capabilities.

For PE firms considering a primary-source data strategy for deal origination, the practical question is how to begin. The answer depends on the firm's resources, technical capabilities, and deal volume.

Option 1: Build internally. Firms with dedicated data science or technology teams can build primary-source data infrastructure in-house. This requires identifying the relevant sources for each target vertical, building extraction and normalization pipelines, developing entity resolution systems, and maintaining the infrastructure as sources change their formats and access methods. The advantage is full control over the data and the systems. The disadvantage is significant upfront investment and ongoing maintenance costs that are difficult to justify unless the firm's deal volume is high enough to amortize the investment across many engagements.

Option 2: Partner with a specialist. Firms that lack internal data science capabilities, or that prefer to allocate their resources to investment activities rather than data infrastructure, can partner with a specialized deal origination program that has already built the primary-source data infrastructure. Praxis Rock Advisors operates this model, building bespoke target universes from primary-source data for each client engagement. The advantage is immediate access to comprehensive primary-source coverage without the upfront investment. The disadvantage is that the data infrastructure is shared across the partner's client base, although the target universes built for each engagement are proprietary to the client.

Option 3: Hybrid approach. Some firms build internal capabilities for the verticals they target most frequently while partnering with specialists for new verticals or one-time searches. This approach balances control with efficiency and is well-suited to firms with a defined set of core verticals and periodic interest in adjacent sectors.

Regardless of the approach, the strategic imperative is the same: the data sources that define a firm's target universe determine the competitive dynamics of its deal sourcing. Firms that rely exclusively on commercial databases are competing for the same targets as every other subscriber. Firms that incorporate primary-source data into their sourcing process are accessing a materially larger and less competitive opportunity set.

The data sources that actually matter for off-market deal flow are the ones that no one else is using. In 2025, that means primary-source data. For a detailed look at how AI origination uses these sources, see the comparison with traditional buy-side advisory.


Frequently Asked Questions

Are government databases really accessible for deal sourcing purposes?

Yes. The vast majority of state and federal regulatory databases are public records, accessible to any person or entity. Many are available online through state agency websites, though the format, search functionality, and ease of bulk access vary significantly by state and agency. Some databases offer downloadable files or API access. Others require manual searches or formal public records requests. The challenge is not access but extraction: converting the raw data from hundreds of individual sources into a unified, searchable dataset requires significant technical infrastructure. This is the primary reason that commercial databases have not comprehensively incorporated government regulatory data. The extraction and normalization process is labor-intensive and does not scale in the way that web scraping does. For firms or partners willing to invest in this infrastructure, the data is available and the coverage it provides is substantially broader than what commercial platforms offer.
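The extraction challenge is largely one of normalization: each agency publishes its own column names and formatting. The sketch below shows the idea for two sources, assuming the raw rows have already been downloaded. The source labels `state_a` and `state_b`, the field names, and the sample rows are hypothetical; real state exports use their own layouts.

```python
# Hypothetical mappings from each source's column names to a unified schema.
FIELD_MAPS = {
    "state_a": {"LicenseeName": "name", "LicNo": "license_id", "City": "city"},
    "state_b": {"business_name": "name", "permit_number": "license_id", "municipality": "city"},
}


def normalize_record(source: str, raw: dict) -> dict:
    """Map one raw row onto the unified schema and clean its formatting."""
    mapping = FIELD_MAPS[source]
    record = {unified: str(raw.get(src, "")).strip() for src, unified in mapping.items()}
    record["name"] = " ".join(record["name"].upper().split())  # uppercase, collapse whitespace
    record["city"] = record["city"].title()
    record["source"] = source  # keep provenance for later verification
    return record


rows = [
    ("state_a", {"LicenseeName": "  acme propane,  llc ", "LicNo": "A-1001", "City": "SPRINGFIELD"}),
    ("state_b", {"business_name": "Valley Gas Co.", "permit_number": 7731, "municipality": "dayton"}),
]
unified = [normalize_record(source, raw) for source, raw in rows]
for record in unified:
    print(record)
```

Every additional source needs its own mapping and its own cleanup rules, which is why the effort grows with the number of agencies rather than with the number of companies.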

How many additional targets does primary-source data typically surface compared to commercial databases?

The ratio varies by vertical and geography, but in fragmented verticals with strong regulatory frameworks, primary-source data typically identifies two to five times as many potential targets as the leading commercial databases. In propane distribution, for example, state licensing databases identify approximately 3,500 licensed operators nationwide, while commercial databases typically cover 800 to 1,200. In home health services, CMS provider enrollment records identify approximately 11,000 certified agencies, while commercial databases cover 4,000 to 6,000. The additional targets surfaced by primary-source data tend to be smaller, more niche, and less visible, which is precisely why they represent genuine off-market opportunities. They are the companies that no other buyer has identified because no other buyer is looking in the data sources where they appear.

Does primary-source data include financial information about target companies?

Generally, no. Government regulatory databases contain information about what a company is licensed to do, where it operates, and its compliance history, but they do not contain revenue, EBITDA, or other financial metrics. Financial information for private companies must be estimated using operational proxies derived from primary-source data and supplemented by other sources. For example, a propane distributor's revenue can be estimated from its fleet size (derived from DOT records), the number of locations it operates (derived from state permits), and industry benchmarks for revenue per truck or revenue per location. A home health agency's revenue can be estimated from its Medicare claims data (available through CMS) and its service area. These estimates are not precise, but they are sufficient to prioritize targets for outreach and to identify companies that fall within the buyer's size criteria. Precise financial information is obtained through direct engagement with the target, which is the appropriate stage for that level of detail.
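The proxy arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, and the per-truck and per-location benchmark values are placeholders chosen to make the arithmetic easy to follow, not actual industry figures.

```python
def estimate_revenue(trucks: int, locations: int,
                     revenue_per_truck: float,
                     revenue_per_location: float) -> tuple[float, float]:
    """Return a (low, high) revenue range from two independent operational proxies."""
    by_fleet = trucks * revenue_per_truck
    by_sites = locations * revenue_per_location
    return (min(by_fleet, by_sites), max(by_fleet, by_sites))


def in_size_range(estimate: tuple[float, float],
                  min_revenue: float, max_revenue: float) -> bool:
    """True if the estimated range overlaps the buyer's size criteria."""
    low, high = estimate
    return high >= min_revenue and low <= max_revenue


# Placeholder benchmarks, NOT real industry figures.
REVENUE_PER_TRUCK = 500_000.0
REVENUE_PER_LOCATION = 2_000_000.0

estimate = estimate_revenue(trucks=8, locations=3,
                            revenue_per_truck=REVENUE_PER_TRUCK,
                            revenue_per_location=REVENUE_PER_LOCATION)
print(estimate)
print(in_size_range(estimate, min_revenue=3_000_000, max_revenue=10_000_000))
```

Returning a range rather than a single number keeps the imprecision visible: when the two proxies disagree widely, that is itself a signal that the target needs a closer look before outreach.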

How does primary-source data handle companies that operate across multiple states?

Multi-state operators present an entity resolution challenge that is one of the most technically demanding aspects of primary-source deal origination. A company operating in five states may appear in five different state licensing databases under slightly different names, different addresses, and different entity structures. The entity resolution process matches these records by analyzing multiple attributes, including legal entity name, registered agent, principal address, officer names, and operational characteristics, to determine which records refer to the same underlying company. AI-driven entity resolution systems can perform this matching at scale with high accuracy, though human review is typically required for ambiguous cases. The result is a unified profile of each target that reflects its full geographic footprint, not just its presence in a single state's database. This multi-state view is particularly valuable for buyers seeking targets with regional or national operations, as it reveals the true scope of companies that may appear small in any single state's records.
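A heavily simplified version of this matching can be sketched with fuzzy name comparison plus one corroborating attribute. This is a rule-based stand-in using Python's standard `difflib`, not the AI-driven approach described above, and it looks at only two of the attributes listed (entity name and registered agent). The thresholds, suffix list, and sample records are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Common legal suffixes to ignore when comparing names (assumed, incomplete list).
SUFFIXES = re.compile(r"\b(LLC|INC|CORP|CO|COMPANY|LTD)\b")


def clean_name(name: str) -> str:
    """Uppercase, drop punctuation and legal suffixes, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.upper())
    name = SUFFIXES.sub(" ", name)
    return " ".join(name.split())


def same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Strong name match, or a moderate one backed by a shared registered agent."""
    score = SequenceMatcher(None, clean_name(a["name"]), clean_name(b["name"])).ratio()
    shared_agent = a["agent"].upper() == b["agent"].upper()
    return score >= threshold or (score >= 0.6 and shared_agent)


def resolve(records: list[dict]) -> list[list[dict]]:
    """Greedy clustering: attach each record to the first cluster it matches."""
    clusters: list[list[dict]] = []
    for rec in records:
        for cluster in clusters:
            if any(same_entity(rec, other) for other in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Hypothetical records from three state licensing databases.
records = [
    {"state": "OH", "name": "Acme Propane, LLC", "agent": "J. Smith"},
    {"state": "IN", "name": "ACME PROPANE INC.", "agent": "J. Smith"},
    {"state": "KY", "name": "Acme Propane Co of Kentucky", "agent": "J. Smith"},
    {"state": "OH", "name": "Valley Gas Company", "agent": "R. Jones"},
]
clusters = resolve(records)
for cluster in clusters:
    print([r["state"] for r in cluster], clean_name(cluster[0]["name"]))
```

Matches that clear only the lower threshold, such as the Kentucky record here, are the kind of ambiguous case that would normally be routed to human review rather than merged automatically.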

What is the most common reason a company is missing from commercial databases?

The most common reason is the absence of a meaningful digital footprint. Commercial databases rely heavily on web scraping to identify and classify companies. A company that does not have a website, does not maintain a LinkedIn page, has not been mentioned in any news article, and has not received any form of institutional investment is effectively invisible to the web-scraping infrastructure that commercial platforms depend on. This is not a rare occurrence. In fragmented verticals with many small, owner-operated businesses, a significant portion of the market has minimal or no digital presence. These companies operate through word-of-mouth referrals, long-standing customer relationships, and local reputation rather than through digital marketing. They are licensed, they are operating, and they may be generating millions of dollars in revenue, but they do not exist in the digital records that commercial databases are designed to index. Primary-source data identifies these companies through their regulatory filings and industry affiliations, which exist regardless of their digital presence.

