The Data Problem Behind the Deal Flow Problem
Off-market deal flow is a data problem: when every buyer relies on the same commercial databases, the "off-market" label becomes marketing, not reality.
Every private equity firm wants off-market deal flow. The term appears in virtually every fund's marketing materials, every LP presentation, and every conversation with prospective investors. Yet the industry's actual sourcing practices are remarkably homogeneous. The vast majority of PE firms source deals from the same commercial databases, the same intermediary networks, and the same conference circuits. The result is a market in which "off-market" has become a marketing term rather than a description of reality.
The root cause is data. The data sources that the industry relies on for deal sourcing determine the universe of companies that buyers can see. When every buyer relies on the same data sources, every buyer sees the same universe. This is why commercial databases produce the same opportunities for every subscriber. The companies outside that universe, the genuinely off-market targets, are invisible not because they are hiding but because no one is looking where they are.
Identifying the data sources that actually matter for off-market deal flow requires three things: understanding why the sources the industry currently uses fail to deliver it, understanding what primary-source data is and how it differs from aggregated data, and knowing how to evaluate data quality in the context of deal origination.
Why Commercial Databases Fail to Deliver Off-Market Deal Flow
Commercial databases fail at off-market sourcing because they aggregate the same publicly available data, creating overlapping coverage, visibility bias toward larger companies, and stale records.
Commercial databases are valuable tools for organizing information about private companies. They are not, however, tools for finding off-market deals. The distinction is important and frequently misunderstood.
These platforms are aggregators. They compile information from publicly available sources, including company websites, LinkedIn profiles, news articles, SEC filings, and government records, and organize it into searchable formats. They apply classification algorithms to categorize companies by industry, geography, size, and other attributes. They provide filters that allow users to narrow the universe to companies matching specific criteria.
The value proposition is real: these platforms save significant time compared to conducting the same research manually. But the value proposition is convenience, not exclusivity. Every subscriber to a given platform has access to the same records. Every user can run the same searches. The output of any given search is functionally identical across users, because the underlying data is the same.
This creates three specific problems for buyers seeking off-market deal flow.
Overlap in coverage. The commercial databases draw from substantially overlapping source data. A company that appears in one platform is likely to appear in the others as well, because they all scrape the same web sources and license similar third-party datasets. The marginal value of subscribing to multiple platforms is lower than the platforms' sales teams suggest, because the incremental coverage of each additional subscription is modest.
Bias toward visibility. Commercial databases are built to serve the broadest possible customer base, which means they prioritize companies that are most likely to be searched for. Large companies, companies with active web presences, companies that have received venture or PE investment, and companies in well-defined industry categories are over-represented. Small companies, companies with minimal digital footprints, companies operating under holding structures or DBAs, and companies in niche verticals are systematically under-represented. The databases are optimized for the companies that are easiest to find, and the companies that are easiest to find are the ones least likely to be off-market.
Stale and inaccurate records. Commercial databases are updated on commercial timelines, not regulatory ones. A company's revenue estimate may be based on data that is two or three years old. Its industry classification may reflect a business model it abandoned years ago. Its ownership information may not reflect a recent generational transition. Buyers relying on these records are making outreach decisions based on information that may be materially inaccurate.
None of these problems are the fault of the platforms themselves. They are inherent limitations of the aggregation model. Aggregators can only organize information that already exists in accessible, digital form. They cannot create information about companies that have no web presence, file no public documents, and operate entirely below the threshold of digital visibility.
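The overlap problem can be made concrete with a small calculation. The sketch below is illustrative only: the platform names and company lists are invented, and it assumes company names have already been matched across platforms. It counts how many companies each additional subscription contributes that were not already covered.

```python
def incremental_coverage(platforms):
    """For each platform, in subscription order, return
    (name, total companies, companies not seen on earlier platforms)."""
    seen = set()
    rows = []
    for name, companies in platforms.items():
        new = companies - seen
        rows.append((name, len(companies), len(new)))
        seen |= companies
    return rows


# Hypothetical coverage sets, invented for illustration.
platforms = {
    "platform_a": {"acme propane", "blue ridge gas", "cedar fuel", "delta energy"},
    "platform_b": {"acme propane", "blue ridge gas", "cedar fuel", "eagle lp"},
    "platform_c": {"acme propane", "cedar fuel", "delta energy", "eagle lp"},
}

for name, total, new in incremental_coverage(platforms):
    print(f"{name}: {total} companies, {new} not already covered")
```

In this made-up example the second subscription adds one new company and the third adds none, which is the pattern the overlap argument predicts when platforms draw on the same underlying sources.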
What Primary-Source Data Means
Primary-source data is information created by the entity it describes or its regulatory body, including state licenses, federal permits, industry certifications, and association directories.
Primary-source data, in the context of deal origination, is information created by the entity it describes or by the regulatory body that oversees it. It is not compiled, interpreted, or filtered by an intermediary. It exists at the point of origin.
The primary sources most relevant to PE deal origination fall into several categories.
State business registrations. Every company that operates as a legal entity in the United States is registered with the secretary of state (or equivalent office) in its state of formation and, often, in every other state where it conducts business. These registrations include the company's legal name, registered agent, formation date, and status. They do not include revenue or employee count, but they provide a verified record of every legal entity registered in a given state.
State regulatory and licensing databases. Companies operating in regulated industries must obtain licenses or permits from the relevant state agencies. A propane distribution company must hold a propane dealer license from the state fire marshal or equivalent authority. A home health agency must hold a license from the state department of health. An environmental remediation firm must hold permits from the state environmental agency. These databases identify every company authorized to operate in a given industry within a given state, regardless of its size, web presence, or inclusion in any commercial database.
Federal regulatory records. Federal agencies maintain records relevant to specific industries. The Department of Transportation maintains records of hazmat carriers. The Environmental Protection Agency maintains records of facilities with environmental permits. The Centers for Medicare and Medicaid Services maintains records of healthcare providers. These federal sources provide national coverage for industries within their jurisdiction.
Industry association directories. Trade associations and industry groups maintain membership directories that identify companies operating in their respective verticals. The National Propane Gas Association, the National Association for Home Care and Hospice, the Environmental Services Association, and hundreds of similar organizations maintain directories that are updated regularly and reflect current industry participation.
Certification body registries. Companies that hold industry-specific certifications are listed in the registries of the certifying bodies. ISO certifications, industry-specific quality certifications, safety certifications, and professional accreditations all generate registry records that identify certified companies.
Professional organization membership lists. Individual professionals who are members of industry organizations are often listed with their company affiliations. These records can identify companies that are not visible through any other channel, particularly small firms where the owner's professional memberships are the only external indicator of the company's existence.
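The critical characteristic of the regulatory sources on this list is that they are comprehensive within their scope. A state licensing database contains every company licensed to operate in that industry in that state. It does not contain a sample, an estimate, or a curated selection. It contains the complete universe of license holders. Association directories, certification registries, and membership lists are voluntary, so they are less complete, but they can surface companies that are not visible through any other channel. The comprehensiveness of the regulatory record is what makes primary-source data fundamentally different from commercial databases, which contain a curated and incomplete subset of the actual market.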
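One way to picture a primary-source universe is as a set of normalized license records filtered by vertical and state. The sketch below is a minimal illustration, not a description of any real registry's format: the `LicenseRecord` fields, the license type labels, and the sample companies are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseRecord:
    """Hypothetical normalized record; real registries use their own fields."""
    legal_name: str
    state: str         # two-letter state code
    license_type: str  # illustrative label, e.g. "propane_dealer"
    status: str        # illustrative label, e.g. "active" or "expired"
    source: str        # registry the record was taken from


def build_universe(records, license_type, states):
    """Return active license holders for one vertical, grouped by state."""
    universe = {state: [] for state in states}
    for rec in records:
        if (rec.license_type == license_type
                and rec.status == "active"
                and rec.state in universe):
            universe[rec.state].append(rec.legal_name)
    # De-duplicate and sort so the output is stable.
    return {state: sorted(set(names)) for state, names in universe.items()}


# Invented sample records.
records = [
    LicenseRecord("Blue Ridge Propane LLC", "VA", "propane_dealer", "active", "state_fire_marshal"),
    LicenseRecord("Cedar Valley Fuel Co", "VA", "propane_dealer", "expired", "state_fire_marshal"),
    LicenseRecord("Delta Home Health Inc", "VA", "home_health", "active", "state_health_dept"),
    LicenseRecord("Eagle LP Gas Inc", "NC", "propane_dealer", "active", "state_fire_marshal"),
]

universe = build_universe(records, "propane_dealer", ["VA", "NC"])
print(universe)
```

The point of the structure is that inclusion depends only on the license record itself, not on whether the company has a website or appears in any commercial index.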
Examples by Vertical
In fragmented verticals like propane distribution, healthcare services, and environmental services, primary-source data identifies roughly two to five times as many targets as commercial databases.
The value of primary-source data varies by vertical, depending on the regulatory structure of the industry and the characteristics of the companies within it. Several examples illustrate the range.
Propane distribution. The propane distribution industry in the United States is highly fragmented, with an estimated 3,000 to 4,000 independent distributors operating alongside a handful of national platforms. Commercial databases typically identify 800 to 1,200 of these companies, primarily the larger operators with established web presences. State propane licensing databases, DOT hazmat carrier registrations, and state fire marshal permit records collectively identify the full universe of licensed operators. For a PE firm executing a propane roll-up strategy, the difference between 1,000 visible targets and 3,500 actual targets is the difference between competing for the same acquisitions as every other buyer and having access to roughly 2,500 additional targets that buyers relying on commercial databases cannot see. Independent sponsors apply primary-source data to exactly this kind of fragmented vertical.
Healthcare services. Healthcare is among the most heavily regulated industries in the United States, which means the primary-source data infrastructure is extensive. State health department facility databases, CMS provider enrollment records, state professional licensing boards, and accreditation body registries collectively identify every licensed healthcare provider in the country. For verticals like home health, behavioral health, or dental practice management, these sources identify two to four times as many potential targets as commercial databases cover. The additional targets tend to be smaller, single-location operators, precisely the kind of add-on acquisitions that are most valuable in a healthcare platform strategy.
Environmental services. Environmental remediation, waste management, and related services are regulated at both the state and federal level. State environmental agency permit databases, EPA facility records, hazardous waste transporter registrations, and industry certification registries identify the full universe of companies authorized to operate in these verticals. Many environmental services companies operate under holding company structures or DBAs that obscure the nature of their business, making them particularly difficult for commercial databases to classify correctly. Primary-source data identifies them by what they are licensed to do, not by what their website says.
Specialty distribution. Distribution businesses across various verticals, from industrial supplies to food service to building materials, are often licensed or registered with industry-specific regulatory bodies. State weights and measures departments, commodity-specific licensing authorities, and industry association directories provide coverage that commercial databases lack, particularly for small, regional distributors that serve niche markets.
Financial services. Registered investment advisers, insurance agencies, mortgage brokers, and other financial services firms are licensed by state and federal regulators. SEC and FINRA databases, state insurance department records, and NMLS (Nationwide Multistate Licensing System) records provide comprehensive coverage of licensed financial services firms. These sources are particularly valuable for identifying small, independent firms that are candidates for roll-up strategies in wealth management, insurance distribution, or mortgage origination.
In each of these verticals, the pattern is the same: primary-source data identifies the complete universe of companies authorized to operate in the vertical, while commercial databases identify a subset that is biased toward larger, more visible operators. The gap between the two represents the actual off-market opportunity.
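The gap described here is a set difference, and the coverage multiple is a simple ratio. The sketch below assumes each company has already been resolved to a shared identifier (the `co-...` IDs are invented), then applies the ratio to the propane estimates quoted earlier in this section.

```python
def off_market_gap(primary, commercial):
    """Companies present in primary sources but absent from commercial databases."""
    return primary - commercial


def coverage_multiple(primary_count, commercial_count):
    """How many times larger the primary-source universe is."""
    return primary_count / commercial_count


# Hypothetical resolved company identifiers.
primary = {"co-001", "co-002", "co-003", "co-004", "co-005"}
commercial = {"co-001", "co-004"}

gap = off_market_gap(primary, commercial)
print(sorted(gap))  # ['co-002', 'co-003', 'co-005']

# Propane estimates from the text: 3,000 to 4,000 distributors in total,
# 800 to 1,200 of them visible in commercial databases.
print(coverage_multiple(3500, 1000))  # 3.5  (midpoint case, gap of 2,500)
print(coverage_multiple(3000, 1200))  # 2.5  (low end)
print(coverage_multiple(4000, 800))   # 5.0  (high end)
```

On those estimates, the propane universe is between 2.5 and 5 times the size of what commercial databases show, with the midpoint case leaving about 2,500 companies outside commercial coverage.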
How to Evaluate Data Quality for Deal Origination
Data quality for deal origination should be evaluated across four dimensions: comprehensiveness, accuracy, granularity, and timeliness, applied source by source rather than by category.
Not all data is equally useful for deal origination. The quality of a data source should be evaluated across four dimensions.
Comprehensiveness. Does the source cover the complete universe of relevant companies, or a subset? State licensing databases are comprehensive within their scope: they contain every licensed operator. Commercial databases are subsets: they contain the companies they have chosen to index. For deal origination, comprehensiveness is the most important quality dimension, because missing a target entirely is worse than having incomplete information about a target that has been identified.
Accuracy. Is the information in the source correct and current? Government regulatory databases are generally accurate because the information is provided by the companies themselves under penalty of law, and is updated on regulatory timelines. Commercial databases are less reliably accurate because they rely on web scraping and third-party data that may be outdated. Accuracy matters most for attributes that drive outreach decisions: is the company still operating, is it in the right vertical, and is it in the right geography?
Granularity. Does the source provide enough detail to evaluate the target against the buyer's criteria? State licensing databases provide verified information about what a company is licensed to do and where it operates, but they typically do not include revenue or employee count. Commercial databases provide estimated revenue and employee count, but these estimates may be inaccurate. The most effective approach combines primary-source data for identification and verification with operational proxies for size estimation: fleet size from DOT records, facility count from permit databases, or provider headcount from CMS enrollment records.
Timeliness. How frequently is the source updated? Regulatory databases are updated on regulatory timelines, which vary by source. Some are updated in real time as filings are processed. Others are updated quarterly or annually. Commercial databases are updated on commercial timelines, which may lag regulatory timelines by months or years. For deal origination, timeliness matters most for identifying trigger events: a new filing, a permit modification, or a change in ownership structure that signals a potential acquisition opportunity.
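Trigger events of this kind can be detected by comparing two snapshots of the same registry. The sketch below is a simplified illustration: the permit numbers, the `permit_scope` and `owner` fields, and the event labels are all hypothetical, and a real registry would need its own field mapping.

```python
def detect_triggers(previous, current):
    """Compare two registry snapshots keyed by permit number.

    Returns a sorted list of (permit_number, event) pairs.
    """
    events = []
    for permit, rec in current.items():
        old = previous.get(permit)
        if old is None:
            events.append((permit, "new_filing"))
            continue
        if rec["permit_scope"] != old["permit_scope"]:
            events.append((permit, "permit_modification"))
        if rec["owner"] != old["owner"]:
            events.append((permit, "ownership_change"))
    # Permits that disappeared between snapshots are also worth a look.
    for permit in previous.keys() - current.keys():
        events.append((permit, "no_longer_listed"))
    return sorted(events)


# Invented snapshots of a hypothetical state permit registry.
previous = {
    "P-100": {"permit_scope": "retail", "owner": "J. Smith"},
    "P-101": {"permit_scope": "retail", "owner": "A. Lee"},
    "P-102": {"permit_scope": "bulk", "owner": "M. Ortiz"},
}
current = {
    "P-100": {"permit_scope": "retail+bulk", "owner": "J. Smith"},
    "P-101": {"permit_scope": "retail", "owner": "Lee Family Trust"},
    "P-103": {"permit_scope": "retail", "owner": "R. Diaz"},
}

for permit, event in detect_triggers(previous, current):
    print(permit, event)
```

How often this comparison can usefully be run depends on the update cadence of the source, which is why timeliness has to be assessed registry by registry.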
The evaluation framework should be applied source by source, not category by category. Not all government databases are equally comprehensive or timely. Not all commercial databases are equally inaccurate. The goal is to assemble a portfolio of sources that collectively provides comprehensive, accurate, granular, and timely coverage of the target vertical.
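A source-by-source evaluation can be recorded as a simple scorecard. The sketch below is one possible way to do it, not a standard method: the 1 to 5 scale, the threshold, the rule of taking the best score per dimension, and every score shown are assumptions chosen for illustration.

```python
from dataclasses import dataclass

DIMENSIONS = ("comprehensiveness", "accuracy", "granularity", "timeliness")


@dataclass
class SourceScore:
    """Scores on an assumed scale of 1 (poor) to 5 (strong)."""
    name: str
    comprehensiveness: int
    accuracy: int
    granularity: int
    timeliness: int


def portfolio_coverage(sources):
    """Best available score on each dimension across the whole portfolio."""
    return {d: max(getattr(s, d) for s in sources) for d in DIMENSIONS}


def weak_dimensions(sources, threshold=4):
    """Dimensions on which no source in the portfolio reaches the threshold."""
    coverage = portfolio_coverage(sources)
    return [d for d in DIMENSIONS if coverage[d] < threshold]


# Illustrative scores only; real scores come from reviewing each source.
sources = [
    SourceScore("state_license_db", 5, 4, 2, 3),
    SourceScore("commercial_db", 2, 2, 4, 2),
    SourceScore("dot_records", 4, 4, 3, 3),
]

print(portfolio_coverage(sources))
print(weak_dimensions(sources))
```

Taking the best score per dimension is a deliberate simplification. It reflects the idea that sources complement each other, and in this example it shows that the portfolio still needs a more timely source.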
Building a Primary-Source Data Strategy
PE firms can build primary-source data infrastructure internally, partner with a specialist, or adopt a hybrid approach based on their resources, deal volume, and technical capabilities.
For PE firms considering a primary-source data strategy for deal origination, the practical question is how to begin. The answer depends on the firm's resources, technical capabilities, and deal volume.
Option 1: Build internally. Firms with dedicated data science or technology teams can build primary-source data infrastructure in-house. This requires identifying the relevant sources for each target vertical, building extraction and normalization pipelines, developing entity resolution systems, and maintaining the infrastructure as sources change their formats and access methods. The advantage is full control over the data and the systems. The disadvantage is significant upfront investment and ongoing maintenance costs that are difficult to justify unless the firm's deal volume is high enough to amortize the investment across many engagements.
Option 2: Partner with a specialist. Firms that lack internal data science capabilities, or that prefer to allocate their resources to investment activities rather than data infrastructure, can partner with a specialized deal origination program that has already built the primary-source data infrastructure. Praxis Rock Advisors operates this model, building bespoke target universes from primary-source data for each client engagement. The advantage is immediate access to comprehensive primary-source coverage without the upfront investment. The disadvantage is that the data infrastructure is shared across the partner's client base, although the target universes built for each engagement are proprietary to the client.
Option 3: Hybrid approach. Some firms build internal capabilities for the verticals they target most frequently while partnering with specialists for new verticals or one-time searches. This approach balances control with efficiency and is well-suited to firms with a defined set of core verticals and periodic interest in adjacent sectors.
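Whichever option a firm chooses, entity resolution is the step that turns records from many sources into one list of companies. The sketch below is a deliberately crude illustration using only the Python standard library: the suffix list, the 0.9 similarity threshold, and the sample names are assumptions, and production systems typically also compare addresses, license numbers, and DBA filings.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "llc", "co", "company", "corp", "corporation", "ltd"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def same_entity(a, b, threshold=0.9):
    """Treat two names as one company if their normalized forms are near-identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def group_records(names):
    """Greedy grouping: attach each name to the first group it matches."""
    groups = []
    for name in names:
        for group in groups:
            if same_entity(name, group[0]):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Invented names as they might appear across different registries.
names = [
    "Blue Ridge Propane, Inc.",
    "BLUE RIDGE PROPANE LLC",
    "Cedar Valley Fuel Co.",
    "Blue Ridge Propane Company",
]

for group in group_records(names):
    print(group)
```

Name matching alone produces both false merges and false splits, which is part of why the text describes entity resolution as a system that has to be developed and maintained rather than a one-time script.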
Regardless of the approach, the strategic imperative is the same: the data sources that define a firm's target universe determine the competitive dynamics of its deal sourcing. Firms that rely exclusively on commercial databases are competing for the same targets as every other subscriber. Firms that incorporate primary-source data into their sourcing process are accessing a materially larger and less competitive opportunity set.
The data sources that actually matter for off-market deal flow are the ones that no one else is using. In 2025, that means primary-source data. For a detailed look at how AI origination uses these sources, see the comparison with traditional buyside advisory.