
Profiles Methodology


Emsi Profile Analytics is built from individual profiles of over a hundred million workers in the United States. Typical fields available are city/state/nation of residence, job history, education history, and skills. Many profiles also contain names, phone numbers, and email addresses, but these are not made available in bulk to Emsi users.

Profiles data can provide unprecedented levels of detail for labor market analytics, especially with regard to worker skill-sets, career paths, company-level human capital, school-level alumni outcomes, and more.

Data Sources

For proprietary and confidentiality/non-disclosure reasons, Emsi cannot provide a detailed list of data sources. However, via our small number of main sources, Emsi has aggregate access to a multitude of secondary sources, some of which provide thousands of profiles, and others millions. Emsi’s sources provide access to profiles found in the following types of secondary sources:

  • Publicly available information from the web
  • Third-party resume databases and job boards
  • The recruiting industry
  • Opt-in data from employers and applicant tracking systems (ATSs)
  • Sales and marketing CRM databases
  • Various consumer/identity databases

Our primary sources aggregate multiple other sources, and don’t always break down their own sources. Because of this, and for confidentiality reasons, we do not provide an exact source count. The total count of our deduplicated profiles (~130 million as of late 2019) is a much better measure of coverage than a count of sources.

Processing Profiles


First, data from all sources is standardized to a common format. This is necessary to simplify processing and enable deduplication across sources. When a source does not supply a field that other sources provide, that field is marked as missing for profiles from that source.
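As an illustration only, the standardization step could be sketched in Python as follows. The common field names, the source names, and the per-source field maps are all hypothetical; Emsi's actual schema is not described here.

```python
from typing import Any, Dict, Optional

# Hypothetical common schema; these field names are assumptions for illustration.
COMMON_FIELDS = ("name", "email", "phone", "location", "jobs", "education", "skills")

# Hypothetical mappings from each source's field names to the common names.
SOURCE_FIELD_MAPS = {
    "source_a": {"full_name": "name", "email_address": "email", "city_state": "location"},
    "source_b": {"name": "name", "phone_number": "phone", "work_history": "jobs"},
}


def standardize(source: str, record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Map one raw source record onto the common schema."""
    field_map = SOURCE_FIELD_MAPS[source]
    # Every common field starts out as missing (None).
    standardized: Dict[str, Optional[Any]] = {field: None for field in COMMON_FIELDS}
    for source_field, value in record.items():
        common_field = field_map.get(source_field)
        if common_field is not None:
            standardized[common_field] = value
    # Keep track of where the record came from.
    standardized["source"] = source
    return standardized
```

Fields that a source does not supply stay as `None`, so later steps can tell "missing" apart from "empty".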


We match profiles with each other using several strongly identifying personal fields or combinations of fields, such as name/email, name/phone, and online URL(s). Matched profiles are collected into duplicate groups, each representing a single person. Our matching method prioritizes accurate matches over finding every single duplicate. Sometimes one of our data sources has already attempted deduplication; in those cases we still treat the resulting profile as a potential duplicate of other profiles. The result is a set of groups of interconnected profiles that we have determined to be duplicates.

Occasionally, due to bad source data, this process produces a duplicate group containing hundreds of source profiles. We enforce a maximum acceptable group size and discard the data from groups that exceed it, but this affects only a tiny fraction of source profiles.
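One common way to build groups of interconnected duplicates is a union-find (disjoint-set) structure keyed on shared identifiers. The sketch below is an assumption about how such grouping could work, not a description of Emsi's implementation; the field names and the `MAX_GROUP_SIZE` value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

MAX_GROUP_SIZE = 100  # hypothetical threshold; the actual value is not stated


def match_keys(profile: dict) -> List[tuple]:
    """Build strongly identifying keys such as name/email and name/phone."""
    keys = []
    name = (profile.get("name") or "").strip().lower()
    for field in ("email", "phone"):
        value = (profile.get(field) or "").strip().lower()
        if name and value:
            keys.append((field, name, value))
    for url in profile.get("urls") or []:
        keys.append(("url", url.strip().lower()))
    return keys


def group_duplicates(profiles: List[dict], max_group_size: int = MAX_GROUP_SIZE) -> List[List[int]]:
    """Return groups of profile indexes that share at least one key, directly or transitively."""
    parent = list(range(len(profiles)))

    def find(i: int) -> int:
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: Dict[tuple, int] = {}
    for index, profile in enumerate(profiles):
        for key in match_keys(profile):
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    groups = defaultdict(list)
    for index in range(len(profiles)):
        groups[find(index)].append(index)
    # Discard oversized groups, which usually indicate bad source data.
    return [members for members in groups.values() if len(members) <= max_group_size]
```

Matching only on exact, strongly identifying keys reflects the stated preference for accurate matches over finding every duplicate.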

Field Merging

Each duplicate profile group represents a single person’s “master” profile. Within each group, we then need to merge the data for each field across the multiple source profiles. We use customized similarity functions for each field to determine whether a location, job, or degree is a duplicate of one already seen, then merge them together into a single sequence per master profile.
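To make the idea concrete, here is a minimal sketch of merging job histories with a similarity function. It uses Python's standard `difflib.SequenceMatcher` as a stand-in; the 0.85 threshold, the job field names, and the rule that start years must agree are assumptions, not Emsi's actual similarity functions.

```python
from difflib import SequenceMatcher
from typing import List


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two strings as the same if their similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def same_job(a: dict, b: dict) -> bool:
    """Hypothetical job similarity: same start year, similar company and title."""
    return (
        a.get("start_year") == b.get("start_year")
        and similar(a["company"], b["company"])
        and similar(a["title"], b["title"])
    )


def merge_jobs(source_job_lists: List[List[dict]]) -> List[dict]:
    """Merge job lists from several source profiles into one sequence."""
    merged: List[dict] = []
    for jobs in source_job_lists:
        for job in jobs:
            match = next((m for m in merged if same_job(m, job)), None)
            if match is None:
                merged.append(dict(job))
            else:
                # Fill in only the fields the existing entry is missing.
                for key, value in job.items():
                    if match.get(key) is None:
                        match[key] = value
    return merged
```

The same pattern would apply to locations and degrees, each with its own similarity function.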


Finally, we classify various profile field values such as company and school into a smaller number of fixed categories to enable meaningful aggregation and analysis. We call this process “normalization.”

An example of normalization would be to normalize free-form variations of “St. Louis, Missouri” as found in different profiles to “St. Louis, MO”. One person might list their location as “Saint Louis Missouri”; another might list “ST Louis MO”; and a third might list “St. Louis Missouri”. Normalization maps all variations of this city name to “St. Louis, MO”. Without the normalization step, aggregate analysis of profiles would be impossible: searching for profiles in “St. Louis Missouri” would automatically exclude profiles where the person wrote their location as “Saint Louis Missouri”.
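The St. Louis example can be sketched with a small rule-based normalizer. This is a deliberately simplified stand-in (a lookup table with two hypothetical dictionaries), not the service-based approach listed below.

```python
import re
from typing import Optional

# Illustrative lookup tables; a real system would cover every state and city.
STATE_ABBREVIATIONS = {"missouri": "MO", "mo": "MO"}
CANONICAL_CITIES = {("saint louis", "MO"): "St. Louis, MO"}


def normalize_location(raw: str) -> Optional[str]:
    """Map a free-form 'city state' string to a canonical label, or None if unknown."""
    # Lowercase, replace punctuation with spaces, and split into words.
    tokens = re.sub(r"[^a-z\s]", " ", raw.lower()).split()
    if len(tokens) < 2:
        return None
    state = STATE_ABBREVIATIONS.get(tokens[-1])
    if state is None:
        return None
    # Expand the abbreviation "st" so all spellings share one lookup key.
    city = " ".join("saint" if token == "st" else token for token in tokens[:-1])
    return CANONICAL_CITIES.get((city, state))
```

All of the variants mentioned above resolve to the same label, so a search on the normalized value finds every one of them.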

  • Geographic Location: Emsi uses the Google Places API to standardize locations to city, state, and county.
  • Job History:
    • Company: Emsi maintains a third-party database of nearly 10 million companies and assigns a freeform company name from a profile to one of these entities. Currently only company headquarters are tracked, not local branches/establishments.
    • Standardized Job Title: Emsi maintains a classifier that accepts a job title and description and, based on text in those fields, assigns one of over 5,000 “standard” job titles.
    • O*NET Code: Emsi maintains a classifier that accepts a job title and description and, based on text in those fields, assigns an O*NET code.
  • Education History:
    • School: Emsi maintains a proprietary database of over 20,000 postsecondary institutions (including distinct campuses and some sub-departments/colleges/schools) and software that assigns a provided school name to one of these entities.
    • Degree Level: Emsi maintains a classifier that converts freeform text to one of five standard levels (HS/GED, associate’s, bachelor’s, master’s, doctorate/professional).
    • Field of Degree (CIP code): Emsi maintains a classifier that converts freeform text naming a college major to a CIP 2010 code.
  • Skills: Emsi maintains a set of over 20,000 recognized skills and a context-aware extraction tool that identifies these skills in profile text.
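As a rough illustration of what one of these classifiers does, the sketch below maps freeform degree text to one of the five standard levels using keyword rules. The patterns are crude assumptions chosen for the example; Emsi's actual classifier is not described and is likely far more sophisticated.

```python
import re
from typing import Optional

# Hypothetical keyword rules, checked from the highest level down.
DEGREE_LEVEL_PATTERNS = [
    ("doctorate/professional", r"\b(ph\.?\s?d|doctor\w*|j\.?d|m\.?d)\b"),
    ("master's", r"\b(master\w*|m\.?s|m\.?a|mba)\b"),
    ("bachelor's", r"\b(bachelor\w*|b\.?s|b\.?a)\b"),
    ("associate's", r"\b(associate\w*)\b"),
    ("HS/GED", r"\b(high school|ged)\b"),
]


def classify_degree_level(text: str) -> Optional[str]:
    """Return the first matching standard level, or None if nothing matches."""
    lowered = text.lower()
    for level, pattern in DEGREE_LEVEL_PATTERNS:
        if re.search(pattern, lowered):
            return level
    return None
```

Text that matches no rule returns `None` rather than a guessed level.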

Filtering “Usable” Profiles

After normalization, we apply some filters to profiles before displaying them in Emsi Profile Analytics. Profiles are filtered out if they meet one or more of these conditions:

  • The most recent job title indicates the person is no longer in the labor force (e.g. retired, volunteer, full-time parenting)
  • We do not know the person’s U.S. state of residence
  • The profile does not have a job history, education history, or skills

Filtering removes a few million profiles.
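The three filter conditions can be expressed as a single predicate. In this sketch the profile field names and the list of out-of-labor-force titles are hypothetical, jobs are assumed to be ordered most recent first, and the third condition is read as "the profile has none of job history, education history, or skills", which is an interpretation of the wording above.

```python
# Illustrative titles only; the real list is not published.
OUT_OF_LABOR_FORCE_TITLES = {"retired", "volunteer", "full-time parent"}


def is_usable(profile: dict) -> bool:
    """Return True if the profile passes all three filters."""
    jobs = profile.get("jobs") or []
    if jobs:
        # Assumes jobs[0] is the most recent job.
        latest_title = (jobs[0].get("title") or "").strip().lower()
        if latest_title in OUT_OF_LABOR_FORCE_TITLES:
            return False
    # The U.S. state of residence must be known.
    if not profile.get("state"):
        return False
    # At least one of job history, education history, or skills must be present.
    if not (jobs or profile.get("education") or profile.get("skills")):
        return False
    return True
```

Applying it is a one-line filter, for example `usable = [p for p in profiles if is_usable(p)]`.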
