
Emsi Data Basic Overview

Fundamentals of Emsi Data

Emsi gathers and integrates economic, labor market, demographic, education, profile, and job posting data from dozens of government and private-sector sources, creating a comprehensive and current dataset that includes both published data and detailed estimates with full United States coverage.

Industry, occupation, education, demographic, job postings, and profiles data are available at national, state, metropolitan area, and county levels. ZIP code estimates are available for employment, earnings, job change, and demographics data. A complete list of Emsi data sources can be found here.

Frequency and Recency of Emsi Data

Emsi’s core LMI data (industry, occupation, education, demographics) is updated with the Emsi quarterly datarun. Each datarun contains the latest data from each of Emsi’s sources. The datarun is released early in the quarter (e.g. the Q2 datarun is typically released in April). Release notes for Emsi’s dataruns can be found here. Release notes also contain information on the age of the major sources that go into Emsi data.

Job postings and profile data are updated every two weeks as the models used to tag various fields are improved. New job postings are available every month, and the latest month’s postings are added a few days into the following month (e.g. September postings are available a few days into October). The changelog for job postings data can be found here. New profiles and updates from Emsi’s sources for profiles are incorporated quarterly. The changelog for profiles data can be found here.

Job Postings

Emsi job postings data is gathered by scraping over 100,000 websites, including company career sites, national and local job boards, and job posting aggregators. Over 1.5 million companies are represented in Emsi data.

Job postings are assessed for likely duplicates using a machine learning algorithm, which determines whether two postings are duplicates based on text similarity, job title, company name, and location. Job postings posted more than six weeks apart are not considered potential duplicates. Duplicate job openings posted in separate cities are not deduplicated and appear as multiple postings.
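
The two hard rules above (the six-week window and the same-city requirement) can be sketched as a pre-filter in front of a similarity check. This is only an illustration: Emsi's actual model is a machine learning algorithm that is not described here, so the plain text-ratio comparison from Python's difflib, the 0.9 threshold, and the Posting fields are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from difflib import SequenceMatcher

MAX_GAP = timedelta(weeks=6)   # postings further apart are never compared
SIMILARITY_THRESHOLD = 0.9     # hypothetical cutoff, not an Emsi parameter


@dataclass
class Posting:
    # Hypothetical record layout for illustration only.
    title: str
    company: str
    city: str
    posted: date
    text: str


def is_candidate_pair(a: Posting, b: Posting) -> bool:
    """Apply the two rules stated in the text before any similarity scoring."""
    if abs(a.posted - b.posted) > MAX_GAP:
        return False  # posted more than six weeks apart
    if a.city != b.city:
        return False  # openings in separate cities stay separate
    return True


def is_likely_duplicate(a: Posting, b: Posting) -> bool:
    """Stand-in for the ML model: exact title/company match plus text similarity."""
    if not is_candidate_pair(a, b):
        return False
    if a.title.lower() != b.title.lower():
        return False
    if a.company.lower() != b.company.lower():
        return False
    ratio = SequenceMatcher(None, a.text, b.text).ratio()
    return ratio >= SIMILARITY_THRESHOLD
```

Keeping the window and city rules in their own function means they are applied no matter how the similarity scoring itself is implemented.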

Each job posting is further enriched with value-add processes, including:
• Job title and company standardization
• Skill extraction and tagging
• SOC and NAICS code determination and assignment
• Education and experience determination

More detail on Emsi’s job postings process is available here.

Profiles

This dataset contains profiles of individual people in the workforce. Each profile contains information unique to each individual, such as job title, company, skills, and education information. Emsi’s profile database currently contains profiles for over a hundred million distinct individuals.

Emsi profiles data is gathered from publicly available information on the web, third-party resume databases and job boards, the recruiting industry, opt-in data from employers and applicant tracking systems, sales and marketing CRM databases, and various consumer/identity databases.

As with job postings, machine learning algorithms are used to deduplicate profiles and enrich the raw data contained in each profile—job titles and company names are standardized, skills are extracted, and education information is standardized.

More information on Emsi’s profiles data is available here.

Industries

Industry data is the backbone of Emsi’s core LMI data. Emsi industry data is data about businesses, categorized by type (hospitals, oil refineries, grocery stores, etc.). The Bureau of Labor Statistics’ Quarterly Census of Employment and Wages (QCEW) dataset provides detailed employment counts and earnings information for 95% of the employed workforce in the United States, broken out by industry. The employment counts provided by this dataset are the gold standard of employment counts throughout Emsi data. Where necessary, Emsi fills in suppressed data points in QCEW using data from the Census Bureau’s County Business Patterns (CBP) dataset. For more information on the extent of suppressions in QCEW and the importance of Emsi’s unsuppression processes, see this article.

Emsi uses other datasets to provide data for the remaining 5% of the employed workforce not covered by QCEW. Emsi uses American Community Survey (ACS) data to provide job counts and earnings data for self-employed workers. Industry job counts and earnings data are available back to 2001.

Emsi projects industry job counts 10 years into the future. Three historical trendlines (last 5 years, last 10 years, last 15 years) are projected forward 10 years and averaged, yielding a raw projected trendline. This trendline is then adjusted slightly by taking into account the BLS’s National Industry-Occupation Employment Matrix (NIOEM) dataset, which contains national-level employment projections. Emsi then adjusts the trendlines to state-level projections published by state LMI offices, yielding Emsi’s final industry projections. A full explanation of the process can be found here. Industry earnings data are not projected.
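
The first step of this process, averaging three projected trendlines, can be sketched roughly as follows. The sketch assumes each trendline is an ordinary least-squares straight line fitted to the most recent 5, 10, or 15 annual observations. The text does not say what functional form Emsi uses, and the later NIOEM and state-level adjustments are not modeled here. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project_window(years, jobs, window, horizon):
    """Fit a line to the last `window` observations and extend it `horizon` years."""
    slope, intercept = fit_line(years[-window:], jobs[-window:])
    last_year = years[-1]
    return [slope * (last_year + k) + intercept for k in range(1, horizon + 1)]


def raw_projected_trendline(years, jobs, windows=(5, 10, 15), horizon=10):
    """Average the projections from each historical window, year by year."""
    projections = [project_window(years, jobs, w, horizon) for w in windows]
    last_year = years[-1]
    return {
        last_year + k: sum(p[k - 1] for p in projections) / len(projections)
        for k in range(1, horizon + 1)
    }
```

Averaging windows of different lengths dampens the effect of a recent spike or dip, because the 5-year line reacts to it strongly while the 15-year line barely moves.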

Occupations

Occupation data presents employment and wage information, categorized by worker type—Registered Nurses, Welders, Web Developers, etc. Occupation job counts are generated by taking industry job counts from QCEW and combining them with staffing patterns from the BLS’s Occupational Employment Statistics (OES) dataset. Staffing patterns are unique to industries and show the percentage breakout of each industry into its component occupations. Emsi regionalizes OES staffing patterns, creating location-specific staffing patterns that take into account the region’s particular industry mix. The result is tailored staffing patterns that generate location-specific occupation employment data.
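
As a rough illustration of how staffing patterns turn industry job counts into occupation job counts, the sketch below multiplies each industry's jobs by the share of that industry held by each occupation and sums across industries. The industry names, shares, and job counts are made up for the example, and the regionalization step is not modeled.

```python
from collections import defaultdict


def occupation_jobs(industry_jobs, staffing_patterns):
    """Sum industry jobs times occupation share across all industries.

    industry_jobs:      {industry: job count}
    staffing_patterns:  {industry: {occupation: share of industry jobs}}
    """
    totals = defaultdict(float)
    for industry, jobs in industry_jobs.items():
        for occupation, share in staffing_patterns.get(industry, {}).items():
            totals[occupation] += jobs * share
    return dict(totals)


# Hypothetical inputs; the shares within each industry sum to 1.
example_industry_jobs = {"Hospitals": 1000, "Offices of Physicians": 400}
example_patterns = {
    "Hospitals": {"Registered Nurses": 0.30, "Other": 0.70},
    "Offices of Physicians": {"Registered Nurses": 0.10, "Other": 0.90},
}
example_result = occupation_jobs(example_industry_jobs, example_patterns)
```

Because each industry's shares sum to 1, the occupation totals add back up to the industry totals, which keeps the two datasets consistent with each other.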

Basic occupation earnings data come from OES as well. Emsi unsuppresses earnings data where necessary and models the MSA-level earnings native to OES down to the county level. Although OES is not published as a time series, Emsi has developed one using historical OES data. This time series offers several benefits, including historical occupation earnings back to 2005, reduced volatility between years of published OES data, and the ability to use historical years of OES to unsuppress latest year OES data. More information on Emsi’s occupation process and historical OES time series is available here.

In some of its products, Emsi also provides earnings estimates for job titles layered with skills. Traditional government LMI provides earnings data for occupations, but job titles and skills are more granular than occupations. Emsi derives estimates for job titles and skills by combining compensation data from more granular worker profiles with occupation-level earnings data from OES using a special compensation model. More information about the compensation model is available here.

Like industry employment data, occupation employment data goes back to 2001 and is also projected 10 years into the future. Projections are generated by applying projected staffing patterns to Emsi’s projected industry employment data. Occupation earnings data are not projected.

Education

Emsi provides data on college enrollments and graduates, as reported in the National Center for Education Statistics’ (NCES) IPEDS dataset. This includes gender and race/ethnicity data for enrollees by school; graduates by school, CIP code, award level, gender, and race/ethnicity; and data on distance completions, as well as information on tuition and other student fees.

IPEDS publishes updates to various aspects of the data throughout the year, and Emsi incorporates the updates as they become available. Generally, new completions data is published in late summer.

For more information on the timing of IPEDS updates, see this article.

Demographics

Demographics data comes largely from the Census Bureau’s Population Estimates Program, which publishes data down to the county level. Emsi demographics show population breakouts by age group, gender, and race/ethnicity.

Emsi creates estimates at the ZIP code level by using American Community Survey (ACS) data to model down to the Census Tract level, then using a tract-to-ZIP code mapping from the Department of Housing and Urban Development (HUD) to map from tracts up to ZIP codes. For more information on the creation of ZIP-level demographics, click here.
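
The tract-to-ZIP step can be pictured as a weighted allocation: each tract's population is split among the ZIP codes it overlaps, and the pieces are summed by ZIP code. The sketch below assumes the mapping is a list of (tract, ZIP code, share of tract) rows. The identifiers, shares, and populations are invented for the example, and the modeling of ACS data down to the tract level is not shown.

```python
def tracts_to_zips(tract_population, crosswalk):
    """Allocate tract populations to ZIP codes using share weights.

    tract_population: {tract_id: population}
    crosswalk:        iterable of (tract_id, zip_code, share_of_tract) rows,
                      where the shares for each tract sum to 1
    """
    zip_population = {}
    for tract_id, zip_code, share in crosswalk:
        allocated = tract_population[tract_id] * share
        zip_population[zip_code] = zip_population.get(zip_code, 0.0) + allocated
    return zip_population


# Hypothetical tracts and ZIP codes for illustration only.
example_tracts = {"tract_A": 1000, "tract_B": 500}
example_crosswalk = [
    ("tract_A", "00001", 0.6),
    ("tract_A", "00002", 0.4),
    ("tract_B", "00002", 1.0),
]
example_zip_population = tracts_to_zips(example_tracts, example_crosswalk)
```

As long as each tract's shares sum to 1, the total population is preserved when moving from tracts to ZIP codes.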

Emsi uses a cohort model to project demographics data forward 10 years.

Submit a Question

Let us know what specific questions we can help you with (we may even add your question to our knowledge base).
