This article outlines the creation of Emsi’s job postings data, from the collection of postings to enrichment of the data.
It is important to note that job postings are not necessarily the same as job vacancies; there is a correlation, but many recruitment practices make it an imperfect relationship. Job postings are a measure of recruitment marketing by employers purportedly looking to fill job vacancies.
Emsi’s postings data is gathered by scraping over 100,000 websites, including company career sites, national and local job boards, and job posting aggregators. Postings for over 1.5 million companies are scraped.
Users often ask about the absence of postings from LinkedIn and Indeed in Emsi’s job postings. Both sources have asked that their sites not be scraped for postings; therefore Emsi does not collect or display postings from either source.
Job postings are assessed for likely duplicates, which are collapsed into a single posting when sufficient data is present. Deduplication is the process of identifying job postings that are connected to the same vacancy. Multiple copies of a particular posting are often scraped from various sources on the internet. Rather than allowing these duplicates to artificially inflate the posting count, Emsi deduplicates the data before presenting it for analysis.
The deduplication process uses a machine learning algorithm to determine whether two job postings are duplicates; two postings that are duplicates are usually not exactly identical. The algorithm is a statistical classifier that has been trained to detect duplicates by comparing a number of fields in the postings, including location, job title, similarity of posting text, contact information in the posting, and company name. Otherwise identical postings that are posted in separate cities will not be deduplicated and will appear as multiple job postings.
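The field comparison described above can be sketched roughly as follows. This is an illustration only, not Emsi's implementation: the Posting class, the pair_features and is_candidate_pair functions, and the use of Python's difflib for text similarity are all assumptions made for the example, and the trained classifier that would consume these features is not shown.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Posting:
    # Hypothetical record layout; the real schema is not described in this article.
    title: str
    company: str
    city: str
    state: str
    contact: str
    body: str


def text_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings match exactly."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(p: Posting, q: Posting) -> dict[str, float]:
    """Comparison features a duplicate classifier could be trained on."""
    same_location = (p.city.lower(), p.state.lower()) == (q.city.lower(), q.state.lower())
    same_contact = bool(p.contact) and p.contact.lower() == q.contact.lower()
    return {
        "same_location": float(same_location),
        "title_similarity": text_similarity(p.title, q.title),
        "body_similarity": text_similarity(p.body, q.body),
        "same_contact": float(same_contact),
        "company_similarity": text_similarity(p.company, q.company),
    }


def is_candidate_pair(p: Posting, q: Posting) -> bool:
    """Postings in separate cities are never treated as duplicates."""
    return pair_features(p, q)["same_location"] == 1.0
```

In a sketch like this, only pairs that pass is_candidate_pair would be handed to the classifier, which keeps postings from separate cities apart regardless of how similar their text is.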
Duplicate postings are stored and tracked along with original postings, ensuring that both total and unique (deduplicated) posting counts are available.
In addition to the deduplication process described above, job postings are deduplicated over time to account for new postings appearing for the same vacancy after the other postings for the vacancy expired.
A vacancy is considered expired or closed when none of its postings, including duplicates, are active. For instance, a vacancy with three total postings is considered expired when all three associated postings are no longer active. However, a new posting sometimes appears for a vacancy after it has expired. In these cases, if the new posting appears within six weeks of the vacancy’s expiration, Emsi revives the vacancy and counts the new posting as another duplicate. If all prior postings have expired and the new posting appears more than six weeks later, it is not considered a potential duplicate.
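The expiration and revival rules can be expressed as a short sketch. The function names are hypothetical, and treating a gap of exactly six weeks as still inside the window is an assumption; the article only says "within six weeks" and "more than six weeks apart".

```python
from datetime import date, timedelta

# Six-week revival window from the article; the inclusive boundary is an assumption.
REVIVAL_WINDOW = timedelta(weeks=6)


def is_expired(posting_is_active: list[bool]) -> bool:
    """A vacancy is expired when none of its postings (duplicates included) is active."""
    return not any(posting_is_active)


def can_revive(expired_on: date, new_posted_on: date) -> bool:
    """True if a new posting appears soon enough to revive the expired vacancy."""
    gap = new_posted_on - expired_on
    return timedelta(0) <= gap <= REVIVAL_WINDOW
```

For example, a vacancy that expired on January 1 would be revived by a matching posting dated February 12 (42 days later) but not by one dated February 13.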
Once the postings data is scraped and deduplicated, it undergoes further enrichment and cleaning.
A company (advertiser) is assigned to each job posting based on the text present in the posting. This data includes normalized company name, NAICS (industry) code, company size, company location, whether the company is a staffing company, and other information. All subsidiary entities are reported as the top-level corporate enterprise.
Emsi assigns an education level to each posting using a machine learning model that detects the presence of required or preferred education levels. If more than one education level is mentioned, the posting is tagged with all levels mentioned. Potential values are Unspecified, High School/GED, Associate’s Degree, Bachelor’s Degree, Master’s Degree, and Ph.D./Professional Degree.
Postings are tagged as full-time (more than 32 hours) or part-time (32 hours or less). If the posting does not specify, full-time is assumed.
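As a minimal sketch, the tagging rule amounts to a threshold with a default. The function name is hypothetical, and the assumption that the hours have already been extracted from the posting as a number (or are missing) is made for illustration; the article does not describe how hours are extracted.

```python
from typing import Optional

FULL_TIME_THRESHOLD_HOURS = 32  # threshold stated in the article


def employment_type(hours: Optional[float]) -> str:
    """Tag a posting as full-time or part-time; unspecified hours default to full-time."""
    if hours is None:
        return "full-time"
    return "full-time" if hours > FULL_TIME_THRESHOLD_HOURS else "part-time"
```

Note that a posting listing exactly 32 hours falls on the part-time side of the threshold.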
The number of years of experience required for the position is captured where available.
Country, city, and state information is usually present in the postings and is easily retrieved during the collection process. The city-state is generally shown as it was captured from the posting during scraping. This location represents the location of the posting and may not represent the location of the job vacancy. It is not uncommon for companies to post a job in other markets to attract talent.
Emsi also maps postings to traditional MSAs using a mapping that links MSAs to the city-state combinations found in job postings. A similar process is used to map city-states to counties.
Most US cities are geographically located in only one county, but some span several counties. When a posting is found in one of these cities, a weighted dice roll is used to determine which county to assign the vacancy to. The counties are weighted based on the number of business addresses present within each county as determined by the USPS Delivery Statistics (DelStat) dataset. In a two-county city where one county has twice as many business addresses as the other, that county will generally be assigned about twice as many postings.
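A weighted dice roll of this kind can be sketched with Python's random.choices. The function name, county names, and address counts below are invented for illustration, and the actual DelStat lookup is not shown.

```python
import random


def assign_county(business_addresses: dict[str, int], rng: random.Random) -> str:
    """Pick one county at random, weighted by its count of business addresses."""
    counties = list(business_addresses)
    weights = [business_addresses[county] for county in counties]
    return rng.choices(counties, weights=weights, k=1)[0]


# Hypothetical two-county city: County A has twice the business addresses of County B.
example_weights = {"County A": 2000, "County B": 1000}
rng = random.Random(0)  # seeded only so the example is repeatable
assigned = [assign_county(example_weights, rng) for _ in range(30000)]
share_a = assigned.count("County A") / len(assigned)
```

Over many postings, share_a should come out close to two thirds, which matches the two-to-one split described above. Any single posting can still land in either county.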
Read more about Emsi skills here.
Some job postings include the salary or salary range of the vacancy. Emsi extracts and cleans this information and includes it in the dataset when it is a likely and reasonable reflection of the position. Approximately 20% of all job postings contain salary information; this amount varies by industry. It is important to note that this is an advertised salary and not labor market data.
For more information about using advertised salary data, see this article.
All job postings are scanned for the presence of language indicating that the advertised position can be filled by a remote worker. This involves analyzing the text of each posting’s title and body for remote language. Many words and phrases are used to indicate a remote position, including “remote”, “position can be located anywhere”, “work from home”, “telecommute”, and others. Postings containing language indicative of a remote role are flagged as remote. It should be noted that this definition is broad enough to include postings that require that a person live in a particular region although coming in to an office is not required.
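A simple phrase-matching version of this scan might look like the sketch below. It uses only the example phrases quoted above; the full phrase list and the actual detection method are not given in this article, so the pattern and function name here are assumptions.

```python
import re

# Only the example phrases quoted in the article; the real list is longer.
REMOTE_PHRASES = [
    "remote",
    "position can be located anywhere",
    "work from home",
    "telecommute",
]

# Case-insensitive match on whole words or phrases.
_REMOTE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(phrase) for phrase in REMOTE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_remote(title: str, body: str) -> bool:
    """Flag a posting as remote if its title or body contains remote language."""
    return bool(_REMOTE_PATTERN.search(title) or _REMOTE_PATTERN.search(body))
```

Plain phrase matching like this is deliberately broad and would also flag unrelated uses such as "remote control", so a production approach would likely need more context than this sketch considers.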