In recent months, ChatGPT has dominated the news headlines,
and people are both excited and scared by its sophisticated
ability to generate text. Besides short- and long-form text generation,
there are quite a few other use cases that provide a lot of practical
value. With the current generation of large language models
(LLMs), many classic tasks in Natural Language Processing (NLP),
such as text classification, sentiment analysis, or named entity
recognition, are almost trivial to solve.
In this article, I document some experiments on using
GPT-3 (update: and 3.5) to extract structured information
from unstructured texts. I hope the article can serve as a tutorial
for how to approach such a task with an LLM.
With about 500 job posts per month, some of them advertising multiple
roles, we end up with over 1,000 jobs per month. As you can see on the
site, there are several filters such as "remote", "part-time", etc.,
which allow you to narrow down the choices by some general
characteristics.
However, a job board usually also lets you select a job
category (e.g., "iOS developer"). Since the "Who is hiring" threads are
not limited to certain types of jobs, and there are often interesting
roles that do not fit into the usual categories, I tried a different
approach to sorting the jobs according to one's interests: for each
job, a text embedding of the job description (including the company
description) is created, which can then be used to sort the other jobs
by similarity to a selected job.
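The similarity sorting described above can be sketched roughly as follows. This is only an illustration under assumptions: the article does not say which similarity measure is used, so cosine similarity is assumed here, and the three-dimensional vectors, the job IDs, and the function names are made up for the example (real embeddings would come from an embedding model and have many more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def sort_by_similarity(selected_id, embeddings):
    """Return (job_id, score) pairs for all other jobs, most similar first."""
    target = embeddings[selected_id]
    scored = [
        (job_id, cosine_similarity(target, vector))
        for job_id, vector in embeddings.items()
        if job_id != selected_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, standing in for real model output.
toy_embeddings = {
    "ios-dev": [0.9, 0.1, 0.0],
    "android-dev": [0.8, 0.2, 0.1],
    "data-scientist": [0.1, 0.9, 0.2],
}

ranked = sort_by_similarity("ios-dev", toy_embeddings)
for job_id, score in ranked:
    print(f"{job_id}: {score:.3f}")
```

With these toy vectors, the Android role ranks above the data science role when the iOS role is selected, which is the kind of ordering a category filter would otherwise have to provide.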
Your task is to parse an unstructured job posting and turn it into a JSON containing the most important information. The job posting can describe one or more jobs at the same company. The JSON should consist of the following information:
- The company name (field name: "companyName", field type: string)
- the location of the company (field name: "companyLocation", field type: string); if not explicitly stated, you can try to infer the company's actual location from other clues, e.g., something like "Remote (US)" usually means that the company is located in the US; if the location cannot be inferred, set it to null
- a short description of what the company is doing or building (field name: "companyDescription", field type: string); try to keep it short (max length: ca. 300 characters)
- a list of advertised jobs (field name: "jobs", field type: array).
Each element of the "jobs" array should contain the following fields:
- The job title (field name: "jobTitle", field type: string); the job title should be given in the singular form (i.e., Frontend Developer instead of Frontend Developers)
- the salary range (field name: "salary", field type: string); only include explicitly stated salary amounts, otherwise set to null
- whether equity is part of the compensation (field name: "equity", field type: boolean)
- the benefits (field name: "benefits", field type: string); include things like 401k, insurance, equipment, child care, etc. if stated, otherwise set to null
- the location of the job (field name: "location", field type: string)
- whether this is a job for senior/experienced candidates (field name: "senior", field type: boolean); typically senior, staff, lead, principal, vp, cto, etc. positions are all regarded as senior level
- whether it is a remote opportunity (field name: "remote", field type: boolean)
- whether it can be done onsite from an office (field name: "onsite", field type: boolean)
- whether it can be done part-time (field name: "partTime", field type: boolean)
- whether it can be done full-time (field name: "fullTime", field type: boolean)
- the URL to the specific job description (field name: "jobUrl", field type: string)
- and any specific requirements/skills that might be stated (field name: "requirements", field type: string).
In general, if certain information is not stated, set the respective field to null. If the company seeks more than one person for the same role, include the role only once. Please output only the pure JSON representation. Do not include any explanations, comments, thoughts, etc. The output has to be a valid JSON object which can be parsed as is.
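
Since the prompt asks for output that "can be parsed as is", it can be worth checking the response before using it. The following is a minimal sketch of such a check, not the code used for the site: the field names and types are taken from the prompt above, while the function names, the rule that every field may also be null, and the sample response are assumptions made for illustration.

```python
import json

# Field names and types as listed in the prompt; null is accepted everywhere.
COMPANY_FIELDS = {
    "companyName": str, "companyLocation": str,
    "companyDescription": str, "jobs": list,
}
JOB_FIELDS = {
    "jobTitle": str, "salary": str, "equity": bool, "benefits": str,
    "location": str, "senior": bool, "remote": bool, "onsite": bool,
    "partTime": bool, "fullTime": bool, "jobUrl": str, "requirements": str,
}


def check_fields(obj, expected, prefix):
    """List missing fields and fields whose type does not match."""
    problems = []
    for name, expected_type in expected.items():
        if name not in obj:
            problems.append(f"{prefix}{name}: missing")
        elif obj[name] is not None and not isinstance(obj[name], expected_type):
            problems.append(f"{prefix}{name}: expected {expected_type.__name__}")
    return problems


def validate_posting(raw):
    """Parse the model output and return (data, list_of_problems)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return None, [f"invalid JSON: {err}"]
    if not isinstance(data, dict):
        return None, ["top level is not a JSON object"]
    problems = check_fields(data, COMPANY_FIELDS, "")
    jobs = data.get("jobs")
    if isinstance(jobs, list):
        for i, job in enumerate(jobs):
            if isinstance(job, dict):
                problems += check_fields(job, JOB_FIELDS, f"jobs[{i}].")
            else:
                problems.append(f"jobs[{i}]: not an object")
    return data, problems


# Made-up response for a fictional company, used only to exercise the check.
sample_response = """
{"companyName": "Acme", "companyLocation": null,
 "companyDescription": "Builds example widgets.",
 "jobs": [{"jobTitle": "Frontend Developer", "salary": null, "equity": true,
           "benefits": null, "location": "Remote", "senior": false,
           "remote": true, "onsite": false, "partTime": false,
           "fullTime": true, "jobUrl": null, "requirements": "TypeScript"}]}
"""
data, problems = validate_posting(sample_response)
print(problems)
```

A response that fails the check could be retried or flagged for manual review instead of being silently stored.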