Fine-tune a better-than-GPT-4 classifier for $30

Classification
Heard of fine-tuning but not sure whether it's worth it? Nick Webb and Parn Boonyamanee explain how to fine-tune a GPT-3.5 classifier to get fast, high-quality results

We're excited to share how you can improve the quality of a GPT-3.5-based classifier to GPT-4++ standards using nothing but GPT-4 data itself - all for less than you might spend on a few beers.

Reading marketing materials about AI products, you'd be forgiven for thinking that all developers are building their own models from scratch.

While that may have been true before the launch of ChatGPT (and is still true in various specialized domains), most of the AI products you see on the shelf are orchestrations of prompts and data "wrapping" stock LLMs like OpenAI's GPT-3.5 or GPT-4.

Simply put, the amazing thing about the GPT-n series is that if you can translate a problem into a language they understand (i.e. English) they can deliver good results - far better in many ways than specialized models trained for a specific task. Sure, they may be massively less efficient, but in absolute terms the costs are very low.

As a result, tasks like classification ("is this passage written about a bird or a fish?") are now commonly delegated to LLMs.

The emergence of GPT-4 sets a quality bar

From a product design point of view, GPT-4 presents something of a dilemma.

On the one hand, it has excellent performance and in practice vastly outperforms GPT-3.5.

On the other, it's slow and much more expensive (30x).

What if you wanted to have your cake and eat it? Well, you can! Read on to find out how.

What is fine-tuning? Marketing-speak vs reality

A quick sidebar. When marketers colloquially talk about "training" an AI product, they might mean one of three things:

  1. They have trained a model from scratch. This is rare and expensive. Most people are not doing this. Explaining how this works is beyond the scope of this article but there are lots of other articles you can read on it.
  2. They are fine-tuning a pre-trained LLM. This is about taking an existing model and showing it a number of examples specific to a given use case or problem domain to create a new, derivative model.
  3. They are prompt engineering and want it to sound fancy. There's nothing inherently wrong with this approach, but as we'll see below there are limits to what you can achieve with prompt engineering alone.

Why fine-tuning?

In reality, most creators are not in a position to train a model from scratch. So we're really picking between fine-tuning and prompt engineering (or a combination of the two).

Fine-tuning may be a solution in three situations:

1. You want to change the way a model "talks".

One of my favorite applications of fine-tuning is generating content in a specific style, such as a custom tone. Previously, we tried to achieve this with an inline prompt, for example by adding "write ... in Elon Musk style". However, the results were not very good. The fine-tuned model performs much better.

Harrison Chase, the founder of LangChain, created an example of this with the Elon Musk Tweet Generator website. The site generates a Tweet in Elon Musk's tone and allows for a comparison between the fine-tuned tweet and the inline-prompted tweet.

[Screenshot: the Elon Musk Tweet Generator, comparing the fine-tuned tweet with the inline-prompted one]

One of the challenges with LLMs is that answering the question "why is the fine-tuned version better?" is extremely hard. In this case, my belief is that the prompt does influence the model, but Musk's tweets are too small a fraction of the overall training dataset for the model to go full-Musk when simply prompted. However, it's not impossible that a more sophisticated prompt, e.g. few-shot learning, could work.
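If you want to try the few-shot route yourself, a minimal sketch looks like the below. The example tweets are placeholders you'd replace with real ones, and everything here assumes the pre-1.0 openai Python SDK with an API key already configured.

import openai

# Seed the conversation with real tweets as in-context examples, then ask
# for a new one. The placeholder tweets below are not real data.
few_shot_messages = [
    {'role': 'system', 'content': 'You write tweets in the style of Elon Musk.'},
    {'role': 'user', 'content': 'Write a tweet about rockets.'},
    {'role': 'assistant', 'content': '<paste a real Musk tweet about rockets here>'},
    {'role': 'user', 'content': 'Write a tweet about AI.'},
    {'role': 'assistant', 'content': '<paste a real Musk tweet about AI here>'},
    {'role': 'user', 'content': 'Write a tweet about electric cars.'},
]
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=few_shot_messages,
    temperature=0.7,  # leave some room for stylistic variation
)
print(response.choices[0]['message']['content'])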

Fine-tuning or prompt engineering: a toss-up

2. You want to teach a model about a new knowledge domain

You might want to teach the model facts. For instance, you might want to teach an online takeout ordering chatbot your restaurant's menu. Or you might want to teach your HR chatbot your company manual.

The challenge here is that fine-tuning doesn't eliminate hallucinations. Unlike with prompt engineering, where you know exactly what internal company information the LLM had access to, fine-tuning is a new (potentially better) black box.
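To make that contrast concrete, here is a minimal sketch of the prompt-engineering approach, where the facts travel with every request so you know exactly what the model saw. The menu is invented for illustration, and we again assume the pre-1.0 openai SDK.

import openai

# The model can only have "seen" what we put in the prompt - the menu below.
menu = """
Pad Thai - $12
Green Curry - $14
Mango Sticky Rice - $7
"""
messages = [
    {'role': 'system', 'content': 'You take orders for a restaurant. Only offer items from this menu:\n' + menu},
    {'role': 'user', 'content': 'Do you have any desserts?'},
]
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=messages,
    temperature=0,
)
print(response.choices[0]['message']['content'])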

Because fine-tuning requires a training and a test dataset, you may also find that much of what you think the model has been trained on is in fact never "taught" to the model.

Use case quality: unproven. The real issue here is assessing whether answer quality in fact improves.

3. You want to improve a model's performance on a simple but important task

Now we're getting to the good bit! Prior to LLMs, many AI applications were simple classifiers. For instance, given a tweet, classify its sentiment as positive, negative or neutral. PhD theses have been written on this problem.

LLMs are an extremely flexible (if heavyweight) approach to this problem. If your input is in (or can be translated to) words, you can use an LLM to classify it.
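As a toy sketch of just how little is needed (again assuming the pre-1.0 openai SDK and a configured API key):

import openai

# A zero-shot sentiment classifier: the entire "model" is one prompt.
messages = [
    {'role': 'system', 'content': 'Classify the sentiment of the tweet as positive, negative or neutral. Answer with the label only.'},
    {'role': 'user', 'content': 'Just spent an hour on hold with support. Great start to the day.'},
]
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=messages,
    temperature=0,
)
print(response.choices[0]['message']['content'])  # e.g. "negative"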

As we'll show you, classification is an area where fine-tuning can supercharge your performance. This isn't exactly unexpected, but the extent to which it works is frankly staggering in terms of quality, cost and speed alike.

Use case quality: very good!!

Why a classifier?

For background: Harriet is an AI assistant who organizes your internal company policy and process data and operationalizes it - helping employees self-serve with benefits and HR questions, enabling sales reps to answer vendor security questionnaires, and interacting with your internal IT ecosystem so that your coworkers don't have to do boring admin by hand.

At Harriet, we use classification in many places in our application. For instance, when we receive a message from a user, the first thing we do is attempt to match it to the user's "intent" - what are they trying to do? We call this a "router".

The classifier has access to the conversation history and is expected to choose a single "destination".

[Diagram: the router matching an incoming message to one of several destinations]
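In sketch form - this is illustrative, not our production code - the router looks something like this. The destination names are the simplified set we'll use in the walkthrough below:

import openai

ROUTER_PROMPT = """Given the conversation so far, choose the single best
destination for the latest user message. Answer with the destination name only.

Destinations:
    get_policy: the user wants to look up a company policy
    book_holiday: the user wants to book time off
    get_payslip: the user wants a payslip
    escalate: anything that needs a human in HR
"""

history = [
    {'role': 'user', 'content': "I'm not feeling well today, can you book a day off for me?"},
]
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'system', 'content': ROUTER_PROMPT}] + history,
    temperature=0,
)
destination = response.choices[0]['message']['content'].strip()  # e.g. "book_holiday"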

We were finding that the number of routing errors was forcing us to build a sort of "safety net" of workarounds and secondary classifiers for known issues ("if a user's request ends up here but it has xyz characteristic, in fact it should go to this different destination"). This was neither elegant nor easy to reason about and led to unexpected behaviour.

But improving the classifier prompt was not an option. As the prompt got longer, it got more complicated and slower. And as we added more examples, it overfitted: bugs appeared in cases it had previously handled well.

GPT-4 vs 3.5

We performed a number of assessments using GPT-4 and GPT-3.5, on both GPT-4-generated synthetic data and real user-generated data. Results were as follows:

Model           Synthetic data   User-generated data
gpt-3.5-turbo   70%              52%
gpt-4           88%              76%

To nobody's surprise, GPT-4 did better. But it's waaaay slower. So we decided to fine-tune.

[Chart: cost vs speed for GPT-3.5 and GPT-4]

Fine-tune GPT-3.5: step by step

Before we begin, we need to import all the packages we need and set up an OpenAI API key. The code below uses the pre-1.0 openai Python SDK and assumes the key is available in your local environment as OPENAI_API_KEY.

import os
import openai
import json
import random
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

openai.api_key = os.getenv('OPENAI_API_KEY')

1. Generate a dataset

Firstly, we need a dataset. You can use a real dataset or generate one, depending on your situation.

In our case, we will use both GPT-4-generated data and real-life data. We found that synthetic data worked well for this use case, but YMMV.

We experimented with different volumes of training data. Even small datasets (we initially used 400 examples) seem to have a huge impact on quality.

dataset = []

# Data from GPT4
messages = [
    {'role': 'system', 'content': 'You are an employee at a 200-FTE company'},
    {'role': 'user', 'content': 'Give me 25 common questions employees ask HR team. Answer in JSON format, eg: [{"question": "..."}, {"question": "..."}]'}
]
response = openai.ChatCompletion.create(
    model='gpt-4',
    messages=messages,
    temperature=0,
)
data_from_gpt4 = json.loads(response.choices[0]['message']['content'])

for question in data_from_gpt4:
    dataset.append({'question': question['question']})  # each item is {"question": "..."}

# User data
data_from_human = [
    {"question": "Can you get my payslip for last month?"},
    {"question": "I need more information from HR team on our maternity policy. Could you let them know?"},
    {"question": "I'm not feeling well today, can you book a dayoff for me?"},
    {"question": "How can I file my expenses from client meeting?"},
    {"question": "Can you ask HR to get me a new laptop?"},
]

# Save dataset
dataset = dataset + data_from_human
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
with open("data/raw.json", "w") as outfile:
    json.dump(dataset, outfile)

print('Dataset: ', dataset)

2. Label the dataset

Our initial dataset is just questions. To train, we need to give the model the "right" answers.

We used GPT-4 to generate the answers for each question. Our initial question was not "can we get human-like performance?" but rather "can we get GPT-4-like performance?", so it seemed simpler and cheaper to start with automatic labelling. Including human-labelled data did, in fact, further improve the model's performance.

📝 Note: Below we shuffle the order of the actions in the prompt afresh for each labelling call (and shuffle the questions too). We wanted the model to pay as much attention as possible to the semantic content of the prompt rather than using shortcuts such as the order of appearance in the prompt. This is an area that may require detailed thinking as you prepare your dataset.

# Retry with exponential backoff so we don't trip over OpenAI rate limits
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(20))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)

def get_actions():
    actions = [
        '    get_policy: Find policy - ',
        '    book_holiday: Book holiday -',
        '    get_payslip: Get payslip - ',
        '    escalate: Escalate to HR - ',
    ]
    random.shuffle(actions)
    return '\n'.join(actions)
    
def get_prompt():
    prompt = """
    You are a question analyst.
    
    Given a raw text input to a language model, select the next action best suited for the input.
    You will be given the names of the available actions and a description of what each action is best suited for.

    Answer in JSON format. Eg. {"action": "get_policy"}
    
    Action:
"""
    return prompt + get_actions()

dataset = []

with open('data/raw.json') as f:
    questions = json.load(f)
    random.shuffle(questions)

    for question in questions:
        q = question['question']
        print(f'Asking: {q}')
        messages = [
            {'role': 'system', 'content': get_prompt()},
            {'role': 'user', 'content': q}
        ]
        response = completion_with_backoff(
            model='gpt-4',
            messages=messages,
            temperature=0, 
        )
        response_json = json.loads(response.choices[0]['message']['content'])
        dataset.append({
            'question': q,
            'action': response_json['action'],
        })
        print(f"Response: {response_json}")

json_object = json.dumps(dataset, indent=4)
with open("data/processed.json", "w") as outfile:
    outfile.write(json_object)

print('Dataset: ', dataset)

3. Format the dataset

OpenAI fine-tuning only accepts datasets in its chat message format, so we need to convert ours first. Each training example contains three message roles:

  • "role": "system" - the prompt
  • "role": "user" - the question
  • "role": "assistant" - the expected output

After formatting, we need to partition the dataset into training and validation sets, saved as .jsonl files - a format where each line is a JSON object.
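For illustration, a single line of the finished training file looks roughly like this (system prompt abridged; the real one is the full router prompt from step 2):

{"messages": [{"role": "system", "content": "You are a question analyst. ..."}, {"role": "user", "content": "Can you get my payslip for last month?"}, {"role": "assistant", "content": "{\"action\": \"get_payslip\"}"}]}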

with open('data/processed.json') as f:
    questions = json.load(f)

dataset = []
for question in questions:
    dataset.append({
        'messages': [
            {'role': 'system', 'content': get_prompt()},
            {'role': 'user', 'content': question['question']},
            {'role': 'assistant', 'content': json.dumps({'action': question['action']})},  # valid JSON, e.g. {"action": "get_payslip"}
        ]
    })
    
def save_to_jsonl(conversations, file_path):
    with open(file_path, 'w') as file:
        for conversation in conversations:
            json_line = json.dumps(conversation)
            file.write(json_line + '\n')

            
# train 80%, validate 20%
train_ratio = 0.8 
num_train = int(len(dataset) * train_ratio)
save_to_jsonl(dataset[:num_train], 'data/train.jsonl')
save_to_jsonl(dataset[num_train:], 'data/validate.jsonl')

4. Fine-tune a model

This step is super simple. We just need to upload our files and tell OpenAI to start fine-tuning. You can do it with the code below.

# Upload dataset
training_file_name = './data/train.jsonl'
validation_file_name = './data/validate.jsonl'

training_response = openai.File.create(
    file=open(training_file_name, "rb"), purpose="fine-tune"
)
training_file_id = training_response["id"]

validation_response = openai.File.create(
    file=open(validation_file_name, "rb"), purpose="fine-tune"
)
validation_file_id = validation_response["id"]

print("Training file id:", training_file_id)
print("Validation file id:", validation_file_id)
suffix_name = "routing-demo"

response = openai.FineTuningJob.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-3.5-turbo",
    suffix=suffix_name,
)
print("Response: ", response)

You can check the status on the OpenAI Dashboard.

Time taken depends on the size of your dataset.
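If you'd rather poll from code than refresh the dashboard, a minimal sketch (the job id comes from the response above, using the same pre-1.0 SDK):

import time

job_id = response['id']

# Poll until the job reaches a terminal state
while True:
    job = openai.FineTuningJob.retrieve(job_id)
    print('Status:', job['status'])
    if job['status'] in ('succeeded', 'failed', 'cancelled'):
        break
    time.sleep(60)

print('Fine-tuned model:', job.get('fine_tuned_model'))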

5. Test the model

After OpenAI finishes training your model, you will get an email containing the fine-tuned model's name (or you can just go to your dashboard). Copy and paste the model name into the code.

🎉 Tada! Your fine-tuned model is ready to use! We recommend validating its performance next.

model = 'ft:gpt-3.5-turbo-0613:personal:routing-demo:xxxxx'
question = 'Can you find my payslip for June 2023?'
messages = [
    {'role': 'system', 'content': get_prompt()},
    {'role': 'user', 'content': question}
]
response = completion_with_backoff(
    model=model,
    messages=messages,
    temperature=0, 
    max_tokens=500,
)
print("Response: ", response.choices[0]['message']['content'])
# {"action": get_payslip}

Performance

We found that fine-tuning had a very significant impact on performance:

The fine-tuned model actually performs better than GPT-4:

Model                     Synthetic data   Error reduction vs GPT-3.5   User-generated data   Error reduction vs GPT-3.5
gpt-3.5-turbo             70%              n/a                          52%                   n/a
gpt-4                     88%              60%                          76%                   50%
gpt-3.5-turbo-finetuned   88%              60%                          78%                   54%

("Error reduction" is the share of GPT-3.5's mistakes that are eliminated: going from 70% to 88% accuracy, for example, fixes 60% of the errors.)

As you can see in the table above, we found that the fine-tuned model performs slightly better than GPT-4 (although the difference was within the margin of error given our sample size).

This is particularly surprising given that both the questions and the labels were generated by GPT-4; we only used human labels for assessing overall performance.

Not to overlook the other good news: the fine-tuned model, while more expensive than vanilla GPT-3.5 (3x the cost), is 1/10th the cost of GPT-4 - consistent with the 30x gap we mentioned at the top. And (for our use case, more importantly) it's also much faster.

Conclusions

We were so surprised by the fine-tuned model's performance that we thought we had made a mistake somewhere. It just didn't seem plausible that it worked so well. As far as we can work out, we haven't.

We were also gratified to learn that we hadn't overtrained the model. For instance, our training data used string identifiers for routing destinations, but our production code actually uses integer codes to identify classifications. Our fine-tuned model handled the switch without any noticeable downgrade in performance.

As we said at the top, using GPT as a classifier may seem like the wrong tool for the job. However, when you have such powerful tools for refining and adding secret sauce on top of an abstraction as easy to understand and tweak as GPT-3.5 (we can continue to prompt-engineer on top of the fine-tuned model where needed), you can massively increase the confidence you have in your application's moving pieces.

Just because fine-tuning works well in this domain doesn't mean it's the right thing to do for you. But if you haven't tried it, you might be missing a trick.

Parn Boonyamanee is a software engineer at Harriet. Nick Webb @nickwebb is an independent consultant specializing in AI model design, fine tuning and validation.

More about Harriet

Harriet brings together all your company systems in a single Slack conversation so your team can forget about admin and get their real work done. Want to find out more? Book a call now.