[Journey] Reverse Engineering Google with AI (Fine-tuning only. Advanced level)

I'm building a ton of interesting AI models so I've decided to create a journey for people to follow along.

ChatGPT Sucks

The first thing to point out is this is not some lame-ass journey using ChatGPT to do more lame-ass ChatGPT stuff. :)

ChatGPT is a steaming pile of poop for a great number of tasks. It's super basic and you can find prompts all over the internet, so people love to hype it up and sell stuff like "97 Amazing SEO Prompts to Make you Big Fat Dollaz" and other such nonsense.

The reality is, ChatGPT is made for dialogue. It's only good at general chat and Q&A with people. It's hugely overused because it's all most people really understand. That gives anyone who understands fine-tuning a ridiculously enormous competitive advantage. Heck, I doubt any of the AI marketing/SEO companies do a single bit of fine-tuning. They're all using GPT-3.5 and GPT-4.

No joke. This image represents the situation.

The tiny dot in the middle is what you can achieve with ChatGPT (GPT-3.5).

The slightly bigger circle around it is what you can do with GPT-4.

The massive circle is what you can achieve with fine-tuning.

And the crazy thing is, you can in a lot of cases achieve GPT-4-level performance on a specific downstream task by fine-tuning for that downstream task with models as small as curie. And for certain problems you can do it with ada/babbage. You don't even need davinci for a lot of stuff, but a fine-tuned davinci will easily match or beat GPT-4 for almost anything.

There is immense untapped power and capabilities within even the small models that you can only tap into by fine tuning for a specific downstream task.


What I'll be sharing

I'll be sharing a fair amount of what I'm doing, but I won't be going into intensive detail, especially with hyperparameters and the specific formats for the fine-tuning, because if I go into too much detail other companies can replicate it.

But I will give a high-level overview of the journey, showing the results and talking about the individual fine-tunes: how they work, the models used, and what kind of data I use to fine-tune them (just not my exact training data format).

The core plan is to use fine-tuned machine learning models to give very accurate predictions within SEO. I'll list out the first models I'm working on.

Initial Models

  • Classify a webpage
  • Generate a list of keywords that a webpage should rank for on page 1 of Google, based on the headers and title (the outline).
  • Generate a list of keywords that a webpage should rank for on page 1 of Google, based on the full article (limited depending on the model I choose, because of the size of the context window).
  • Generate a list of keywords that a webpage should rank for on page 1 of Google, based on the outline plus other extra data: things like word count, topical authority scores, backlinks. This can be very varied.
  • Generate an outline for an article when given a keyword -- super exciting one here. You can give a keyword + classification class (from the first model) and it'll generate an outline based on what it's seen from existing pages ranking in position 1.

I'm working on the first model just now. I'm about to have dinner, then I'll go into detail about the plan for it and share the results once it's fine-tuned.

 
Are you fine-tuning using JSONL, outlining the prompts and responses, or are you training your model with data using Python code?
 
Fine-tuning local models is more like getting a "degree" on a certain "topic".
Though there's no need to over-hype it.

Do you think the average person understands the enormous difference?

You can never over-hype something enough when it's the truth. Without the hyping, people will just think fine-tuning is similar to, or a little better than, using ChatGPT. They won't grasp it.

And besides.. This is blackhatworld. Why would we not hype something? Hyping should be in your blood. It's the core of marketing.

I have nothing to sell here, but I of course want people reading. I'll hype it so I get more eyes. What's the point of writing if no one's going to read it? :) A shared journey is a lot more fun.

Also, fine-tuning isn't really like getting a degree; that's a misconception about fine-tuning. It should never be used for knowledge. Fine-tuning is for pattern recognition, not for teaching facts/knowledge. For facts and knowledge you use a vector DB and embeddings.

Are you fine-tuning using JSONL, outlining the prompts and responses, or are you training your model with data using Python code?

I'm not sure what you're asking..

There's only 1 way to fine-tune with OpenAI. You can't do it with Python. It's done by running:

openai api fine_tunes.create --training_file xxx.jsonl --model the_model --suffix "xxx"

The only direct training you'd do with Python would be if you were manually training something using PyTorch or some other machine learning library.

Or are you asking if I'm fine-tuning an open source model?
 
I'm not sure what you're asking..

There's only 1 way to fine-tune with OpenAI. You can't do it with Python. It's done by running:

openai api fine_tunes.create --training_file xxx.jsonl --model the_model --suffix "xxx"

The only direct training you'd do with Python would be if you were manually training something using PyTorch or some other machine learning library.

Or are you asking if I'm fine-tuning an open source model?
I was asking whether you were using the JSONL file and running the openai command, or training the models using Python libraries. From your response, I guess you're using the command.
 
I was asking whether you were using the JSONL file and running the openai command, or training the models using Python libraries. From your response, I guess you're using the command.

There are no Python libraries for it as far as I can see.

Unless I'm mistaken and you can show me how to do it.
 
My first post is too big to fit in one, so I have to split it up.
------

RESULTS

Get keywords from a page outline: First attempt


Model: Curie
Training examples: 100



Training Data

This first one is a bit sloppy. It's just a very casual initial test to establish a baseline for future experiments with the "get keywords from a page outline" model.

It's 100 examples from a variety of pages, mostly informational/tutorial-style pages. I'll go into detail on what needs to be done for the next iteration of the experiment, and why we first need to train a webpage classification model for the data prep.

The steps
----------------


1) I gave ChatGPT some examples of the Google searches I wanted and then asked it to give me 100 examples.

Searches were like "how do search engines work", "how to start coding", "how to invest in the stock market".
2) Using ahrefs, I manually exported the keywords that the page ranking in position 1 ranks for.
This was the start of my ahrefs drama - https://www.blackhatworld.com/seo/beware-ahrefs-enterprise-api.1501345
I've since found better ways to get the data I need, which I'll go into later.
So I ended up with 100 exported csv files.

3) Next I needed to prepare the data for fine-tuning from the csv files and generate workable outlines for each page.
First I wrote a Python function to get the outline of a web page. I won't share the code for that one, as it's too secret-saucy, but those determined enough will figure it out. "Outline" gives enough away.
Then my program to prepare the data:

import os
import pandas as pd
import json
import re

import sys
sys.path.append('/home/tom/projects/tools')
from get_structure_for_classify_webpage import get_outline

sys.path.append('/home/tom/projects/apis')
from serpstat import get_keywords

# Define the directory path
dir_path = '/home/tom/projects/data/ahrefs keywords for pages'

# Initialize the arrays
keywords = []
outlines = []

count = 0
# Loop through all csv files in the directory
for filename in os.listdir(dir_path):
    print(f"Filename: {filename}")

    if filename.endswith(('.xls', '.xlsx', '.csv')):
        file_path = os.path.join(dir_path, filename)
        df = pd.read_csv(file_path, encoding='utf-16le', sep='\t', header=0)

        df = df[df['Current position'] < 6]
        df = df.sort_values('Volume', ascending=False)
        df = df.head(150)

        if df['Keyword'].count() == 0:
            print("Count is 0")
            continue

        print("Doing outline")
        # Get the outline for the first URL in the DataFrame
        # Assuming that 'Current URL' is a valid URL, if not please adjust accordingly
        outline = get_outline(df.iloc[0]['Current URL'])
        if outline == "ERR":
            print(f"Couldn't get page: {df.iloc[0]['Current URL']}")
            continue
        outlines.append("\n".join(outline))
        outlines[-1] += "\n\n###\n\n"
        print(f"URL: {df.iloc[0]['Current URL']}")

        print("Adding keywords to list")
        # Add keywords to the list
        keywords_tmp = df['Keyword'].to_string(header=False, index=False).split('\n')
        keywords.append("\n".join([x.strip() for x in keywords_tmp]))
        keywords[-1] += " ###"
        keywords[-1] = " " + keywords[-1]


# Creating DataFrame from lists
df_final = pd.DataFrame(list(zip(outlines, keywords)), columns=['prompt', 'completion'])

# Saving DataFrame to json
df_final.to_json('keywords_for_pages.json', orient='records')


To replicate that you need to create your own get_outline() function that returns your own version of a web page outline.

You'd also need to export the csv files from ahrefs of the keywords ranking for each page, store them somewhere, and change the "dir_path" variable to wherever you've stored them, and it will work for you.
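If you just want something to plug in and experiment with, here's a bare-bones get_outline() you could start from. To be clear, this is NOT my version, just the obvious title-plus-headings approach (using requests and BeautifulSoup):

# A bare-bones placeholder for get_outline(). NOT the secret-sauce version,
# just the page title plus the heading tags in document order.
# Returns a list of strings, or "ERR" on failure, matching the script above.
import requests
from bs4 import BeautifulSoup

def get_outline(url):
    try:
        resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()
    except requests.RequestException:
        return "ERR"
    soup = BeautifulSoup(resp.text, "html.parser")
    outline = []
    if soup.title and soup.title.string:
        outline.append(f"TITLE: {soup.title.string.strip()}")
    for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
        text = tag.get_text(" ", strip=True)
        if text:
            outline.append(f"{tag.name.upper()}: {text}")
    return outline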

Summary of the code:-

It keeps only keywords where the page already ranks in the top 5.
It keeps up to a max of 150 keywords per page.

Creates a training prompt like this

WEB PAGE OUTLINE

###


keyword1
keyword2
keyword3
keyword4
keyword5
...
### (The end token. I will use END next time)


The training data from this is stored in keywords_for_pages.json
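For reference, after the prep, each training example ends up as one JSON record shaped something like this (the outline and keywords here are made up, but the separators match the template above):

{"prompt": "TITLE: How To Write Website Copy\nH2: Know Your Audience\nH2: Keep It Simple\n\n###\n\n", "completion": " website copy\nhow to write website copy\nweb copy ###"}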


4) Finally, we train!

The commands used are :-

A) openai tools fine_tunes.prepare_data -f keywords_for_pages.json

This prepares the data with OpenAI's tool.

B) openai wandb sync

Sync with wandb. You don't have to use it, but I do; it's a very powerful platform for tracking machine learning training runs. wandb.ai

C) openai api fine_tunes.create -t keywords_for_pages_prepared.jsonl -m curie

Took about 4 minutes and cost something like $0.70. :)



5) We test the model

Best way to test this is to have the new model generate some keywords, then put them into keyword.com and see how many are on page 1.

I put together a streamlit app here to make it all nicer.

Here's the code


import streamlit as st
import openai
from streamlit.components.v1 import html
import pandas as pd

import sys
sys.path.append('/home/tom/projects/tools')
from get_structure_for_classify_webpage import get_outline

# Set your OpenAI API key
openai.api_key = "ENTER YOUR OPENAI API KEY"

def main():
    st.set_page_config(page_title="Generate keywords from a page", page_icon=":robot_face:", layout="wide")

    # Add custom CSS for the green button
    custom_css = """
    <style>
    .stButton>button {
        background-color: #4CAF50;
        color: white;
    }
    </style>
    """
    html(custom_css, width=0, height=0)

    st.title("Generate keywords from a page")

    col1, col2 = st.columns(2)

    with col1:
        st.header("Input")
        model = st.selectbox('Choose the model', get_finetuned_models())
        max_tokens = st.text_input("Max Tokens", 500)
        stop_token = st.text_input("Stop Token", " END")
        temperature = st.text_input("Temperature", 1)
        top_p = st.text_input("Top P", 1)
        url = st.text_input("Page URL")

    with col2:
        st.header("Keywords")
        if st.button("Generate"):
            # float() rather than int() so fractional values like 0.9 work for top_p
            generate_keywords(model, int(max_tokens), stop_token, float(temperature), float(top_p), url)


def generate_keywords(model, max_tokens, stop_token, temperature, top_p, url):
    outline = get_outline(url)
    if outline == "ERR":
        st.write(f"Couldn't get {url}")
    else:
        prompt = "\n".join(outline)
        prompt += "\n\n\n"
        print(f"Outline is {outline}")
        print(f"Model: {model}")
        response = openai.Completion.create(
            engine=model,
            prompt=prompt,
            max_tokens=max_tokens,
            n=1,
            top_p=top_p,
            stop=stop_token,
            temperature=temperature,
        )

        generated_keywords = response.choices[0].text
        lines = generated_keywords.split("\n")
        df = pd.DataFrame(lines, columns=["Keywords"])
        st.dataframe(df, 800, 900)

def get_finetuned_models():
    res = openai.Model.list()
    df = pd.DataFrame(res['data'])
    # Only list models owned by my org; change this to your own org name
    fine_tuned_models = df[df['owned_by'] == "wolf-of-blog-street-inc"]

    return fine_tuned_models['id']


if __name__ == "__main__":
    main()
 
To run that you need to pip3 install streamlit, save the above as "my_streamlit_app.py" and then run: streamlit run my_streamlit_app.py --server.headless true

Now let's do our first example

Page = https://buffer.com/library/how-to-write-website-copy/

Here's a screenshot of the streamlit app running with the result.


Here's the list of keywords the model gave us back.




website copy
website content creation
what is good website copy
what is good web copy
web copy
effective web copy
best web copy
building copy for website
creating content for website
how to write web copy
good website copy
customer copy
how to write good website copy
good copy
good copy for website
copy for website
how to write good stuff for a website
website content creation that's good
what does a writer do for a website
create web copy
writing website copy
writing web copy
creating website content
what is good web text
best practices for web copy
text for website
content marketing for website consists of:
website writing
good web page copy
good page copy
need to have quality website copy
good web content copy
how to write webpage copy
requirements for web copy
client copy
finding the right piece of copy for your business site is hard work.
learn how to create good website content
what is a good web page
web page copy
web page content
how to composition web copy
finding website copy
writing copy for website
best copy
good copy for a website
creating website content that is good
what is effective website copy
effective copy for website
best way to create website copy
how to write website copy that's good
created website content
website content creation practices
web writing
copy for website
copy for web
web site copy
who write website copy
web content for website
writing website content
writing web text
website content stylist
content editor for websites
good articles for websites
writing for a website
a good article for a website
good copywriter
custom copy for websites
winning web copy
best keywords for web copy
elements in marketing relate to copy are
creating great website copy
article writing website
best web content copy
create web copy that is good
make good copy
domain copywriting
the best website copy
buy website copy
writing web copy that is appealing
best practices for webpage copy
writing web page copy
best editor for websites
web content synthesis
website writing are very diverse and can be
moderating for websites
writing for websites can also be challenging
noteworthy website copy
best content for websites
what makes good copy for a website
articles are a very important part of
 
Next we put this into keyword.com and see how many are on page 1.

That's pretty good for a first attempt. 36 of 90 in the top 3 and 41/90 in the top 10.

52 in the top 100 too, but what it shows is that of the 52 good keywords, 41 of those are top 10.

That's 46% of returned keywords in the top 10.

Try doing that with chatgpt. You won't be able to. You can prompt it all fucking day and it'll give you total garbage.


I mean look at this.. It's given us ones that are featured snippets.

Also here's a screenshot showing my ahrefs csv export data directory..

See how when I do ls *lumar*, we can see the export for lumar.io?

Notice when I do ls *buffer*, there's nothing. This site was NOT in the training data. The model has already generalized to pages outside the training set.

And this is CURIE. It's not Davinci!

I'm confident we can even get great results with Babbage and maybe even Ada.

I mean look at these keywords it's given us..

"what is good website copy"
"how to write web copy"
"what is good web copy"
"writing website copy"
"good web content copy"
"writing copy for website"
"creating great website copy"

You'd think this was a keyword tool giving you actual related keywords from the Google Adwords keyword tool, but it's NOT. It's an AI looking at the outline of a webpage and then giving you keywords that it should rank for based on that outline.

THIS is a definitive first step in reverse engineering Google. This simple AI already understands what keywords should be on page 1 for the outline.

Now imagine how far we can go with this. Giving a keyword and getting a perfect outline, perfect on-page. Exact topics to talk about. Exact headlines to use. What to link to. What other articles to create to build topical authority.

Let's try one more page.

Page: https://www.wordstream.com/blog/ws/best-facebook-ads



Keywords it gave :


how to write facebook ads
how to write a facebook ad
write a facebook ad
how to write ads for facebook
writing facebook ads
writing effective facebook ads
best practices for writing facebook ads
writing ads for facebook
how to write facebook marketing ads
writing ads on facebook
how to write a facebook marketing ad
writing ad for facebook
how to be an ad writer for facebook
how to become a better facebook advertising writer
writing effective facebook ads
how to create facebook ads
how to be a better ad writer for facebook
how to write a facebook page post
write an effective facebook ad
how to create facebook ads that get lots more people
write your facebook ads
how to get consumers to click in a facebook ad?
how to write an effective facebook ad
example of a good facebook ad
write a facebook ad post
how to improve facebook ad campaigns
ad writing for facebook
best practices and strategies for writing effective facebook ads
how to improve facebook ads
example of good facebook advertisement
best practice for writing facebook ads
strong headline for facebook ad
how to write facebook ads that work
how to write an effective advert for facebook
facebook ad writing tips
appropriate ad format for facebook
how to create facebook ads that convert
advertisement writing for facebook
facebook ad copy critique for effective copy
how to reduce spam on facebook
facebook ads online content review
how to write effective facebook ads
how to personify in facebook ad
how to write a good facebook ad
best practices in facebook ads
how to become the best facebook ad writer
2dcreative carr lahig cood
ad creating for facebook
what is the best word to use in a facebook ad?
how to improve facebook ads that don't convert?
best facebook ad practices
best practices for writing great facebook ads
writing facebook ads that work
how to write your facebook ad
good ad format for facebook
how to be one of the best facebook advertisers?
facebook ads copy
copywriting a facebook ad
writing facebook ad copy
help writing facebook ads
how to write the best facebook ad
write facebook advertisements
number 1 content for facebook ad
best advertisment for facebook ads
writing an effective facebook ad
facebook ads best practices
best practice for writing a facebook ad
best practices for facebook ad copy
how to write an effective facebook ad post
good ad format for facebook ads
best practices for facebook advertising


Result on keyword.com

It's glitching and taking a while to give me the summary, but you can count them. There are 49 out of 70 in the top 10 this time, and 46 in the top 3. That's INSANE! 70% on page 1.


Look at these, lol..

"how to become a better facebook advertising writer" <-- Wow.

"write a facebook ad"
"write an effective facebook ad"
"how to write your facebook ad"
"writing facebook ads that work"

This thing is in agreement with Google's complex algorithm, with only 100 training examples..

Another interesting thing we can do from this

So this is training for keywords that a page should rank for on page 1.

Imagine we train it for keywords on page 2, then one for page 3.

Why would we do that?


Topical authority..

Take a page on your site and put it into a model trained to give you page 2 keywords (keywords that are fairly related, but not spot on) and you have keyword ideas for building topical authority. Page 3 is the same.

(This works, btw, because we're taking pages for the training data that rank #1 for the major search term.)

If a page has a strong ranking for its main term, then anything it ranks for on page 2 is there not because the page is weak, but because it's a related keyword that belongs in another article.

So the model learns that, for those pages, the keywords on pages 2 and 3 must be slightly related, but not SUPER related (ie, not belonging in the same article).
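In the data-prep script from earlier, that's just a change to the position filter. Something like:

# Keep only keywords where the page sits on page 2 of Google (positions 11-20),
# for the "page 2 keywords" topical authority model
df = df[(df['Current position'] >= 11) & (df['Current position'] <= 20)]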
 
This looks like a promising case study, and considering AI and ChatGPT ain't going anywhere, it's a sweet bonus for all BHW members. Excited to see what twists you'll keep adding. Best of luck!
 
I've been doing similar but not by fine-tuning.
Following!
 
I personally think this is pretty damn clever. I wanna see what happens in the future. Keep rockin
 
Plan for Next Fine Tuned "Get Keywords From a Page Outline"
Ok, so in the first version of this model I just used 100 samples, mostly from informational/tutorial-style pages.
Results were outstanding for a first test.
Now I'm going to scale this one into something insane.
To really make this outstanding, we need to create training samples for multiple different "classifications" of web pages.
Because if you think about it...
The type of keywords that an informational post should rank for, based on its outline, is quite different from those for an ecommerce page.
The next training run is going to have data like this:


Class: CLASS_TYPE

OUTLINE

###

keyword1
keyword2
keyword3
...


Where CLASS_TYPE is one of the following (and I'm open to suggestions for any more that you guys think would warrant a class of their own):

Informational - Regular informational style articles

ecom product - A single product page

ecom category - A list of multiple products

single product review - A review of a single product

best reviews - Multiple reviews/top10/best X page

news - A news article

faq - A FAQ page of some sort

forum - A forum post

tutorial - I want to differentiate between informational posts and tutorials/guides. A tutorial would be like the big guides you find when you type things like "beginners guide to seo" or "how to do content marketing". These are not really 'informational'; they are tutorials, which are different. Informational is more your "how to groom a persian cat" etc.

service - Service pages. Plumbers, seo services.. Any sort of business/person selling a service.

recipe - cooking recipes

That's all I can think of for now, but please make suggestions if you think I've missed something.


Now, that's a total of 11 classes.


In order to get really good accurate training data for this I need an automatic classifier, so the next model I'll train will be a classifier model.

You might ask: don't you still need training data for the classifier?

Yes, I do, but the difference is, ANY OLD PAGE will work for training the classifier. It doesn't have to rank #1 for a major keyword. (Without that #1 ranking, the keywords a page currently ranks for aren't accurate. Because we only use pages ranking #1 for a major keyword, we rule out other factors like lack of backlinks and lack of topical authority. If a page is already ranking #1 for a main keyword, then we know the keywords it ranks for are valid for the page. Compare that with a page that's not ranking: it might be on page 5 for a certain keyword, not because the on-page is bad, but because of other factors like lack of power, lack of topical authority or lack of age. We want to rule these out so that the keyword data we get correlates only with the on-page structure.)

So when training the classifier model I can just take any page that fits into the category.



I'll go into the classifier model in the next post, since it'll be the next to train, but I'm detailing the next iteration of the "get keywords from a page outline" model first, since it follows on from the last result. (For the classifier model I'll add in other classes like legal document, ones that won't naturally rank; but for the keywords model we want only classes that will rank.)


Ok, so..

For each of the classes I want 2000 examples. That's a total of 22,000 training samples, 220 times more than the first one.


Hyper Parameter Issues
I really want to see what happens when we go big with the training samples.

I might need to experiment with learning rate and batch size this time. 22,000 training examples will cost around $300 to fine-tune with curie.

So what I might do first is fine-tune some variations of babbage so I can test.

OpenAI defaults to a batch size of 0.2% of the number of training examples, so for 22,000 that's 44.

Normally a higher batch size is cheaper to train when you're training your own models, but it can lead to overfitting.

I'm not sure how it affects cost on OpenAI. I will find out, though, when I train on babbage. It's super cheap to train: curie is $0.003/1k tokens to train and $0.012/1k to use; babbage is $0.0006/1k to train and $0.0024/1k to use. Davinci is crazy expensive: $0.12/1k just to USE. That's twice as expensive as gpt4-8k, and the same price as completion tokens on gpt4-32k. That should tell you how powerful it really is, though.

I do plan on fine-tuning davinci, but that'll be for content writing. One of the later models will be a fine-tuned davinci for writing content, trained on content that ranks. That's the ultimate fine-tune. It'll blow even gpt4 content out of the water.

So, back to babbage.. Fine-tuning 22k training samples on that (each is about 1k tokens) will be about... **DRUMROLL**


Wait for it...

$13 !

Haha.

So, yeah, with babbage we can try 10 variations on hyperparameters, then see which is best and apply that knowledge to curie. I am assuming that what works for hyperparameters on babbage will carry over to curie. It might not, but this is all part of the research. Worst case, I'll drop $1k-$2k on curie to learn, if I have to. I have a 6-figure research budget, but I don't want to just blow the money too fast. There's a lot to do.
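For anyone wanting to sanity-check those numbers, here's a quick back-of-envelope sketch (ada/babbage/curie training prices as quoted above, davinci's $0.03/1k from OpenAI's legacy fine-tuning pricing; epoch counts are my assumptions):

# Rough fine-tune cost estimator. Prices are $ per 1k training tokens.
TRAIN_PRICE = {"ada": 0.0004, "babbage": 0.0006, "curie": 0.003, "davinci": 0.03}

def finetune_cost(n_examples, tokens_per_example, model, n_epochs=1):
    total_tokens = n_examples * tokens_per_example
    return total_tokens / 1000 * TRAIN_PRICE[model] * n_epochs

print(finetune_cost(22_000, 1_000, "babbage"))            # ~$13 for a single epoch
print(finetune_cost(22_000, 1_000, "curie", n_epochs=4))  # ~$264 at the old 4-epoch default, ie "around $300"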

For non-classification tasks, ie, generating (which is what we're doing here), summarizing, writing content.. things that can't be validated.. OpenAI recommend 1-2 epochs, a lower learning rate and more training samples. That makes sense compared with a classification task, where a higher number of epochs and fewer training samples will work.

An epoch is a run through the training data. You can run through it multiple times; each time you do, the weights get adjusted with gradient descent. Like a human: we don't learn by just reading a book once. We can read a book 2-3 times and learn more. Maybe by 7-8 times it becomes useless, OR, if we do it too many times we get "overfitting", ie, we just learn the exact patterns of the book and can't generalize so well. So we're better off reading 100 books 2-3 times than 10 books 25-30 times :)

The learning rate, at a mathematical level: imagine you have a graph shaped like a V, but curved rather than straight; a cross between a U and a V. You want to find the lowest point, where the error is as close to 0 as possible. If the learning rate is too large, you risk going too far in one direction, then too far in the other, and never reaching that lowest point. The disadvantage of a very small learning rate is slow training. So it needs to be big enough that training isn't stupidly slow, and small enough that you can actually settle into the minimum. That's what training is: billions of mathematical curves, where you need to find the minimum point on each curve, which is where the error is lowest. That's ALL machine learning is at the most fundamental level. It's calculus. You're trying to find the point on the graph where the gradient is 0, ie, gradient descent. And it's done iteratively, through training.

In the real world, a neural network's error surface isn't a simple bell-curve U/V graph like that. It's shaped like this - https://share.getcloudapp.com/2Nup12XR - it has local optima and a global optimum.

If you use a learning rate that's too big, you might end up only reaching a local optimum rather than the global one, so the model doesn't fit, or generalize, as well as it could. Here's an article where you can learn more - https://medium.com/analytics-vidhya/journey-of-gradient-descent-from-local-to-global-c851eba3d367 (if you're mathematically inclined).

You always use a much smaller learning rate when fine-tuning, generally around 10 times smaller. If you use a large learning rate, you risk losing the learned features of the original model.

Oh, and I'll explain batch size and how it relates to the learning rate.

The batch size is the number of training samples processed before the weights are updated, ie, before you do your gradient descent step with your learning rate. Larger batch sizes are faster to train, and according to OpenAI, larger learning rates work well with larger batch sizes.
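To make epochs, learning rate and batch size concrete, here's a toy example in plain Python. One weight, fake data, nothing to do with OpenAI's actual trainer, but the mechanics are the same:

import random

# Toy data following y = 3x, so the ideal weight is w = 3
data = [(x / 100, 3 * x / 100) for x in range(100)]
w = 0.0
learning_rate = 0.5  # too big and w overshoots back and forth; too small and training crawls
batch_size = 8       # gradients are averaged over this many samples per weight update

for epoch in range(2):  # an epoch = one full pass through the training data
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average gradient of the squared error 0.5 * (w*x - y)^2 over the batch
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad  # the gradient descent step

print(w)  # creeps towards 3; fewer epochs or a smaller learning rate leaves it further away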

So what I'll probably do when testing babbage is run the following :-

learning rate multiplier: 0.05, batch size 8
learning rate multiplier: 0.1, batch size 16
learning rate multiplier: 0.2, batch size 32
default, letting it calculate
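On the CLI, those runs would look something like this (same fine_tunes.create command as before, with the hyperparameter flags added; the --suffix is just a label so I can tell the models apart):

openai api fine_tunes.create -t keywords_for_pages_prepared.jsonl -m babbage --learning_rate_multiplier 0.05 --batch_size 8 --suffix "lr005-bs8"
openai api fine_tunes.create -t keywords_for_pages_prepared.jsonl -m babbage --learning_rate_multiplier 0.1 --batch_size 16 --suffix "lr01-bs16"
openai api fine_tunes.create -t keywords_for_pages_prepared.jsonl -m babbage --learning_rate_multiplier 0.2 --batch_size 32 --suffix "lr02-bs32"
openai api fine_tunes.create -t keywords_for_pages_prepared.jsonl -m babbage --suffix "defaults"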

Then test to see which is best and apply that to the finetune with curie.

So.. the next job that needs to be done before this can be fine-tuned is the classifier model.
 
Plan for Next Fine Tuned "Web Page Classifier"


For this one we'll have training samples that look like this :-


PAGE OUTLINE

###

CLASS


Where class is one of

informational
ecom product
ecom category
single product review
best reviews
news
faq
legal doc
forum
tutorial
recipe
blog category
service
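So a single training line would look something like this (made-up outline):

{"prompt": "TITLE: Persian Cat Grooming Brush | PetShop\nH1: Persian Cat Grooming Brush\nH2: Product Details\nH2: Customer Reviews\n\n###\n\n", "completion": " ecom product"}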

I'll probably come up with some more. I just need to have a look around the web more to see what I've missed.

I'll finetune first on babbage, then curie.

Same strategy as the "get keywords from a page" model where I do 4 finetunes on babbage.

learning rate multiplier: 0.05, batch size 8
learning rate multiplier: 0.1, batch size 16
learning rate multiplier: 0.2, batch size 32
default, letting it calculate


Then apply to curie.

This will give me more data on how this works with classification models compared with conditional generation models.

I'll also do the 4 fine-tunes above for ada, as ada is supposed to be very capable at classification tasks, and it's still 33% cheaper than babbage.


Side Note

The REAL fun is going to begin once I have all these OpenAI models created and have essentially a baseline to validate/test open source models.

Once I have ALL the data and all the OpenAI models trained, I'm going to rent a cluster of 8 H100s for 3 months and go to town fine-tuning it all on the open source models.

I've trained a llama-7b on an A100 cluster with DeepSpeed, so I've got all the working code and techniques ready to fine-tune llama-7b, 13b, 33b and 65b.

Well, technically I'm going to use lit-llama, since that's an Apache 2.0 licensed version of Meta's original weights.

Then.. The one that's got me EVEN MORE excited if that's possible is :-


New Open Source Model That Could Solve Topical Authority

MPT-7B

This one has a context window of 65K

SIXTY FIVE THOUSAND!

Twice gpt4's 32k..

And in testing they got their context window as high as 84k

65k tokens is 40k to 45k words.

That opens up so many possibilities.

Instead of using outlines, we can use ENTIRE ARTICLES.. Including full HTML



OR..

Let's calculate this..

A title for an article is on average 10 tokens.

65k tokens, then, is 6,500 titles.

We could train a model with every article title on a site and get topical authority classifications back.

I've got a method for doing that, but that one is too good. I don't want to share that because there'll probably be some AI SEO company reading this and they'll implement it and get a topical authority classifier before me. Some things I just can't share unfortunately.

I'm about to finish up for the night. It's 4am here. Tomorrow I'll start gathering the data for the classifier model. I'll use BuiltWith to help me gather the samples. I'll go for around 500 samples for each class, so that should be around 6.5k samples total.
 