[Journey] 5 x AI Sites | 5 Different Approaches | 1 Million Cumulative UVs/Month

Is site #1 the classic PAA site like we used to see in the early days?
 
I'm also getting my hands on an interlinking script. Let's see how it goes.
My goal is to take it a step further so that auto-linking fits nicely alongside manual links, letting me use it on both auto sites and white-hat semi-manual sites.

I remembered you posted a thread about this, so I'm here again for some inspiration.

Thanks for the great share. BlogPro.

Edit:

Your post is the reason I bumped this thread.

Did you make any progress with the script?

From your process description, you did not mention anything about locating keywords in the content.
Does this mean you insert a link between paragraphs, like the inline related posts plugins do, rather than a standard inline link such as <p>hello <a>this is the link</a> end of text</p>, the way Link Whisper does?

See the first point in that same post - that's where the extraction happens. Maybe I'll draw a mindmap to explain it better; a wall of text isn't necessarily the best way to explain this.

And no, I don't use inline related posts - I am replacing key phrases with internal links within the Silo.

My problem is that this is very resource-intensive right now, so I'm trying to figure out how to do it better.
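For anyone curious what that phrase-to-link replacement step can look like in practice, here's a minimal Python sketch (my own illustration, not the author's script; the function name and the skip-existing-links rule are my assumptions):

```python
import re

def insert_internal_link(html: str, phrase: str, url: str) -> str:
    """Replace the first standalone occurrence of `phrase` with an
    internal link, skipping text that is already inside an <a> tag."""
    # Split on existing anchor tags (kept via the capturing group)
    # so we never nest a link inside a link.
    parts = re.split(r'(<a\b.*?</a>)', html, flags=re.S | re.I)
    pattern = re.compile(r'\b' + re.escape(phrase) + r'\b', re.I)
    for i, part in enumerate(parts):
        if part.lower().startswith('<a'):
            continue  # already a link, leave untouched
        new, n = pattern.subn(f'<a href="{url}">{phrase}</a>', part, count=1)
        if n:
            parts[i] = new
            break  # only link the first occurrence
    return ''.join(parts)

html = '<p>Read our guide on losing weight and more about losing weight here.</p>'
print(insert_internal_link(html, 'losing weight', '/lose-weight-guide/'))
```

The expensive part the author mentions is not this replacement itself but deciding *which* phrase and *which* target page, which is where the similarity scoring comes in.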


So... just asking about the elephant in the room: are you gonna share some of those automation scripts?

I've spent the last 4-6 months trying to build a viable content creation script to scrape sources, rewrite, and re-post... with nothing but failure.

Would kill to get a little insight.

While I don't think I'll be sharing the whole script, I have shared, and will continue to share, scripts that can help people.

I have shared a language-model-based similarity detection script (not a deduplicator), as well as a language-model-based keyword clustering script.

A lot of BHW members use them as part of their workflow.

Is site #1 the classic PAA site like we used to see in the early days?

Yes, absolutely classic PAA. But designed to look very pretty. Overdesigned even.

All my PAA sites are flying under the radar for now, at least until someone on Twitter spots them and tags someone at Google.

//

Update-wise:

The sites have pretty much plateaued. They're doing well - traffic isn't increasing, but it's not decreasing either - so we'll let them be for some time.
 
Yes, I did it - just like manual link building.

With some randomness and trying to make it natural.

I typed out a long answer but deleted it because I thought I couldn't make myself clear.

To make it short:

I was thinking like you did, trying to figure out the best internal link with the best/highest cosine similarity based on the whole paragraph or entire h2/h3 section text.

It looks right in theory, but in reality, people link whenever they feel like it. They don't take the best anchor for internal linking.

For example, say you have an article titled "best weight loss guide" and a money article titled "how to lose weight", and you want to link from the guide to "how to lose weight".

You might choose anchors such as "lose weight", "losing weight", "weight loss"... and so many other anchors which look appropriate for linking to the "how to lose weight" page.

I was thinking like you did, trying to figure out the best heading/section with the highest cos similarity to make the interlink.

But I've been thinking about it for weeks. In reality, people don't choose the "best" anchor for internal linking.

If the anchor "lose weight" exists 10 times in the article, people might just choose randomly or pick whichever looks nice as the anchor.

So the key point is to determine the anchor. The cos similarity is far less important.

If I have a site about mobile phone tech and want to link to a page about Apple phones, then even when I am talking about real apple trees, I can make "apple tree" the anchor linking to the Apple company page, because it makes sense.

So I finally realized that the most important thing is the anchor itself, not the contextual cos similarity.


So my process is:

1) I made a script to find all possible anchor texts in a paragraph.

2) For each anchor, if multiple paragraphs contain the same anchor, I sort them by paragraph similarity.

3) I then run a PHP algorithm in WordPress to choose the best anchor based on versatility, similarity score, manually linked pages, and so on.

4) If I have already made a manual link to some page, I ignore that target page in my auto-link process.
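A rough Python sketch of those four steps (the function names and the bag-of-words cosine similarity are my stand-ins for illustration; the real scripts presumably use proper embeddings):

```python
from collections import Counter
import math

def bow(text: str) -> Counter:
    """Bag-of-words vector; a toy stand-in for a real embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_anchor(paragraphs, anchor, target_title, manually_linked=()):
    """Steps 1-4 sketched: among paragraphs containing `anchor`,
    pick the one most similar to the target title, unless the
    target page was already linked manually."""
    if target_title in manually_linked:          # step 4: respect manual links
        return None
    candidates = [p for p in paragraphs if anchor in p.lower()]  # step 1
    if not candidates:
        return None
    title_vec = bow(target_title)
    # steps 2-3: rank candidate paragraphs by similarity to the target
    return max(candidates, key=lambda p: cosine(bow(p), title_vec))

paras = [
    "Losing weight starts with a calorie deficit.",
    "Many people ask about losing weight fast before summer.",
]
print(pick_anchor(paras, "losing weight", "losing weight fast guide"))
```

Step 3's real scoring (versatility, manual-link awareness) lives in WordPress PHP per the author; this only shows the shape of the decision.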

Sorry for not making myself clear - it's the weekend and I had a good drink.

I think what we got wrong is that we always assumed cosine similarity matters most. In reality, it does not - the most important thing is choosing the proper anchors.
 
Woke up the other day and realized I really made a lot of typos and was talking nonsense while drinking yesterday.

Here is the edited reply.

1) I have an article "lose weight guide" and want to auto-link to another page, "how to lose weight".
2) In the article "lose weight guide", I might choose anchors like "losing weight", "lose weight", "how to lose weight", etc. They can all be proper anchors for linking to the page "how to lose weight".
3) But each of those anchors might occur multiple times in the article; it's normal for the key phrase "losing weight" to show up, say, 10 times.
4) If you only run cosine similarity using the anchors against other articles, every occurrence gets the same score.
5) So the cosine similarity of the entire h2 section/paragraph only matters when you need to choose which one of the 10 "losing weight" occurrences to use, as you want the anchor with the most related contextual juice with respect to the target page.
6) That's basically it. The key point is to add randomness. If you don't, you will find the same anchor pointing to the target page every time, because it always has the same highest cosine score. You need variation in anchor choice to make it look natural.
7) To make it extensible and also work on white-hat sites - where you might want to manually add or change links to your money pages - you have to build the auto-detect and auto-linking part in WordPress PHP. If you do it in Python with all internal links hardcoded, you lose the ability to change them manually afterward. Only do the anchor/AI heavy lifting in Python; once you decide on the anchors and target pages, store them in post custom fields.

So the hard part is picking the most natural anchors a real person would use, with all kinds of variation - like "adding a countertop" or "cost of kitchen cabinets". If you only pick nouns and entities as anchors, it looks non-human.
The second hard part is the auto-linking algorithm in PHP, if you want it to stay compatible with manual link edits in the future. If you don't need that and only want to set up an auto-site and leave it, you can hardcode the internal links, which is far easier.
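The randomness point in 6) can be sketched like this: instead of always taking the anchor with the top cosine score, sample among the candidates weighted by score, so the chosen anchor varies naturally across the site (the candidate list and scores below are made up for illustration):

```python
import random

# Candidate anchor variants for one target page, each paired with the
# cosine score of its surrounding section (illustrative values).
candidates = [
    ("losing weight", 0.91),
    ("lose weight", 0.88),
    ("how to lose weight", 0.86),
]

def choose_anchor(candidates, rng=random):
    """Score-weighted random choice: high-scoring anchors are picked
    more often, but never deterministically."""
    anchors = [a for a, _ in candidates]
    weights = [s for _, s in candidates]
    return rng.choices(anchors, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducible output
print({choose_anchor(candidates, rng) for _ in range(50)})
```

Once chosen, the anchor/target pair would be stored in a post custom field (per point 7) so the WordPress PHP side can render or override it later.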
 
Thanks for the detailed explainer @4440

I'll write about how it is handled at my end.

There are a few things happening

1. Every piece of Content I create has a primary keyword - this is usually stored in Yoast - as the content is uploaded on the site.

2. I experiment a lot with Keyword in title and keyword not-in-title - so I figured simply running a similarity against the title may not serve my purpose.

3. The system performs a Topical Resonance Analysis and an N-gram analysis to extract key contextual topics from our prose and snippets minus the headings.

4. It then runs an N-Gram Analysis on the titles and sub-headings and keywords now.

5. These are both saved in separate dataframes.

6. It then runs a similarity analysis between the two. Anywhere the similarity is > n (n being a pre-decided threshold), the corresponding prose gets a hyperlink.
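As a toy illustration of steps 3-6 above (I'm substituting simple word-overlap Jaccard similarity for the author's unspecified "Topical Resonance Analysis" and similarity metric, and the threshold value is made up):

```python
def ngrams(text, n=2):
    """Step 3/4: extract word n-grams as candidate topics."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def jaccard(a, b):
    """Stand-in similarity: word overlap between two phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

THRESHOLD = 0.4  # the pre-decided "n" from step 6

# Step 5: the two candidate sets (normally two dataframes).
prose_topics = ngrams("daily brushing habits prevent tooth decay", 2)
title_topics = ngrams("how brushing habits protect your teeth", 2)

# Step 6: any pair above the threshold becomes a link candidate.
links = [(p, t, jaccard(p, t)) for p in prose_topics for t in title_topics
         if jaccard(p, t) > THRESHOLD]
print(links)
```

In the real pipeline the prose side would get the hyperlink wherever a pair clears the threshold.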

//

Randomisation is already at play. The script first runs an analysis on the entire site, then randomly introduces links, with the quantity of both outbound and inbound links pre-configured.

Finally, before links are paired, the script takes the publish date into account (I work a lot with scheduled content) - you don't want links pointing to future posts.

Post IDs, outbound links already created, and inbound links are stored in a JSON DB.

This DB is referenced first when new content is added to the site and the script is run again.
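A minimal sketch of what consulting such a JSON DB before pairing links might look like (the field names, the outbound cap, and the dates are my assumptions, not the author's schema):

```python
# Hypothetical link DB keyed by post ID, as it might be loaded from JSON.
db = {
    "101": {"publish": "2024-01-10", "outbound": ["205"], "inbound": []},
    "205": {"publish": "2024-06-01", "outbound": [], "inbound": ["101"]},
}

def can_link(db, src, dst, max_outbound=3):
    """Check stored state before pairing two posts: no duplicate links,
    outbound quota respected, and never link to a post scheduled after
    the source was published (ISO dates compare correctly as strings)."""
    s, d = db[src], db[dst]
    if dst in s["outbound"]:
        return False  # link already exists
    if len(s["outbound"]) >= max_outbound:
        return False  # outbound quota reached
    # don't link to a future (scheduled) post
    return d["publish"] <= s["publish"]

print(can_link(db, "101", "205"))  # already linked -> False
print(can_link(db, "205", "101"))  # older target  -> True
```

When new content is added, it would get a fresh entry in the DB and the checks run again, which matches the re-run behaviour described above.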

//

I am learning embeddings and intend to use them heavily in my future interlinking endeavours.

2-3 days ago, OpenAI introduced reduced pricing for their embedding models, which I found very interesting.

I ran a small test (non-OpenAI) on the below paragraph

Code:
Sharky, the tooth-brushing superhero, is a vibrant and dynamic character adored by children and adults alike for his unique approach to oral hygiene. Resembling a friendly shark with a dazzling smile, Sharky patrols the deep seas of Dentalville, armed with his indestructible Toothbrush Trident and his invincible Floss Lasso. His mission is to battle the nefarious Plaque Monsters and their leader, Cavity Creep, who constantly plot to spread tooth decay among unsuspecting citizens. 

Sharky's superpowers include generating fluoride foam and a sonic wave brushing technique that leaves teeth sparkling clean. His strength, agility, and unyielding commitment to dental health inspire kids around the world to maintain good brushing habits. Sharky is not just a hero; he is a guardian of smiles, ensuring that every child he encounters understands the importance of keeping their teeth clean and healthy. His catchphrase, "Brush twice a day and keep the cavities away!" echoes in the hearts of his young admirers, making Sharky a true superhero in the world of dental care.

And this is what it managed to extract simply employing embeddings.

Code:
- Sharky, the tooth-brushing superhero
- oral hygiene
- Dentalville
- Toothbrush Trident
- Floss Lasso
- Plaque Monsters
- Cavity Creep
- tooth decay
- fluoride foam
- sonic wave brushing technique
- dental health
- brushing habits
- guardian of smiles
- keep the cavities away

Pretty happy with the context extracted. What do you think?
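For context, keyphrase extraction with embeddings usually works KeyBERT-style: embed the whole document and each candidate phrase, then keep the candidates whose vectors sit closest to the document vector. Here's a runnable toy version where a character-trigram bag stands in for a real embedding model (swap in real vectors, e.g. from an embeddings API, for actual use; nothing here is the author's implementation):

```python
from collections import Counter
import math
import re

def embed(text):
    """Toy stand-in for a real embedding model: a bag of character
    trigrams. Replace with actual model vectors in production."""
    t = re.sub(r"\W+", " ", text.lower())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cos(a, b):
    dot = sum(a[k] * b[k] for k in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) or 1.0)

def top_phrases(doc, candidates, k=3):
    """KeyBERT-style selection: rank candidate phrases by similarity
    of their embedding to the whole-document embedding."""
    dvec = embed(doc)
    return sorted(candidates, key=lambda c: cos(embed(c), dvec),
                  reverse=True)[:k]

doc = ("Sharky's superpowers include generating fluoride foam and a sonic "
       "wave brushing technique that leaves teeth sparkling clean.")
cands = ["fluoride foam", "sonic wave brushing technique", "pizza recipes"]
print(top_phrases(doc, cands, k=2))
```

The interesting part is the candidate generation step (noun chunks, n-grams, or full phrases), since that determines whether verb-bearing anchors like "keep the cavities away" ever make it into the pool.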
 
Hey all,

So - I have been building auto-generated / AI / scraped sites for a long time now.

I train my own NLP models to generate text on the fly and produce contextually relevant content. I scrape all day, every day. I have tackled the hardest of niches for the longest of tails.

My sites have been shared on this forum, as well as Reddit and even a couple Russian forums (really proud of that last one).

To read more about me - check the intro post on my Ask Me Anything thread

This journey comes as a sort of challenge from a couple of fellow webmasters. We were brainstorming and comparing notes on where we were in our AI / auto-generation journeys when we decided to launch brand-new projects and document them as we go.

New Slack channels were created instantly, and we set about it. I decided to share and document my journey on this forum as well.

The Journey

I will build 5 sites in total, all starting from scratch. Each site will be different from the others. I'll try to lay out everything about them below.

I have built custom tools, scripts and APIs, which either do all the job from scraping to posting, or are fragmented to do one part of the job.

I'll be using WordPress except for Site # 4 - which is a custom script I put together once to test a prototype.

Monetization

I run several websites - both whitehat and blackhat. I have primarily been doing PPC, CPA, lead-gen, and a little AdSense. This journey will finally push me to start using display ads.

Let's begin -

_________________________________________________________________________

Site # 1

Site Type -
PAA Only. No rewriting. (Augmented in a few places using AI)

Featured Images - No.

Domain Type - Fresh, Not Registered Before

Current Progress - See below

Indexing API - Yes

----

Site # 2

Site Type
- Fine-tuned AI model generated content + a sprinkling of rewritten PAA using my custom paraphraser. Heavily augmented using AI. (Tons of unique semantically relevant AI Content added to every WordPress Post)

Featured Images - Yes, beautiful unique custom images retrieved from APIs, modified and parsed by my system and posted to the article.

Domain Type - Expired, re-registered. Currently has 20 AI generated posts, all indexed and ranking.

Current Progress - See below

Indexing API - No

----

Site # 3

Site Type -
A unique twist on PAA setup, that I've had some success with. Unable to reveal more.

Featured Images - No

Domain Type - New, never registered before.

Current Progress - See below.

Indexing API - Yes

----

Site # 4

Site Type -
Rewritten PAA site - all answers are paraphrased. (Does not follow the traditional WordPress format)

Featured Images - No

Domain Type - New, never registered before.

Current Progress - See below.

Indexing API - Yes

----

Site # 5

Site Type -
Pure AI Generated Content. No PAA, No questions, No scraping. Just AI generated content.

Featured Images - Yes, beautiful unique custom images retrieved from APIs, modified and parsed by my system and posted to the article.

Domain Type - New, never registered before.

Current Progress - See below.

Indexing API - No

_________________________________________________________________________

Current Progress

Site # 1


- Domain Registered
- Keyword Research done
- 178K keywords extracted, sanitized

//

Site # 2

- Domain Registered (Site was already live)
- Keyword Research done
- 116K keywords extracted, sanitized, grouped into categories

//

Site # 3

- Domain Registered
- Keyword Research done
- 180K keywords extracted, sanitized, grouped into categories

//

Site # 4

- Domain Registration Pending
- Keyword Research done
- 160K keywords extracted, sanitized, grouped into categories

//

Site # 5

- Domain Registered
- Keyword Research done
- 129K keywords extracted, sanitized, grouped into categories

_________________________________________________________________________

People also ask

1.
How are you generating content?

I have my own fine-tuned models (both GPT-3 and non-GPT-3). I am always looking at datasets and training models.

2. What do you use for Paraphrasing?

This depends on my end needs, but I end up using T5 a lot - obviously with a lot of custom-trained models. If you're looking to get started with paraphrasing, read my post here - https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality

Beyond this, @Cognitive has a lovely post on extracting semantics of an article and then using AI to further enhance it. I have something similar that I work with, not to his scale - but it does the job well.

3. Do you connect GSC/GA to this site?

Yes, all my sites have GSC albeit with different accounts. I often alternate between GA and Matomo

4. What language are your scripts in?

Python mostly. Some node.js.

5. Where are your scripts hosted?

I have a couple of massive GPU setups in the office to run text transformers. I also have a few servers at Vultr and AWS.

6. Will you sell your script/setup?

Nope.

7. What about backlinks?

I build social signals (automated) and Web 2.0 - when launching a new site. That's the extent of it for my BH sites.

8. What niche are you working on?

Prefer not to talk about that.

9. I have more questions

Respond here, I'll answer where I can.

_________________________________________________________________________

Also

Three incredible people doing stuff with automation are @Sartre, @spectrejoe and @Preon - you should definitely follow their journey and many of your questions will be answered.

Follow @Sartre's journey here - https://www.blackhatworld.com/seo/j...sing-ai-generated-content-lets-do-it.1360940/

Follow @spectrejoe's journey here - https://www.blackhatworld.com/seo/s...-content-to-700-month-self-coded-bot.1323860/

Follow @Preon's journey here - https://www.blackhatworld.com/seo/s...ontent-to-100-000-page-views-a-month.1313311/

_________________________________________________________________________

Updates

I'll be updating this thread once or twice a week. I'll be answering questions more frequently though.

These are not the only projects I'll be working on. The core idea behind automation is scaling. So I need to keep building new sites plus augmenting my existing ones.

_________________________________________________________________________

Let's fucking go!
How are you organizing and cataloging the links in your project?
Wishing you the best as you navigate through your endeavors!
 
And this is what it managed to extract simply employing embeddings.

Code:
- Sharky, the tooth-brushing superhero
- oral hygiene
- Dentalville
- Toothbrush Trident
- Floss Lasso
- Plaque Monsters
- Cavity Creep
- tooth decay
- fluoride foam
- sonic wave brushing technique
- dental health
- brushing habits
- guardian of smiles
- keep the cavities away

I don't know what type of content is primary on your sites. From the list, only "keep the cavities away" contains a verb, so it can be used to link to pages like "how to keep the cavities away" or "methods/tips to keep the cavities away". All the others are nouns or entities, which are only suitable for linking to "what is oral hygiene" or "what is tooth decay" type posts. You can still use the anchor "tooth decay" to link to a page titled "how to prevent tooth decay", but a real human would use an anchor like "preventing tooth decay" instead.

From your example text:
Sharky, the tooth-brushing superhero, is a vibrant and dynamic character adored by children and adults alike for his unique approach to oral hygiene. Resembling a friendly shark with a dazzling smile, Sharky patrols the deep seas of Dentalville, armed with his indestructible Toothbrush Trident and his invincible Floss Lasso. His mission is to battle the nefarious Plaque Monsters and their leader, Cavity Creep, who constantly plot to spread tooth decay among unsuspecting citizens.

Sharky's superpowers include generating fluoride foam and a sonic wave brushing technique that leaves teeth sparkling clean. His strength, agility, and unyielding commitment to dental health inspire kids around the world to maintain good brushing habits. Sharky is not just a hero; he is a guardian of smiles, ensuring that every child he encounters understands the importance of keeping their teeth clean and healthy. His catchphrase, "Brush twice a day and keep the cavities away!" echoes in the hearts of his young admirers, making Sharky a true superhero in the world of dental care.

These bolded anchors are more suitable for linking to "how-to" type posts. You should try to extract more like these, which makes your internal links look exactly like a human's.

EDITED: I see some people use AI to generate the anchor text. So if your sites are fully auto sites that you generate with one click and leave, you can have AI generate or lightly edit the text for you, which makes it way easier. You only need to roughly select the anchor, for example "tooth decay". If you need to link to a post like "how to prevent tooth decay", you can just use GPT to rewrite the paragraphs and change "tooth decay" to "preventing tooth decay" or "to prevent tooth decay", etc.

2) Also, I am not sure how you extract these using only embeddings. I am a little confused.
1. Every piece of Content I create has a primary keyword - this is usually stored in Yoast - as the content is uploaded on the site.

2. I experiment a lot with Keyword in title and keyword not-in-title - so I figured simply running a similarity against the title may not serve my purpose.

3. The system performs a Topical Resonance Analysis and an N-gram analysis to extract key contextual topics from our prose and snippets minus the headings.

4. It then runs an N-Gram Analysis on the titles and sub-headings and keywords now.

5. These are both saved in separate dataframes.

6. It then runs a similarity analysis between the two. Anywhere the similarity is > n (n being a pre-decided threshold), the corresponding prose gets a hyperlink.
I only have a feeling for your process, because I haven't seen your code or any example site, so I can only guess.

My guess is that a post can have multiple topics, so linking using n-grams or topics may not be a good idea - too many topics are just noise. Using only the title is probably the best idea, or use your primary keyword for comparison, since you already store it.

For example, an article titled "brush teeth guide" may have many sub-topics as H2s and H3s, like "why brush teeth", "how to brush teeth", "types of toothbrushes", etc. If you consider all of these topics, they are just noise; the title is enough. If you have an anchor "brush teeth" and run cosine similarity against the title "brush teeth guide", you get a high score, around 0.9. In reality, when people click a link, the title is what they expect the content to be, so I think using the title is enough.

I don't know why you say using the title cannot serve your purpose. Maybe your titles are fancy, with too many CTR words, like "Proven: Click now, how to brush teeth, 2024"? If you use very simple titles like "How to brush your teeth correctly" or "how to brush teeth", or use your primary keywords as the title, the title is accurate enough to be used for comparison.

Still, I don't know your process, and obviously you don't know mine, because we cannot share the exact code here - and it seems we are on totally different paths/methodologies for internal linking. But your way definitely works, because implementing the linking is easy; the only difference is which anchor our algorithms choose.
 

I don't know what type of content is primary on your sites. From the list, only "keep the cavities away" contains a verb, so it can be used to link to pages like "how to keep the cavities away" or "methods/tips to keep the cavities away". All the others are nouns or entities, which are only suitable for linking to "what is oral hygiene" or "what is tooth decay" type posts. You can still use the anchor "tooth decay" to link to a page titled "how to prevent tooth decay", but a real human would use an anchor like "preventing tooth decay" instead.

Topical relation mapping is fine. Grammatical categorisation is great too.

There is no set rule that a verb has to link to a how-to page, or a noun to a what/when/which page. This is true even for manual internal links.

I have thoroughly analysed the internal links of most well ranked sites, and like content and structure, these are highly preferential too.

I follow a hierarchical site system with a rather strict Silo.

When following an internal linking strategy, I ensure sibling, parent, and taxonomy pages are extracted.

My guess is that a post can have multiple topics, so linking using n-grams or topics may not be a good idea - too many topics are just noise. Using only the title is probably the best idea, or use your primary keyword for comparison, since you already store it.

Agreed that a long form post can have multiple sub topics.

I guess I should've named them the linker and the linkee articles.

The n-gram and topical analysis is run on the linker article to extract topics that can be used as anchors. This is what I am using embeddings for, because embeddings let me fine-tune the kind of anchors I want.

These extracted topics are then compared against the other articles' titles and seed keywords. Then factors such as:

- How many total internal links does the article have?
- How many total articles link to the target page?
- Sibling page
- Keyword cluster
- Similarity between the selected anchor and the article it is part of (to prevent keyword cannibalization)
- Cornerstone content links
- Etc.

are considered before the link is created.
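Those factors could be folded into a single score roughly like this (the weights, field names, and inbound cap below are purely illustrative, not the author's actual algorithm):

```python
# Illustrative weights; real factors and values are site-specific.
WEIGHTS = {
    "anchor_similarity": 0.4,   # anchor vs. target title / seed keyword
    "is_sibling": 0.2,          # same Silo level
    "same_cluster": 0.2,        # shared keyword cluster
    "cornerstone_target": 0.2,  # link points at cornerstone content
}

def score_link(candidate, inbound_count, max_inbound=10):
    """Combine the listed factors into one score; reject targets that
    already collect too many inbound links."""
    if inbound_count >= max_inbound:
        return 0.0
    return sum(WEIGHTS[f] * float(candidate.get(f, 0)) for f in WEIGHTS)

good = {"anchor_similarity": 0.9, "is_sibling": 1, "same_cluster": 1}
spammy = {"anchor_similarity": 0.95}

print(score_link(good, inbound_count=2))
print(score_link(spammy, inbound_count=12))  # inbound cap hit -> 0.0
```

Whether the factors combine linearly or act as hard filters (as the inbound cap does here) is a design choice; hard filters are easier to reason about for limits like "max links per page".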

I don't know why you say using the title cannot serve your purpose. Maybe your titles are fancy, with too many CTR words, like "Proven: Click now, how to brush teeth, 2024"? If you use very simple titles like "How to brush your teeth correctly" or "how to brush teeth", or use your primary keywords as the title, the title is accurate enough to be used for comparison.

I am not saying that the title doesn't help. And yes, I experiment a lot with titles for CTR purposes.

I isolate entire Silos to determine how CTR improves based purely on titles.

What I am saying is, having a primary root keyword (that my AI uses several times across the article and seeds other adjacent keywords from) gives me better relational mapping.

Still, I don't know your process, and obviously you don't know mine, because we cannot share the exact code here - and it seems we are on totally different paths/methodologies for internal linking. But your way definitely works, because implementing the linking is easy; the only difference is which anchor our algorithms choose.

You're right in that our approach greatly differs.

It's excellent to discuss this.
 
What I am saying is, having a primary root keyword (that my AI uses several times across the article and seeds other adjacent keywords from) gives me better relational mapping.
1)
Yes, the "primary root keyword" is the best thing to use for comparison / relational mapping.
I mix AI articles with other articles, so I don't have a "primary root keyword" for every article, which is why I use the title - to get consistent results.
In your case, the primary root keyword is indeed the best choice for comparison / relational mapping.


2)
I see you interlink based on your Silo - you mention siblings, cornerstone pages, clusters, and sub-topics a lot.
Theoretically, your way is better for building a knowledge graph of your entire site, so I can't wait to see your future work on this.
 
Have you tried using GPT-4 yet, and what is your experience? I find it qualitatively better sometimes, but not worth the additional price.
 