PHP regex please help me!

terrycody

I don't know programming at all, but I am using a WordPress PHP plugin that seems to use regular expressions to find and replace HTML tags in the article.

I've run into some problems while using it, so please give me some tips if you know how to solve them! Thank you in advance!

Question 1:

Say there are some useless <br /> tags in the article, and I want to remove all of them without affecting other content. How do I remove them? I googled and found this expression:

Code:
</?br(|\s+[^>]+)>

But it's not working! What is the right one?


Question 2:

Say there is a <div> element in your article with a specific ID and class attribute. Inside it there are some nested <div> elements, let's say 12, so the expression to find this DIV and all of its content is:

Code:
<div id="......" class="......".*?>(.*?</div>){12}

And it works! But when another article has the same DIV with a different number of nested elements, let's say 13 this time, this code no longer works and causes problems.

So is there a way to remove exactly this specific DIV element (with this ID and class) and all of its contents, without affecting other parts?


Question 3:

If there is a DIV that starts like this:

<div class="pic-placeholder" style="padding-bottom:66.9%;">

If we want to remove ONLY this DIV tag (matched by its specific ID or class), but NOT the contents inside it, will this code work?

Find: </?div class="pic-placeholder"(|\s+[^>]+)>
Replace: (Leave empty)

Or could you please tell me the right expression?





Thank you!
 
The first one, removing br tags, can be done with strip_tags(). Keep in mind that it strips every tag except the ones you explicitly allow.

For the second one, you have to hook into the theme side where this output is being processed. It looks like the DIV IDs are being allocated dynamically (12, 13, 14 and so on), so writing code that handles it based on the DIV ID will be better.
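To make the strip_tags() behaviour concrete, here is a minimal sketch. The sample string is made up for illustration, and whether the plugin lets you call PHP functions at all is an assumption.

```php
<?php
// Hypothetical sample content, not taken from the real article.
$html = '<p>First line<br />second line</p>';

// Without a second argument, every tag is stripped, not only <br />.
$all = strip_tags($html);

// With an allow-list, only the listed tags survive; <br /> is still removed.
$onlyP = strip_tags($html, '<p>');

echo $all, "\n";    // First linesecond line
echo $onlyP, "\n";  // <p>First linesecond line</p>
```

So strip_tags() only fits this job if you can list every tag you want to keep.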
 
Thank you for your reply!

I don't know coding, so could you give me the exact expressions for these 2?

Btw, for question 2, the ID and class of the <div> are the same in every article on that website, so no need to worry about that. Only the number of items wrapped in it differs. Think of an article navigation bar at the beginning of a post: article 1 may have 7 items, while article 2 has 10. But this DIV element always has the same ID and class. Could you give me the right expression for this?
 
QUESTION 1

Instead of using a regular expression, why not use the str_replace function?


Code:
<?php

echo str_replace("<br>"," ","ashhdkjashdkasd <br> asjdhkjasdkjasd <br>");

?>

Reference to help you understand: https://www.w3schools.com/php/func_string_str_replace.asp

If you want to use regex, then use this one:
Code:
<br\W*?\/>


http://prntscr.com/umtabq
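Since br tags can be written as `<br>`, `<br/>` or `<br />`, one pattern that tolerates all of these may be easier than keeping one regex per variant. A minimal sketch, assuming the plugin uses PHP's PCRE engine; the sample string is invented, and the `~` delimiters and `i` flag belong to PHP's preg_replace, so the plugin's "Find" box may want only the bare pattern `<br\b[^>]*>`.

```php
<?php
// Hypothetical sample covering the common ways a br tag is written.
$html = 'one<br>two<br />three<BR/>four<br class="x">five';

// <br, a word boundary (so <branch> is not matched), any attributes or slash, then >.
$clean = preg_replace('~<br\b[^>]*>~i', '', $html);

echo $clean, "\n"; // onetwothreefourfive
```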
QUESTION 2

Can you please share the exact HTML for this?
I would suggest going with DOM elements and JavaScript. https://www.w3schools.com/js/js_htmldom.asp
You can easily install a custom JS script on WordPress as well, using plugins.

Question 3

Use the function from here

https://stackoverflow.com/questions/5696412/how-to-get-a-substring-between-two-strings-in-php
You have added an extra / in the code:

Code:
<div class="pic-placeholder"(|\s+[^>]+)>

Try this. I suggest cleaning up the class and style attributes instead of removing the div, or it can affect your website structure.

http://prntscr.com/umtc5e
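Removing only the opening tag leaves its closing `</div>` behind, which is how the page structure gets broken. A minimal sketch of an alternative that removes both tags and keeps the content, assuming the placeholder div never contains another div and that the plugin's "Replace" field accepts `$1` as a backreference (both are assumptions; the sample HTML is invented):

```php
<?php
// Hypothetical sample: a placeholder div wrapping an image.
$html = '<div class="pic-placeholder" style="padding-bottom:66.9%;">'
      . '<img src="a.jpg"></div><p>text</p>';

// Capture everything between the opening tag and the first </div>, then keep only that.
// The s flag lets . match newlines in case the div spans several lines.
$pattern = '~<div class="pic-placeholder"[^>]*>(.*?)</div>~s';
$clean   = preg_replace($pattern, '$1', $html);

echo $clean, "\n"; // <img src="a.jpg"><p>text</p>
```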
 
For the 2nd question, as nawafdevil said, you can use str_replace on the whole string to find that particular value and replace it with nothing.

Example:

$find = "whatever you want to find, in your case the DIV tag";
$string = str_replace($find, "", $string);

And since you can't remove all the </div> tags, just use substr to cut off the last one.
Example: $string = trim(substr($string, 0, -6));
 
Thank you so much! I'm grateful we have many experts like you! I guess this is close to the answer I want. I tried the BR one and it's not working, but I guess that's my problem: not all BR tags are written like <br />, some look like this:

Code:
<p><span class="ordered-list__counter-count js-counter-nav" data-list_after_index="3"> 02 </span><br>
<span class="ordered-list__counter-progress">of 10</span></p>

As you can see, the tag here is <br>, so I think this expression needs a bit of tweaking, but I don't know how to change it.

For question 2, yes, here is an example URL:
Code:
https://www.thesprucepets.com/african-clawed-frogs-as-pets-1236809

As you can see, there is a navigator at the beginning. Many of the site's articles have this navigator, and a WordPress scraper plugin has to use an expression like

<div id="......" class="......".*?>(.*?</div>){12}

to remove this part.

The plugin author gave me this and it works! But the total number of tags wrapped inside this navigator is not the same in every article, so this "12" needs to change to handle the unknown count. Here is the opening tag from this URL:

Code:
<div id="toc_1-0" class="comp toc mntl-toc mntl-block" data-offset="70">

Although we can't be sure of the total number of tags wrapped in this DIV, the DIV always has the same ID and class on every page (if it exists), so is there a way to write a better regex?

The plugin author doesn't seem to want to support it anymore lol, even though I bought the license, which really sucks.

Question 3:

Once I have tested this regex I will report back whether it works or not!

Thanks so much for your precious time, guys! Really, really appreciated!


To be honest, I can't understand even one line of your code, so I really have difficulty using these things! That's why I can only mimic expressions that already exist lol
 
I used $string just as an example, since I don't know the actual variable used in the code on your side.

The real solution is not that part, but understanding that str_replace and substr can fix your issue once and for all.
 
@terrycody

Question 1
Code:
<br\W*?\>
Use this regex for <br>, and for <br /> use the one I sent above.

Question 2

Can I get the exact URL to check the currently working regex, and the regex you are using for that URL? I am not that good with regex, but instead of {12}, can you try {*} to match any number and check if it works?
 
Hi mate!


Question 1: I guess it's my problem, these regexes should work. The <br> tags are generated by the WordPress theme after scraping, so I can't remove them. We don't need to worry about it, because there is probably nothing we can do lol.


Question 2:

Yes, this regex:
Code:
<div id="toc_1-0" class="comp toc mntl-toc mntl-block".*?>(.*?</div>){12}
works for this URL:
Code:
https://www.thesprucepets.com/about-fennec-foxes-as-pets-1236778

The author said "There are 12 closing divs for that one, so the rule should be like this", and yes, it works. But when we scrape a similar article which contains a navigator:
Code:
https://www.thesprucepets.com/african-clawed-frogs-as-pets-1236809

It's not working! As we can guess, the total number of closing divs is not 12 anymore! And yes, I tried replacing 12 with your *, which didn't work, and I even put in a random number like 20, which didn't work either.

Ummmmmm, I hope the author takes responsibility for these headaches lol



Thanks again for your kind help ;)
 
Will take a look at this when I get time. Will message you if I am successful.
 
@terrycody The closest I got was this:
Code:
<div id="toc_1-0" class="comp toc mntl-toc mntl-block".*?>(.*?Back to Top)

This pattern misses the final div and gives you the content above it.

This is the one which should be working, but I don't know what's wrong:

Code:
<div id="toc_1-0" class="comp toc mntl-toc mntl-block".*?>(.*?Back to Top<\/button><\/div>)
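One regex-only way to avoid counting closing divs (or relying on the "Back to Top" text) is a recursive PCRE pattern that matches balanced `<div>...</div>` pairs. This is only a sketch under assumptions: the sample HTML below is invented and much simpler than the real page, the group name `inner` is made up, and it only helps if the plugin passes the pattern to PHP's PCRE engine unchanged. The `~` delimiters are for preg_replace and may not be needed in the plugin's field.

```php
<?php
// Hypothetical stand-in for the scraped page; the real markup is larger.
$html = '<p>intro</p>'
      . '<div id="toc_1-0" class="comp toc mntl-toc mntl-block" data-offset="70">'
      . '<div><ul><li><div>item</div></li></ul></div>'
      . '<button>Back to Top</button></div>'
      . '<p>body</p><div>keep me</div>';

// "inner" matches any mix of:
//   plain text, a "<" that does not start a div tag, or a nested <div>...</div>
// and calls itself for each nested div, so the nesting depth does not matter.
$pattern = '~<div id="toc_1-0"[^>]*>'
         . '(?<inner>(?:[^<]++|<(?!/?div\b)|<div\b[^>]*>(?&inner)</div>)*+)'
         . '</div>~';

$clean = preg_replace($pattern, '', $html);

echo $clean, "\n"; // <p>intro</p><p>body</p><div>keep me</div>
```

If the page's divs are not properly balanced, the pattern simply matches nothing and leaves the page untouched, instead of blanking everything.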
 
Dear mate, I think this one is working! Really cool! And yes, the 2nd one does not work: it leaves everything blank. As I said before, replacing that "12" with a larger number such as 20 also leaves everything blank, so no idea lol. If even you have no idea about this, what do I know!

I think you mean the 1st one will leave some useless tags in the HTML file, right? I can see useless elements like:

<p></button></p>
<p id="mntl-sc-block_1-0" class="comp mntl-sc-block mntl-sc-block-html">

But they don't show on the front end (in the article). I even tried another similar URL, so yes, it's okay. We don't need perfection, this is already much, much better. We tried our best lol.

Mate, I don't know how to thank you enough! Really grateful you are here to help!


Please allow me to ask one more question, but never rush yourself, just whenever you have time, are bored, and are willing to tackle this messy code:

Again, in the example URL:
Code:
https://www.thesprucepets.com/african-clawed-frogs-as-pets-1236809

You can see there are 2 related article links inserted:
Related: What Do Frogs Eat?
Related: How to Choose the Right Pet Frog

If we check their HTML:

Code:
<a href="https://www.thesprucepets.com/what-frogs-eat-4584340" id="mntl-sc-block-featuredlink__link_1-0" class=" mntl-sc-block-featuredlink__link mntl-text-link" data-tracking-container="true" data-tracking-id="featured-link-download"><span class="link__wrapper">What Do Frogs Eat?</span></a>

All these inserted related links on this website use an <a> element with this same ID and class! The only difference is that the href changes, of course, like:

Code:
https://www.thesprucepets.com/what-frogs-eat-4584340

https://www.thesprucepets.com/bla-bla-bla

https://www.thesprucepets.com/shit-shit-shit

So now I want to completely remove this thing. I don't even want the plain text, I need to remove everything in this <a> element. So I tried:

Code:
<a id="mntl-sc-block-featuredlink__link_1-0" class="mntl-sc-block-featuredlink__link mntl-text-link".*?>.*?</a>

But it's not working. I guess we should add a general href before it, right? But how do I write the right expression to match all possible URLs in this case?

I guess something like:
Code:
<a href="https://www.thesprucepets.com/ ******whatever?" id="mntl-sc-block-featuredlink__link_1-0" class="mntl-sc-block-featuredlink__link mntl-text-link".*?>.*?</a>

But you know, I don't know how :(
 
Can you try this and let me know if this works?
Code:
(?s)<div id="mntl-sc-block_1-0-42".*?<\/div>
 
It doesn't work, mate. I tried multiple times and everything went blank.

I think we need to work on the <a> element, and like I said, every related link on this website looks like:

Code:
<a
href="https://www.thesprucepets.com/what-frogs-eat-4584340"
id="mntl-sc-block-featuredlink__link_1-0-1"
class=" mntl-sc-block-featuredlink__link mntl-text-link"
data-tracking-container="true"
data-tracking-id="featured-link-download"
><span class="link__wrapper">What Do Frogs Eat?</span>
</a>

The <a> element always starts with "href", then "id", then "class", and the ID and class never change! After checking many URLs, I think there are only 2 versions of the ID:

id="mntl-sc-block-featuredlink__link_1-0-1"
id="mntl-sc-block-featuredlink__link_1-0"

So nothing special, and I tried this regex but it's not working:

Code:
<a href="https://www.thesprucepets.com/*" id="mntl-sc-block-featuredlink__link_1-0-1" class="mntl-sc-block-featuredlink__link mntl-text-link".*?>.*?</a>

But it should work with this structure! I just don't know how to write this part:

Code:
href="https://www.thesprucepets.com/*"

This part must match any URL slug, then it may work! If we figure this out...

So what do ya think, mate?
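A few things in the attempt above would stop it from matching. In a regex, `/*` means "zero or more slashes", not "anything", so the slug needs something like `[^"]*` (any characters up to the closing quote). The real class value also starts with a space (`class=" mntl-sc-..."`), and the pretty-printed tag has line breaks between attributes. A minimal sketch that sidesteps all three, assuming the attribute order href, id, class described above and a PCRE engine; the sample HTML is shortened and partly invented, and the `~` delimiters belong to preg_replace only.

```php
<?php
// Shortened, partly hypothetical sample: two related links and one normal link.
$html = '<p>Before</p>'
      . '<a href="https://www.thesprucepets.com/what-frogs-eat-4584340" '
      . 'id="mntl-sc-block-featuredlink__link_1-0" '
      . 'class=" mntl-sc-block-featuredlink__link mntl-text-link">'
      . '<span class="link__wrapper">What Do Frogs Eat?</span></a>'
      . "<a\nhref=\"https://www.thesprucepets.com/bla-bla-bla\"\n"
      . "id=\"mntl-sc-block-featuredlink__link_1-0-1\"\n"
      . "class=\" mntl-sc-block-featuredlink__link mntl-text-link\"\n"
      . "><span class=\"link__wrapper\">Other</span>\n</a>"
      . '<a href="https://example.com/">keep</a><p>After</p>';

// (?s)      lets . match newlines
// [^"]*     any slug after the domain
// (?:-1)?   accepts both ID versions (_1-0 and _1-0-1)
// [^>]*     skips class and data attributes, whatever their spacing
$pattern = '~(?s)<a\s+href="https://www\.thesprucepets\.com/[^"]*"'
         . '\s+id="mntl-sc-block-featuredlink__link_1-0(?:-1)?"[^>]*>.*?</a>~';

$clean = preg_replace($pattern, '', $html);

echo $clean, "\n"; // <p>Before</p><a href="https://example.com/">keep</a><p>After</p>
```

This removes the link and its text but not the "Related:" label, which sits outside the <a> element in markup the thread does not show.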
 