PHP script stripping urls down to domain level

loopline · Apr 26, 2010

In another thread here on BHW, a member posted two different scripts.

The first script pulls URLs from an XML file and saves them to a .txt file each time the script is executed. The page with the XML file changes very often, as it is a log. The only drawback to this script is that it appends the results to the same .txt file, so the file can get very large very quickly.

The same user shortly thereafter posted a rebuild of the code. It creates a new .txt file on each run and increases the number in the file name by 1 each time. Very smart, and it does this part well. However, it somehow strips the URLs down to the domain level, so site.com/post/something.html ends up as just site.com. I want the full URL and I can't figure out what is causing this.

I contacted the original poster but they aren't responding. I am very grateful for the code they posted, but would just like to have the best of both scripts.

I know just enough PHP to be dangerous.

Was wondering if anyone here can help me out. I want to make the second script keep the full url like the first one does. So here are the scripts.

Script one, full URLs, but appends results to the same file:

Code:

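<?php
// Fetch the changes feed and pull out every url="..." value
$url = 'http://blogsearch.google.com/changes.xml?last=120';
$m = file_get_contents($url);
preg_match_all('/url="(.*?)"/', $m, $match);
print implode("<br>", $match[1]);
// Append to links.txt; the trailing line break keeps the next run's
// first URL from being glued onto this run's last one
$file = fopen("links.txt", "a+");
fwrite($file, implode("\r\n", $match[1]) . "\r\n");
fclose($file);
?>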

Urls from script one look like this:

Code:

http://kalasznikow.pl/firmy/meble,wypoczynkowe,s,1282/
http://firabercerita.blogspot.com/
http://www.swiadectwaenergetyczne.edu.pl/tagi/certyfikaty-energetyczne-budynku-poznan,kurs-swiadectw-energetycznych-budynku,certyfikat-energetyczny-budynku,certyfikaty-energetyczne-w-wielkopolsce/strona5/3615.html
http://cid-de8c539e36e13a39.spaces.live.com/blog/cns!DE8C539E36E13A39!107.entry
http://www.nflsportsmemorabilia.com/
http://beccastonemets.blogspot.com/
http://search.ebay.co.uk/search/search.dll?siteId=3&from=R6&satitle=Josephine+Cox&nojspr=y&customid=Cox&fsoo=2&fsop=32&fbfmt=1&saaff=afepn&sascs=0&sabfmts=0&afepn=5336251050&dfsp=32&ssPageName=RSS:B:SRCH:GB:100
http://talimotallom1.blogfa.com
http://ahongryguy.blogspot.com/
http://melanoma.selfip.com/prescription-medicine-search-engine.asp

Script two, saves each run's results to a different file, but strips URLs down to the domain level:

Code:

<?php
$url = 'http://blogsearch.google.com/changes.xml?last=120';
$m= file_get_contents ($url);
preg_match_all ('/url="(.*?)"/',$m,$match );
// read the last file number from count.txt and increase it by 1
$c=file_get_contents("count.txt");
$c=(int)trim($c);
$c=$c+1;
$count=count($match[1]);
echo $count;
$all=''; // start with an empty string so .= has something to append to
for($i=0; $i<$count; $i++){
$blog=parse_url($match[1][$i]);
$all.=$blog['host']."\n";}
// write this run's results to blogs_<number>.txt
$filename = 'blogs_'.$c.'.txt';
$fp = fopen($filename,"w");
fputs($fp,$all);
fclose($fp);
// store the new number for the next run
$fc= fopen('count.txt',"w");
fputs($fc,$c);
fclose($fc);
?>

The urls from script two look like this:

Code:

somnial.spaces.live.com
www.prosty.e-bussiness.tk
www.yourmemories.ro
wheretoeatclub.com
downrightnow.com
www.allbusiness.com
walklikeaboy.blogspot.com
proudd-mary-keeps-on-burning.blogspot.com
www.yatech.pl
neowebsite.net

Thoughts?

Thanks in advance for your time, I know you all are very busy.

MAtt

Petrel · Apr 26, 2010

The second script uses the function parse_url to extract the host.

$blog=parse_url($match[1][$i]); <-- delete this line

Replace the following:

Code:

$all.=$blog['host']."\n";}

with:

Code:

$all.=$match[1][$i]."\n";}

Since "$match[1][$i]" is what gets passed to parse_url, it must already hold the full URL. No?
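
To illustrate, here is a minimal sketch of what parse_url returns for one of the URLs from script one's output. The variable names $full and $parts are made up for this example and do not appear in either script.

```php
<?php
// One of the full URLs listed in script one's output.
$full = 'http://melanoma.selfip.com/prescription-medicine-search-engine.asp';

// parse_url splits a URL into an associative array of its components.
$parts = parse_url($full);

echo $parts['host'], "\n"; // melanoma.selfip.com
echo $parts['path'], "\n"; // /prescription-medicine-search-engine.asp

// Script two saved only $parts['host'], which is why the path was lost.
// Writing the original string instead keeps the whole URL.
echo $full, "\n";
```

So the path is still available in the parsed array, but since the full string is already in hand there is no need to parse it at all.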

loopline · Apr 27, 2010

Petrel said:
The second script uses the function parse_url to extract the host.

$blog=parse_url($match[1][$i]); <-- delete this line

Replace the following:

Code:

$all.=$blog['host']."\n";}

with:

Code:

$all.=$match[1][$i]."\n";}

Since "$match[1][$i]" is what gets passed to parse_url, it must already hold the full URL. No?

Pure genius my friend. You have my thanks, it worked like a charm!
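
For anyone finding this later, here is a rough sketch of script two with that change applied. It is not the original author's code: extractUrls and saveUrls are hypothetical helper names, the sample feed text is made up so the sketch runs without hitting the live feed, and the type hints assume PHP 7 or newer. It also treats a missing count.txt as 0, which is an assumption about the desired first-run behavior.

```php
<?php
// Hypothetical helper: pull every url="..." attribute value out of the feed text.
function extractUrls(string $xml): array {
    preg_match_all('/url="(.*?)"/', $xml, $match);
    return $match[1];
}

// Hypothetical helper: write the URLs to blogs_N.txt inside $dir, where N is
// one more than the number stored in count.txt (0 if count.txt is missing).
function saveUrls(array $urls, string $dir): string {
    $countFile = $dir . '/count.txt';
    $c = is_file($countFile) ? (int) trim(file_get_contents($countFile)) : 0;
    $c = $c + 1;

    $all = '';
    foreach ($urls as $u) {
        $all .= $u . "\n"; // full URL, no parse_url
    }

    $filename = $dir . '/blogs_' . $c . '.txt';
    file_put_contents($filename, $all);
    file_put_contents($countFile, (string) $c); // remember the number for the next run
    return $filename;
}

// Made-up sample standing in for the changes.xml response.
$sample = '<weblog name="a" url="http://site.com/post/something.html" />'
        . '<weblog name="b" url="http://firabercerita.blogspot.com/" />';

// Use a scratch directory so the demo does not touch real files.
$dir = sys_get_temp_dir() . '/blogs_demo_' . uniqid();
mkdir($dir);

$file = saveUrls(extractUrls($sample), $dir);
echo file_get_contents($file);
```

To run it against the real feed, the sample string would be swapped for file_get_contents on the changes.xml URL and $dir for the folder where the files should go.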

loopline · Apr 27, 2010

M.A.D said:

Trims to "example.com"

Code:

<?php
//strip to root domain function
function stringDomain($string){ 
    $d = explode('/',$string); 
    return str_replace('www.','',$d[2]); 
} 

//Get content
$m= file_get_contents ('http://blogsearch.google.com/changes.xml?last=120');
preg_match_all ('/url="(.*?)"/',$m,$match );

//open text file
$file=fopen("links.txt","a+");

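//loop over the URLs, trim each one with the function and save it to the text file
foreach ($match[1] as $link){
    //double quotes so \r\n is written as a real line break, not as literal characters
    fwrite($file,stringDomain($link)."\r\n");
}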

//close the text file
fclose($file);
?>

Cheers.

Thanks M.A.D., but that's exactly what I DIDN'T want it to do. Sorry if I was unclear. Thanks for taking the time to post and for making the instructions so clear. I appreciate your time and your original share, which this code is related to. I already gave you rep, or I would give you some more.

MAtt
