
PHP script stripping urls down to domain level

Discussion in 'PHP & Perl' started by loopline, Apr 26, 2010.

  1. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,382
    Likes Received:
    1,801
    Gender:
    Male
    Home Page:
    In another thread here on BHW a member posted 2 different scripts.

    The first script pulls URLs from an XML file and saves them to a .txt file each time the script is executed. The page with the XML file changes very often, as it is a log. The only drawback to this script is that it appends the results to the same .txt file, so the file can get very large very quickly.

    The same user shortly thereafter posted a rebuild of the code that creates a new .txt file on each run and increases the number in the file name by 1 each time. Very smart, and it does this part well. However, it somehow strips the URLs down to the domain level, so site.com/post/something.html would end up as just site.com. I want the full URL and I can't figure out what is causing this.

    I contacted the original poster but they aren't responding. I am very grateful for the code they posted, but would just like to have the best of both scripts.

    I know just enough PHP to be dangerous. :rolleyes: I was wondering if anyone here can help me out. I want to make the second script keep the full URL like the first one does. So here are the scripts.

    Script one, full URLs, but appends results to the same file:

    Code:
    <?php
    $url = 'http://blogsearch.google.com/changes.xml?last=120';
    $m= file_get_contents ($url);
    preg_match_all ('/url="(.*?)"/',$m,$match );
    print implode("<br>",$match[1]);
    $file=fopen("links.txt","a+");
    fwrite($file,implode("\r\n",$match[1]));
    fclose($file);
    ?>
    URLs from script one look like this:


    Code:
    http://kalasznikow.pl/firmy/meble,wypoczynkowe,s,1282/
    http://firabercerita.blogspot.com/
    http://www.swiadectwaenergetyczne.edu.pl/tagi/certyfikaty-energetyczne-budynku-poznan,kurs-swiadectw-energetycznych-budynku,certyfikat-energetyczny-budynku,certyfikaty-energetyczne-w-wielkopolsce/strona5/3615.html
    http://cid-de8c539e36e13a39.spaces.live.com/blog/cns!DE8C539E36E13A39!107.entry
    http://www.nflsportsmemorabilia.com/
    http://beccastonemets.blogspot.com/
    http://search.ebay.co.uk/search/search.dll?siteId=3&from=R6&satitle=Josephine+Cox&nojspr=y&customid=Cox&fsoo=2&fsop=32&fbfmt=1&saaff=afepn&sascs=0&sabfmts=0&afepn=5336251050&dfsp=32&ssPageName=RSS:B:SRCH:GB:100
    http://talimotallom1.blogfa.com
    http://ahongryguy.blogspot.com/
    http://melanoma.selfip.com/prescription-medicine-search-engine.asp
    Script two, saves the results of each execution to a different file, but strips URLs down to the domain level:

    Code:
    <?php 
    $url = 'http://blogsearch.google.com/changes.xml?last=120'; 
    $m= file_get_contents ($url); 
    preg_match_all ('/url="(.*?)"/',$m,$match ); 
    $c=file_get_contents("count.txt"); 
    $c=trim($c); 
    $c=$c+1; 
    $count=count($match[1]); 
    echo $count; 
    for($i=0; $i<$count; $i++){ 
    $blog=parse_url($match[1][$i]); 
    $all.=$blog['host']."\n";} 
    $filename = 'blogs_'.$c.'.txt'; 
    $fp = fopen($filename,"w"); 
    fputs($fp,$all); 
    fclose($fp); 
    $fc= fopen('count.txt',"w"); 
    fputs($fc,$c); 
    fclose($fc); 
    ?>
    The URLs from script two look like this:

    Code:
    somnial.spaces.live.com
    www.prosty.e-bussiness.tk
    www.yourmemories.ro
    wheretoeatclub.com
    downrightnow.com
    www.allbusiness.com
    walklikeaboy.blogspot.com
    proudd-mary-keeps-on-burning.blogspot.com
    www.yatech.pl
    neowebsite.net
    Thoughts?

    Thanks in advance for your time, I know you all are very busy.

    MAtt
     
  2. Petrel

    Petrel Registered Member

    Joined:
    Aug 27, 2009
    Messages:
    62
    Likes Received:
    10
    The second script uses the function parse_url to extract the host.
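
    To see why only the domain survives, here is roughly what parse_url hands back for one of the URLs from script one's output. This is just a minimal sketch; the variable names $sample and $parts are made up for the example and are not in either script.

```php
<?php
// Hypothetical sample: one of the URLs listed in script one's output.
$sample = 'http://melanoma.selfip.com/prescription-medicine-search-engine.asp';

// parse_url() splits a URL into its components and returns them as an array.
$parts = parse_url($sample);

echo $parts['scheme'], "\n"; // http
echo $parts['host'], "\n";   // melanoma.selfip.com
echo $parts['path'], "\n";   // /prescription-medicine-search-engine.asp
```

    Script two only ever appends $blog['host'], so the path (and any query string) never reaches the output file.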

    $blog=parse_url($match[1][$i]); <-- delete this line


    Replace the following:
    Code:
    $all.=$blog['host']."\n";} 
    with:
    Code:
    $all.=$match[1][$i]."\n";} 
    Since "$match[1][$i]" is what gets passed to parse_url, it must already hold the full URL. No?
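
    Put together, the changed loop might look something like the sketch below. The function name collect_urls and the sample string are made up for illustration (the real script fetches the feed with file_get_contents), and the sample is only a stand-in, not the actual layout of changes.xml.

```php
<?php
// Hypothetical helper: pull every url="..." attribute out of the feed text
// and return the full URLs, one per line. No parse_url(), so nothing is stripped.
function collect_urls(string $xml): string
{
    preg_match_all('/url="(.*?)"/', $xml, $match);

    $all = ''; // initialise before appending to avoid an undefined-variable notice
    foreach ($match[1] as $fullUrl) {
        $all .= $fullUrl . "\n";
    }
    return $all;
}

// Made-up sample input standing in for the downloaded feed.
$sample = '<weblog name="a" url="http://site.com/post/something.html" />'
        . '<weblog name="b" url="http://firabercerita.blogspot.com/" />';

echo collect_urls($sample);
```

    The returned string can then be written out with the same fopen/fputs/fclose lines and count.txt counter that script two already uses.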
     
  3. loopline

    Pure genius my friend. You have my thanks, it worked like a charm!
     
  4. loopline

    Thanks M.A.D., but that's exactly what I DIDN'T want it to do. Sorry if I was unclear. Thanks for taking the time to post and for making the instructions so clear. I appreciate your time and your original share, which is related to this code. I already gave you rep, or I would give you some more. :)

    MAtt