
Get the content div for my web crawler

Discussion in 'Other Languages' started by mainceaft, Jul 2, 2015.

  1. mainceaft

    mainceaft Regular Member

    Joined:
    Apr 10, 2013
    Messages:
    379
    Likes Received:
    39
    Hi. I'm working on a simple PHP page crawler. I'm still testing it and haven't even reached the step of making it crawl automatically and store pages in a DB.
    The first problem I faced after successfully crawling the first page is detecting the content div.
    I can do this manually by defining the content div's ID or class, but that will take more time for each new site I add, and I'm thinking of adding hundreds of sites to it.
    In short, this is my code:

    Code:
    $html = new simple_html_dom();
    // Note: load() parses an HTML string; use load_file() if $target_url holds a URL.
    $html->load($target_url);
    $divContent = $html->find('div');
    $h = 1; // counter for the divs that get printed

    foreach ($divContent as $e) {
        // Strip the tags, then collapse whitespace.
        // utf8() and rep() are helper functions not shown in this post.
        $Co = utf8(preg_replace('#<[^>]+>#', '', $e->outertext));
        $Co = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $Co);
        if (!rep($Co) && strlen($Co) > 90) {
            echo '<h1 class="h" > ' . $h++ . '</h1>';
            if ($e->id) echo 'ID Is :-  ' . utf8($e->id) . '<br/>' . PHP_EOL;
            if ($e->class) echo '<br/>Class Is :- ' . utf8($e->class) . '<br/>' . PHP_EOL;
            echo '<div class="con"><h2>Content is</h2>' . $Co . '</div>';
        }
    }
    ^^ I arrived at this code after a long search for the most suitable approach, but it's not finished yet, and I'll show you why.
    Here is an example from site X, where the content div's ID is unknown:

    Code:
    <div id="01">
    <h3>The standard Lorem Ipsum passage, used since the 1500s</h3>
    <div class="02">
    <p>"Lorem ipsum dolor sit amet, consectetur adipiscing elit,
    sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. . "</p>
    </div>
    Section 1.10.32 of "de Finibus Bonorum et Malorum", written by Cicero in 45 BC
    <div id="03">
    <div id="03-1">
    ads
    </div>
    "Sed ut perspiciatis unde omnis iste natus error sit voluptatem
     accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo
    </div>
    </div>
    On the first iteration the script strips the HTML and returns the plain text of div 01 together with all the divs nested inside it.
    The second iteration returns the plain text of div 02. The third returns the text of div 03 and div 03-1.
    Here is a drawing of the same code:
    http://s16.postimg.org/c2ixl33np/rect3399.png
    That was a simple example; in a real test it returns 50+ divs from the whole page content.

    I can't post this question on Stack Overflow, as they close every BH-related question, and I don't have accounts on many other web forums.
     
    Last edited: Jul 2, 2015
  2. bretonel

    bretonel Junior Member

    Joined:
    Jun 27, 2015
    Messages:
    125
    Likes Received:
    15
    Occupation:
    programmer
    Location:
    the inter nets
    PHP might not be the best tool for this job.
    Try Node.js maybe? (in Zoidberg's voice)
     
  3. mainceaft

    mainceaft Regular Member

    Joined:
    Apr 10, 2013
    Messages:
    379
    Likes Received:
    39
    In fact I know nothing about Node.js, and it would surely take a lot of time to learn it, so I'll stick with what I've got.
    Anyway, I think the solution is to separate the divs into an array without Simple HTML DOM.
     
  4. bretonel

    bretonel Junior Member

    Joined:
    Jun 27, 2015
    Messages:
    125
    Likes Received:
    15
    Occupation:
    programmer
    Location:
    the inter nets
    You don't need to post this question to SO as a BH question. It's a general programming question.
    The thing is, it seems you need more in-depth knowledge of your tools.
     
  5. bretonel

    bretonel Junior Member

    Joined:
    Jun 27, 2015
    Messages:
    125
    Likes Received:
    15
    Occupation:
    programmer
    Location:
    the inter nets
    For example, I see that you're extracting the contents of your div using regular expressions. This is kind of a no-no in the world of HTML parsing.
    You need to find the right function call to extract the inner HTML and the outer HTML of your tag.
    Like $element->ownerDocument->saveHTML($child) ...
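    A minimal sketch of that call using PHP's built-in DOM extension. This is an assumption: the thread does not say which parser is meant, and the sample markup and variable names are invented for illustration.

```php
<?php
// Sample markup, invented for this sketch.
$source = '<div id="01"><h3>Title</h3><div class="02"><p>Body text</p></div></div>';

$doc = new DOMDocument();
// Wrap the fragment in a full document so the parser does not have to guess the structure.
$doc->loadHTML('<html><body>' . $source . '</body></html>');

// First <div> in document order, i.e. the outer one.
$div = $doc->getElementsByTagName('div')->item(0);

// Outer HTML: serialize the node itself.
$outer = $doc->saveHTML($div);

// Inner HTML: serialize each child node and join the pieces.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $div->ownerDocument->saveHTML($child);
}

echo $outer, PHP_EOL, $inner, PHP_EOL;
```

    Because the parser knows where every tag starts and ends, nested divs stay intact and no regular expression is needed to cut the markup apart.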
     
  6. Diplomat

    Diplomat Jr. VIP Jr. VIP

    Joined:
    Oct 25, 2011
    Messages:
    948
    Likes Received:
    440
    Yeah, PHP isn't the best for a good enough web crawler. If you just use it for an app to analyze a page then it's fine, but if I were you I'd go with Python and BeautifulSoup. It's super easy.
     
  7. mainceaft

    mainceaft Regular Member

    Joined:
    Apr 10, 2013
    Messages:
    379
    Likes Received:
    39
    I posted it there and they closed it. Sometimes they are so rude.
    I use regular expressions because the internal function removes all the whitespace and squeezes the letters together. I already tried $e->plaintext, and I needed to use another function to remove the useless whitespace, like this:



    Simple HTML DOM is easy too. This issue is not related to it, but to strings and arrays.
    Anyway, here is a sample HTML source page similar to what I'm talking about.
    The real HTML page:
    http://jsfiddle.net/zumLnwzq/
    The script result:
    http://jsfiddle.net/tknwyep4/
     
  8. ekapek

    ekapek Jr. VIP Jr. VIP Premium Member

    Joined:
    Aug 2, 2010
    Messages:
    266
    Likes Received:
    47
    Don't use regex for parsing HTML content; better to use XPath or another parsing library. You will never get good results finding the content div that way. The best option for you would be to use a text extraction algorithm like https://code.google.com/p/boilerpipe/
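    A rough sketch of the idea behind such text-extraction tools. It is not boilerpipe's actual algorithm, only a much cruder heuristic: score every div by the amount of text it holds directly, ignoring text inside nested divs, and keep the highest-scoring one. It uses PHP's built-in DOMDocument; the function names ownTextLength and guessContentDiv and the sample markup are hypothetical.

```php
<?php
// Length of the text inside $node, not counting text that lives in nested <div>s.
function ownTextLength(DOMNode $node): int
{
    $length = 0;
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $length += strlen(trim(preg_replace('/\s+/', ' ', $child->nodeValue)));
        } elseif ($child instanceof DOMElement && strtolower($child->nodeName) !== 'div') {
            $length += ownTextLength($child);
        }
    }
    return $length;
}

// Return the <div> with the most text of its own, or null if the page has no divs.
function guessContentDiv(string $html): ?DOMElement
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $best = null;
    $bestScore = 0;
    foreach ($doc->getElementsByTagName('div') as $div) {
        $score = ownTextLength($div);
        if ($score > $bestScore) {
            $best = $div;
            $bestScore = $score;
        }
    }
    return $best;
}

// Sample page, invented for this sketch.
$page = '<html><body>'
    . '<div id="nav"><a href="/">Home</a></div>'
    . '<div id="main"><h3>Heading</h3>'
    . '<p>A long paragraph of article text goes here, well past the menu length.</p>'
    . '<div id="ads">ads</div></div>'
    . '</body></html>';

$content = guessContentDiv($page);
echo $content ? $content->getAttribute('id') : 'none', PHP_EOL;
```

    Counting only a div's own text keeps a wrapper from winning just because it contains everything else, which is the duplication problem described in the first post. A real extractor would also weigh things like link density, which this sketch ignores.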
     
  9. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,305
    Likes Received:
    586
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
    Learn XPath; it's super powerful for HTML traversal and data extraction.
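    For instance, a minimal sketch with PHP's built-in DOMXPath. The markup and the class name post-body are made up for illustration; a real site would need its own expression.

```php
<?php
// Sample markup, invented for this sketch.
$html = '<html><body>'
    . '<div class="sidebar"><p>Links</p></div>'
    . '<div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Every <p> that is a direct child of a div whose class attribute is exactly "post-body".
$paragraphs = [];
foreach ($xpath->query('//div[@class="post-body"]/p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```

    One expression per site replaces the loop over every div, so adding a new site means storing a single XPath string rather than writing new code.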
     
  10. premo

    premo Newbie

    Joined:
    Mar 6, 2015
    Messages:
    1
    Likes Received:
    0
    hey,

    I do this stuff all the time. I don't have 15 posts yet so I can't send you a PM, but if you can send me one or give me a way to contact you, I will write it for you and send you the script. If you provide the website and what content specifically you're trying to get, then I can assist.
     
  11. Cloakd

    Cloakd Newbie

    Joined:
    Jun 17, 2014
    Messages:
    5
    Likes Received:
    1
    If done right, PHP works perfectly for crawling websites. As Jamie said, learn XPath and implement it, then run async workers on the PHP script.
     
  12. nocare

    nocare Junior Member

    Joined:
    Apr 29, 2013
    Messages:
    164
    Likes Received:
    81
    Location:
    Deep Code
    I recently used PhpQuery for a project. Lets you use css selectors to grab things in much the same way you would with jQuery and while I still had to get into DomDocument for some things, I found it to be very nice to work with.