Which language for spider/crawler

Discussion in 'General Programming Chat' started by Flusel, Sep 14, 2012.

  1. Flusel

    Flusel Newbie

    Joined:
    Mar 20, 2010
    Messages:
    1
    Likes Received:
    0
    Hi,

    I need some advice on which language to choose for programming a spider/crawler. I need some basic functionality like URL parsing and remote file access, and I want to do it with multithreading. I know Java has some issues with its URL parsing functions and multithreading.
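    To make the "URL parsing" requirement concrete, here is a minimal sketch in Python using only the standard library. Python is picked purely for illustration, and the helper names normalize_link and same_host are hypothetical, not from any crawler library.

```python
from urllib.parse import urljoin, urlparse, urldefrag


def normalize_link(base_url, href):
    """Resolve href against base_url and drop any #fragment.

    Returns None for links a crawler would normally skip
    (mailto:, javascript:, and other non-HTTP schemes).
    """
    absolute = urljoin(base_url, href)          # resolves relative paths and ".."
    absolute, _fragment = urldefrag(absolute)   # strip "#section" anchors
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return absolute


def same_host(url_a, url_b):
    """True when both URLs point at the same host (hostnames compare lowercased)."""
    return urlparse(url_a).hostname == urlparse(url_b).hostname
```

    A crawler would typically run every href it finds through something like normalize_link before queueing it, and use a check like same_host to stay on one site.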
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,155
    Any modern language can cope with that (but it's harder to do with PHP due to its lack of proper threading). Python, Java, Ruby, even C++ are good choices.
     
  3. jamb0ss

    jamb0ss Junior Member

    Joined:
    Feb 9, 2012
    Messages:
    125
    Likes Received:
    45
    Occupation:
    Bots programming
    Python
     
  4. innozemec

    innozemec Jr. VIP

    Joined:
    Aug 19, 2011
    Messages:
    5,290
    Likes Received:
    1,799
    Location:
    www.Indexification.com
  5. fatboy

    fatboy Elite Member

    Joined:
    Aug 13, 2008
    Messages:
    1,618
    Likes Received:
    3,227
    Occupation:
    Retired
    Location:
    Old Peoples Home
    How come no one mentions Perl anymore :D
    Go with Perl: it has all the modules you will need and does multithreading... nice, easy scripting.

    Who needs these newfangled languages when you have Perl :D

    By the way - I suggest Perl :D
     
  6. seo-dude

    seo-dude BANNED

    Joined:
    Sep 4, 2012
    Messages:
    147
    Likes Received:
    56
    C++ because of performance..
     
  7. alternatesword

    alternatesword Jr. VIP

    Joined:
    Aug 25, 2012
    Messages:
    2,324
    Likes Received:
    484
    Location:
    scabbard
    Go with Perl. It has all the modules.
     
  8. cgimaster

    cgimaster Power Member

    Joined:
    Jun 30, 2012
    Messages:
    525
    Likes Received:
    311
    Gender:
    Male
    Flusel, I don't think Java has issues; I have used it with HtmlUnit and it worked just as expected.

    You can scrape with most programming and scripting languages these days. Some will give you better performance, others better readability; it really depends on how you write it and what you are going to use it for.

    If I had to recommend a language to you, I would say C# or PHP. They are more commonly used and have lots of resources, tutorials, and other learning material available, including (but not limited to) material on crawling.

    C# has libraries like HtmlAgilityPack and others for parsing HTML with ease, and libraries like Jint for JavaScript. You can also implement multithreaded browsers for content that needs it, and there is a lot more to it. It also does well with multithreading.

    PHP can be used for multithreaded work too. It works differently, by spawning child processes, but it works very well and is stable.

    It all depends on how well the developer knows the tool he is using, to ensure that the code is well suited to its purpose.
     
  9. lisper

    lisper Newbie

    Joined:
    Aug 23, 2012
    Messages:
    44
    Likes Received:
    24
    Occupation:
    Lead developer of some German research project
    Location:
    Currently Brussels, Belgium
    How come no one mentions Lisp anymore
    clj-html has all the modules you will need and does multithreading... nice, easy scripting.

    Who needs these newfangled languages when you have Lisp

    By the way - I suggest Lisp
     
  10. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,155
    Code:
    (DEFUN BECAUSE ()  "TOO MANY ()")
    :D j/k I know nothing about it. :eek: The only time I used Lisp was about two decades ago, in AutoCAD, to make a triangulation function :)
     
  11. lisper

    lisper Newbie

    Joined:
    Aug 23, 2012
    Messages:
    44
    Likes Received:
    24
    Occupation:
    Lead developer of some German research project
    Location:
    Currently Brussels, Belgium
    Hahaha! That just made my day. Valid syntax makes it even better lol.

    In all seriousness, I never understood the hate for parentheses, to be honest... Maybe I just got used to them, not sure haha.
    I switched to Clojure a year or so ago (not because of the reduced parentheses, mind you) but find myself going back to SBCL every now and then (we also have some codebase at work running on Lisp, so I'm happily hacking away).

    AutoCAD + Lisp sounds like good fun btw.
     
  12. CodingAndStuff

    CodingAndStuff Regular Member

    Joined:
    May 6, 2012
    Messages:
    236
    Likes Received:
    84
    Occupation:
    Swagstronaut
    Location:
    You can't have my bots. Sorry :'(
    I'd suggest using Python for any HTTP automation. It uses barely any resources, and has a few nice libraries that update the response buffer in real time (in case content is updated via an XHR). It will also run on any operating system (which means you can host it on your home connection, your server, etc).
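    As a rough illustration of the multithreaded fetching asked about in the first post, here is a sketch using only Python's standard library. The function names fetch and fetch_all and the default of 8 workers are assumptions made for this example, not something anyone in the thread specified.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL concurrently; return {url: body or the exception raised}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is downloading.
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not stop the rest
                results[url] = exc
    return results
```

    Passing the fetcher in as a parameter keeps the threading logic separate from the HTTP code, so a different client library can be swapped in without touching the pool.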
     
  13. botrockets

    botrockets Regular Member

    Joined:
    Mar 16, 2013
    Messages:
    272
    Likes Received:
    463
    Occupation:
    Software Developer
    Location:
    Saint T.N.
    You can create that in JavaScript and run it in any browser; that would be the easiest way.
    Other options are C#/Python/iMacros, blah blah.
     
  14. inviz

    inviz Newbie

    Joined:
    Jun 15, 2010
    Messages:
    45
    Likes Received:
    5
    Node.js, enough said ;)
     
  15. calyx239

    calyx239 Newbie

    Joined:
    Feb 11, 2013
    Messages:
    16
    Likes Received:
    2
    Perl is the king of this field, and always will be. The language was designed for this sort of thing.
     
  16. mralexander

    mralexander Newbie

    Joined:
    Dec 10, 2011
    Messages:
    18
    Likes Received:
    4
    Occupation:
    webmaster
    Location:
    world
    For doing it fast and dirty: PHP.
    For speed, use Python + libs.
     
  17. seriousjoker

    seriousjoker Registered Member

    Joined:
    Feb 12, 2013
    Messages:
    75
    Likes Received:
    33
    Java or C#. Java has no problem with either multithreading or scraping. The new fork/join framework, together with ordinary threads, is more than enough for the concurrency you need. There are very few languages that provide better multithreading than Java; Erlang is one of them.
    For scraping there is jsoup, and for web automation you can use Selenium, which can drive a real Firefox, IE, or Chrome. Because it drives your actual browser, it is very accurate and consistent with complex AJAX calls and JavaScript.
    Other benefits of Java are platform-independent code (sure, you need to tweak your code a bit), a very mature library, and a robust language.
     
  18. raatra

    raatra Newbie

    Joined:
    Jan 7, 2013
    Messages:
    12
    Likes Received:
    1
    PHP and Python are good for this purpose.
     
  19. The Real Red

    The Real Red Newbie

    Joined:
    Apr 3, 2013
    Messages:
    26
    Likes Received:
    2
    Please check out this book by Sybex.
    docsgoogle/file/d/0B39j2WChNGOuSExfb0ZNLWxRUHc/edit

    This is my personal file on my Google Drive, feel free to check it out.
     
  20. NadiHassan

    NadiHassan Registered Member

    Joined:
    Jul 15, 2010
    Messages:
    55
    Likes Received:
    5
    It depends on the approach. Theoretically speaking, you can do it in any language, but the question is which is better:

    PHP: you can thread it by spawning children. The advantage of PHP is, well, simplicity if you are going to tie it in with MySQL.

    Java: works as a better language, and if you are already using the JVM then the performance is good. Threading and connections all work well, but you need some knowledge of how to optimize it (SQL pools, thread caches).

    C#: the same as Java, but with a better regex library. However, if you are scaling and running from a VPS, drop this one: icky MS tech.

    Python: simple and effective. I haven't tried it, but most crawlers are moving to it. Ask someone else.

    C++: effective and more controllable, if you know your way around it. Mainly for advanced stuff.

    JavaScript: an option, but not a great one given its security policies.

    Ruby: no threads, just an alternative called green threads.

    Note: you can always replace threads with a queuing system one way or another.
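
    One way to read the note about queues is a crawl loop driven by a plain FIFO queue of pending URLs, with no threads at all. The sketch below is in Python and assumes that reading; crawl and the get_links callback are hypothetical names, and the real fetching and parsing would live inside get_links.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl driven by a FIFO queue instead of threads.

    get_links(url) is a hypothetical callback that returns the URLs
    found on a page; plug in real fetching and parsing there.
    """
    frontier = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}              # avoids queueing a URL twice
    visited = []                    # order in which pages were processed
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

    The same queue can later be shared between several workers (threads, child processes, or separate machines) without changing the overall shape of the loop.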

    My 2 cents :) If you need any help, PM me.