
WebCrawler project

Discussion in 'C, C++, C#' started by Chonchonts, Jul 6, 2012.

  1. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Hi,

    Currently, I'm working on a web crawler tool written in C# (part of my bigger project of SEO tools). Yes, I prefer building my own tools, it's more fun.

    Properties:

    - crawl one or more websites.
    - save data in an SQLite database, for nice SQL queries.
    - statistics and data panel
    - the user can save URLs, emails, images, forms, a specific markup, or a custom harvest with regex.
    - analyze forms for automatic submission by the user
    - save metadata
    - proxy switching
    - timeouts
    - threads
    - export data to XML, CSV, plain text, HTML.

    Do you have any suggestions?
    I want to improve it.
    Are web crawlers still widely used by black hat SEO guys these days?
     
  2. Debugger

    Debugger Junior Member

    Joined:
    Aug 16, 2009
    Messages:
    174
    Likes Received:
    34
    Location:
    India
    Can you post basic code for a web crawler? That would be very helpful.
     
  3. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Yes, I can write a little tutorial on making a basic web crawler and suggest useful tools and tips ^^ (and maybe other tutorials too, I'll think about it).
    But I won't post my complete web crawler, sorry :).
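    In the meantime, for anyone wanting a starting point, a minimal crawl loop might look like this. This is just a rough sketch (not the actual tool): a breadth-first fetch with WebClient and a naive regex link extractor, with no robots.txt or politeness handling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Minimal breadth-first crawler sketch: fetch a page, pull out absolute
// href links, and queue the ones we have not seen yet.
static class MiniCrawler
{
    static readonly Regex LinkRegex =
        new Regex("href\\s*=\\s*[\"'](http[^\"']+)[\"']", RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }

    public static void Crawl(string startUrl, int maxPages)
    {
        var seen = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && seen.Count < maxPages)
            {
                string url = queue.Dequeue();
                if (!seen.Add(url)) continue; // already crawled
                try
                {
                    string html = client.DownloadString(url);
                    foreach (string link in ExtractLinks(html))
                        if (!seen.Contains(link)) queue.Enqueue(link);
                    Console.WriteLine("Crawled: " + url);
                }
                catch (WebException) { /* skip dead links */ }
            }
        }
    }
}
```

    A real tool would use an HTML parser instead of a regex and normalize relative URLs, but this shows the core queue-and-visited-set shape.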
     
  4. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    You can get some ideas from:
    http://ncrawler.codeplex.com/
    NCrawler works OK, and although there are some bugs in it, the class structure seems well written.
     
  5. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Nice!
    The pipeline idea, extracting text from PDF files, and the filters are good.
    I'll also think about an FTP crawler and a multimedia document metadata extractor.
     
  6. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Hi!
    I'm bumping this thread because I stopped the project years ago, but I have restarted it!
    I have implemented all of this:
    - Extract a very wide range of things from any webpage: links, images, sounds, emails, proxies, PDFs, Word documents, paragraphs, Discogs links.
    - Extract tweets.
    - Use custom regexes, XPath, and CSS paths to extract other data.
    - Download all of it in one click.
    - String machine: add/remove/replace characters in strings, sort by length or by number of occurrences, map strings (HTTP links to HTML a href tags, generate YouTube iframes, ...), check if strings contain certain characters, remove HTML tags/stop words from text, trim URLs to their root, use custom conditions to select strings, etc.
    - Random machine: generate random things, like numbers, zip codes, countries, names, ages, etc.
    - In progress in the String Machine: mapping text to spun text (with WordNet).
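    To give an idea of one of the String Machine operations above, trimming a URL to its root might be sketched like this (the class and method names are just illustrative, not the actual tool's API):

```csharp
using System;

static class StringMachine
{
    // Trim a URL down to its root, e.g.
    // "http://example.com/blog/post?id=3" -> "http://example.com/"
    public static string TrimToRoot(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Authority) + "/";
    }
}
```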

    I specialize in data mining and image processing, so I plan to include those in my web crawler. Do you think that would be useful for SEO?
    For example:
    - Extract keywords, or general concepts, from a webpage.
    - Find categories/popular keywords to describe a PDF/Doc document.
    - Group similar webpages.
    - Detect sentiment in text.
    - Detect objects in pictures (faces, cars, etc.), or classify them.

    I want to sell it once a first version is finished. How much does this kind of bot (with these features) cost?
     
  7. rootjazz

    rootjazz Jr. VIP Jr. VIP

    Joined:
    Dec 21, 2012
    Messages:
    614
    Likes Received:
    313
    Occupation:
    Developer
    Location:
    UK
    Home Page:
    Is your web crawler processing JavaScript? Bit of a must these days, in my opinion.

    As for selling it, a lot depends. Who are you targeting? Beginner users or technical? If technical, they will appreciate the more advanced features, but then again, they could code it themselves or use an open source solution.

    You could go for novices, and your USP is providing great support in getting them started with web crawling. But people will only pay X dollars if they think they will get X + Y dollars back. So you need to figure out how people will make money from your crawler and then find those users and sell to them.



    If I am honest, it seems you are making the program from an enjoyment point of view, without too much of a monetisation point of view. Which is fine of course. If the goal is to build it first and foremost and sell it as an aside - great. If the goal is to sell it and make it profitable, it sounds like you have a lot of research to do, to find out what features potential customers want first, regardless of what you want to add.


    As developers, I think most of us fall foul of this and code features we want to code, regardless of
    a) will this feature make sales
    b) will customers use this feature


    Perhaps try to size up the competition among paid web crawlers, check the freelance sites to see what web crawling software projects there are, and see what comes up time and time again. Make that easy to do, and you have your market. Bid on the projects, complete them easily and cheaply, then push your sales pitch on how these people could use your software instead of hiring freelancers.
     
  8. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Thanks for all the tips, I am starting to look at the other jobs in web crawling to adapt my features :p.

    My web crawler can crawl HTML fast with WebClient in C# (if you don't care about JavaScript), or activate JavaScript processing (with Awesomium).
    I also have a section for automated actions with Selenium (open browsers, click, fill text inputs, navigate, etc.) if the user wants to perform specific tasks with visual output :p.
     
  9. revproxy

    revproxy Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 20, 2015
    Messages:
    330
    Likes Received:
    86
    Gender:
    Male
    Occupation:
    Developer, Software Architect
    Home Page:
    If you can mess with C++...
    I wrote great crawlers at my work with QtWebKit...
    You can extend & hack a real browser; you can learn from the PhantomJS code on GitHub.
     
  10. NullReferenceX

    NullReferenceX Newbie

    Joined:
    Dec 1, 2015
    Messages:
    41
    Likes Received:
    82
    Occupation:
    Programmer
    Location:
    Germany
    I would not advise trying to process any JavaScript; just go for a pure sockets implementation. WebClient itself is messy and has too much overhead to be useful for any SEO tool. You can cut CPU usage by about 40-60% this way; WebClient just is not designed for scalability. Also, don't try to cut corners with packages like Chilkat HTTP, it will cause more pain in the end. I personally also find that Chilkat has bad support for error handling.

    Just spend a few days implementing HTTP on top of sockets and you will have a library with the fastest execution time and the smallest footprint.
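    To show roughly what that looks like, here is a bare-bones HTTP/1.1 GET on top of TcpClient. This is only a sketch: real code would need chunked transfer decoding, redirects, TLS, keep-alive, and proper header parsing.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Minimal HTTP GET over a raw TCP socket, per RFC 7230's message format.
static class RawHttp
{
    // Build the raw request text; kept separate so it is easy to inspect.
    public static string BuildGetRequest(string host, string path)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "Connection: close\r\n" +
               "\r\n";
    }

    public static string Get(string host, string path)
    {
        using (var client = new TcpClient(host, 80))
        using (var stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildGetRequest(host, path));
            stream.Write(request, 0, request.Length);
            using (var reader = new StreamReader(stream, Encoding.ASCII))
                return reader.ReadToEnd(); // status line + headers + body
        }
    }
}
```

    With `Connection: close` the server ends the stream when the response is done, which keeps the reading side trivial; a scalable crawler would use keep-alive and async sockets instead.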
     
    • Thanks Thanks x 1
  11. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    What issues have you had with Chilkat, out of interest? I agree the error handling isn't great, but I used it for a WordPress project the other day and it seems a lot of the old annoying bugs are mostly gone. I did have to implement my own custom timeout and page size check, though.
     
  12. NullReferenceX

    NullReferenceX Newbie

    Joined:
    Dec 1, 2015
    Messages:
    41
    Likes Received:
    82
    Occupation:
    Programmer
    Location:
    Germany
    Over time I had a lot of issues with Chilkat: it would not work when a cookie header had a ; in it, and there were random AccessViolationException crashes that they fixed with new updates.

    About a week ago I made a Facebook tool for a client and used Chilkat, and every time there was an exception stating that Chilkat's unmanaged memory was corrupted. So I contacted support, and they told me they would look into it and an update could take 4-7 days. At that point I felt it was time to roll my own, and to be honest it was very simple if you just stick to the RFCs.

    But the error handling in Chilkat bothers me most of all: it's bulky and not helpful at all at run time. While you're developing it's fine, but you want your apps to make decisions based on an error code or the type of exception you get. With Chilkat you have to parse their log to find out what error you got... It could be a proxy not working or whatever, but the extra code you need to put around it to extract that information is very frustrating. Plus, the log files can use a ton of memory when running many parallel threads, and by disabling them you save memory but can't troubleshoot during execution anymore.

    My two cents on Chilkat.
     
  13. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:
    I would suggest adding a JavaScript rendering feature, because most websites nowadays render data on the client side using JavaScript!
     
  14. ToxicBlack

    ToxicBlack Regular Member

    Joined:
    Mar 25, 2016
    Messages:
    223
    Likes Received:
    54
    Occupation:
    Programming custom bots and tools.
    Location:
    botland
    Yeah, JavaScript rendering is a must for today's tools.

    You can use Selenium as a bridge to real browsers (Firefox/Chrome) or to headless PhantomJS.
     
  15. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:

    Nice suggestions; however, I prefer PhantomJS. :)
     
  16. ToxicBlack

    ToxicBlack Regular Member

    Joined:
    Mar 25, 2016
    Messages:
    223
    Likes Received:
    54
    Occupation:
    Programming custom bots and tools.
    Location:
    botland
    Well, it depends. I don't have a lot of real-world experience with it, but some people have had problems getting "caught" with PhantomJS. It depends on what you need to do with it.
     
  17. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:
    Randomizing the user agent string means the software will NOT leave that sort of footprint while crawling!
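    A sketch of what that might look like: pick a random user agent from a pool for each request. The strings below are just short examples; a real tool would load a larger, up-to-date list.

```csharp
using System;

// Rotate user agents to avoid a constant, recognizable crawler signature.
static class UserAgents
{
    public static readonly string[] Pool =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11)",
        "Mozilla/5.0 (X11; Linux x86_64)"
    };

    static readonly Random Rng = new Random();

    // Return a randomly chosen user agent for the next request.
    public static string Next()
    {
        return Pool[Rng.Next(Pool.Length)];
    }
}
```

    With WebClient you would then set it per request via `client.Headers[HttpRequestHeader.UserAgent] = UserAgents.Next();`.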