
WebCrawler project

Discussion in 'C, C++, C#' started by Chonchonts, Jul 6, 2012.

  1. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Hi,

    Currently, I'm working on a web crawler tool written in C# (part of my bigger project of SEO tools). Yes, I prefer building my own tools, it's more fun.

    Properties:

    - crawl one or more websites.
    - save data in an SQLite database, for nice SQL queries.
    - statistics and data panel
    - the user can save URLs, emails, images, forms, a specific markup, or a custom harvest with regex.
    - analyze forms for automatic submission by the user
    - save metadata
    - proxy switching
    - timeouts
    - threads
    - export data to XML, CSV, plain text, HTML.

    Do you have any suggestions?
    I want to improve it.
    Are web crawlers still widely used by black hat SEO guys these days?
     
  2. Debugger

    Debugger Junior Member

    Joined:
    Aug 16, 2009
    Messages:
    174
    Likes Received:
    34
    Location:
    India
    Can you post basic code for a web crawler? That would be very helpful.
     
  3. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Yes, I can write a little tutorial on making a basic web crawler and suggest useful tools and tips ^^ (and maybe other tutorials too, I'll think about it).
    But I won't post my complete web crawler, sorry :).
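    In the meantime, for anyone wanting a starting point, a minimal crawl loop might look like this. This is just a rough sketch (not the actual tool): a breadth-first fetch with WebClient and a naive regex link extractor, with no robots.txt or politeness handling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Minimal breadth-first crawler sketch: fetch a page, pull out absolute
// href links, and queue the ones we have not seen yet.
static class MiniCrawler
{
    static readonly Regex LinkRegex =
        new Regex("href\\s*=\\s*[\"'](http[^\"']+)[\"']", RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }

    public static void Crawl(string startUrl, int maxPages)
    {
        var seen = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && seen.Count < maxPages)
            {
                string url = queue.Dequeue();
                if (!seen.Add(url)) continue; // already crawled
                try
                {
                    string html = client.DownloadString(url);
                    foreach (string link in ExtractLinks(html))
                        if (!seen.Contains(link)) queue.Enqueue(link);
                    Console.WriteLine("Crawled: " + url);
                }
                catch (WebException) { /* skip dead links */ }
            }
        }
    }
}
```

    A real tool would use an HTML parser instead of a regex and normalize relative URLs, but this shows the core queue-and-visited-set shape.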
     
  4. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    You can get some ideas from:
    http://ncrawler.codeplex.com/
    NCrawler works OK, and although there are some bugs in it, the class structure seems well written.
     
  5. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Nice!
    The pipeline idea, extracting text from PDF files, and the filters are good.
    I'll also think about an FTP crawler and a multimedia document metadata extractor.
     
  6. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Hi!
    I'm bumping this thread because I stopped the project years ago, but I have restarted it!
    I have implemented all of this:
    - Extract a very wide range of things from any webpage: links, images, sounds, emails, proxies, PDFs, Word documents, paragraphs, Discogs links.
    - Extract tweets.
    - Use custom regexes, XPath, and CSS paths to extract other data.
    - Download all of it in one click.
    - String machine: add/remove/replace characters in strings, sort by length or by number of occurrences, map strings (HTTP links to HTML a href tags, generate YouTube iframes, ...), check if strings contain certain characters, remove HTML tags/stop words from text, trim URLs to their root, use custom conditions to select strings, etc.
    - Random machine: generate random things, like numbers, zip codes, countries, names, ages, etc.
    - In progress in the String Machine: mapping text to spun text (with WordNet).
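    To give an idea of one of the String Machine operations above, trimming a URL to its root might be sketched like this (the class and method names are just illustrative, not the actual tool's API):

```csharp
using System;

static class StringMachine
{
    // Trim a URL down to its root, e.g.
    // "http://example.com/blog/post?id=3" -> "http://example.com/"
    public static string TrimToRoot(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Authority) + "/";
    }
}
```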

    I specialize in data mining and image processing, so I plan to include those in my web crawler. Do you think that would be useful for SEO?
    For example:
    - Extract keywords, or general concepts, from a webpage.
    - Find categories/popular keywords to describe a PDF/Doc document.
    - Group similar webpages.
    - Detect sentiment in text.
    - Detect objects in pictures (faces, cars, etc.), or classify them.

    I want to sell it once a first version is finished. How much does this kind of bot (with these features) cost?
     
  7. rootjazz

    rootjazz Jr. VIP Jr. VIP

    Joined:
    Dec 21, 2012
    Messages:
    614
    Likes Received:
    313
    Occupation:
    Developer
    Location:
    UK
    Home Page:
    Is your web crawler processing JavaScript? Bit of a must these days, in my opinion.

    As for selling it, a lot depends. Who are you targeting? Beginner users or technical? If technical, they will appreciate the more advanced features, but then again, they could code it themselves or use an open source solution.

    You could go for novices, and your USP is providing great support in getting them started with web crawling. But people will only pay X dollars if they think they will get X + Y dollars back. So you need to figure out how people will make money from your crawler and then find those users and sell to them.



    If I am honest, it seems you are making the program from an enjoyment point of view, without too much of a monetisation point of view. Which is fine of course. If the goal is to build it first and foremost and sell it as an aside - great. If the goal is to sell it and make it profitable, it sounds like you have a lot of research to do, to find out what features potential customers want first, regardless of what you want to add.


    As developers, I think most of us fall foul of this and code features we want to code, regardless of
    a) will this feature make sales
    b) will customers use this feature


    Perhaps try to size up the competition among paid web crawlers, check the freelance sites to see what web crawling software projects there are, and see what comes up time and time again. Make that easy to do, and you have your market. Bid on the projects, complete them easily and cheaply, then push your sales pitch on how these people could use your software instead of hiring freelancers.
     
  8. Chonchonts

    Chonchonts Newbie

    Joined:
    Jul 2, 2012
    Messages:
    9
    Likes Received:
    0
    Thanks for all the tips, I am starting to look at the other jobs in web crawling to adapt my features :p.

    My web crawler can crawl HTML fast with WebClient in C# (if you don't care about JavaScript), or activate JavaScript processing (with Awesomium).
    I also have a section for automated actions with Selenium (open browsers, click, fill text inputs, navigate, etc.) if the user wants to perform specific tasks with visual output :p.
     
  9. revproxy

    revproxy Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 20, 2015
    Messages:
    330
    Likes Received:
    86
    Gender:
    Male
    Occupation:
    Developer, Software Architect
    Home Page:
    If you can mess with C++...
    I wrote great crawlers at my work with QtWebKit...
    You can extend & hack a real browser; you can learn from the PhantomJS code on GitHub.
     
  10. NullReferenceX

    NullReferenceX Newbie

    Joined:
    Dec 1, 2015
    Messages:
    41
    Likes Received:
    82
    Occupation:
    Programmer
    Location:
    Germany
    I would not advise trying to process any JavaScript; just go for a pure sockets implementation. WebClient itself is messy and has too much overhead to be useful for any SEO tool. You can cut CPU usage by about 40-60% this way; WebClient just is not designed for scalability. Also, don't try to cut corners with packages like Chilkat HTTP, it will cause more pain in the end. I personally also find that Chilkat has bad support for error handling.

    Just spend a few days implementing HTTP on top of sockets and you will have a library with the fastest execution time and the smallest footprint.
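    To show roughly what that looks like, here is a bare-bones HTTP/1.1 GET on top of TcpClient. This is only a sketch: real code would need chunked transfer decoding, redirects, TLS, keep-alive, and proper header parsing.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Minimal HTTP GET over a raw TCP socket, per RFC 7230's message format.
static class RawHttp
{
    // Build the raw request text; kept separate so it is easy to inspect.
    public static string BuildGetRequest(string host, string path)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "Connection: close\r\n" +
               "\r\n";
    }

    public static string Get(string host, string path)
    {
        using (var client = new TcpClient(host, 80))
        using (var stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildGetRequest(host, path));
            stream.Write(request, 0, request.Length);
            using (var reader = new StreamReader(stream, Encoding.ASCII))
                return reader.ReadToEnd(); // status line + headers + body
        }
    }
}
```

    With `Connection: close` the server ends the stream when the response is done, which keeps the reading side trivial; a scalable crawler would use keep-alive and async sockets instead.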
     
    • Thanks Thanks x 1
  11. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    What issues have you had with Chilkat, out of interest? I agree the error handling isn't great, but I used it for a WordPress project the other day and it seems a lot of the old annoying bugs are mostly gone. I did have to implement my own custom timeout and page size check, though.
     
  12. NullReferenceX

    NullReferenceX Newbie

    Joined:
    Dec 1, 2015
    Messages:
    41
    Likes Received:
    82
    Occupation:
    Programmer
    Location:
    Germany
    Over time I had a lot of issues with Chilkat: it would not work when a cookie header had a ; in it, and there were random AccessViolationException crashes that they fixed with new updates.

    About a week ago I made a Facebook tool for a client and used Chilkat, and every time there was an exception stating that Chilkat's unmanaged memory was corrupted. So I contacted support, and they told me they would look into it and an update could take 4-7 days. At that point I felt it was time to roll my own, and to be honest it was very simple if you just stick to the RFCs.

    But the error handling in Chilkat bothers me most of all: it's bulky and not helpful at all at run time. While you're developing it's fine, but you want your apps to make decisions based on an error code or the type of exception you get. With Chilkat you have to parse their log to find out what error you got... It could be a proxy not working or whatever, but the extra code you need to put around it to extract that information is very frustrating. Plus, the log files can use a ton of memory when running many parallel threads, and by disabling them you save memory but can't troubleshoot during execution anymore.

    My two cents on Chilkat.
     
  13. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:
    I would suggest adding a JavaScript rendering feature, because most websites nowadays render data on the client side using JavaScript!
     
  14. ToxicBlack

    ToxicBlack Regular Member

    Joined:
    Mar 25, 2016
    Messages:
    223
    Likes Received:
    54
    Occupation:
    Programming custom bots and tools.
    Location:
    botland
    Yeah, JavaScript rendering is a must for today's tools.

    You can use Selenium as a bridge to real browsers (Firefox/Chrome) or to headless PhantomJS.
     
  15. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:

    Nice suggestions; however, I prefer PhantomJS. :)
     
  16. ToxicBlack

    ToxicBlack Regular Member

    Joined:
    Mar 25, 2016
    Messages:
    223
    Likes Received:
    54
    Occupation:
    Programming custom bots and tools.
    Location:
    botland
    Well, it depends. I don't have a lot of real-world experience with it, but some people have had problems getting "caught" with PhantomJS. It depends on what you need to do with it.
     
  17. Dev Warrior

    Dev Warrior Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    194
    Likes Received:
    26
    Home Page:
    Randomizing the user agent string means the software will NOT leave that sort of footprint while crawling!
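    A sketch of what that might look like: pick a random user agent from a pool for each request. The strings below are just short examples; a real tool would load a larger, up-to-date list.

```csharp
using System;

// Rotate user agents to avoid a constant, recognizable crawler signature.
static class UserAgents
{
    public static readonly string[] Pool =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11)",
        "Mozilla/5.0 (X11; Linux x86_64)"
    };

    static readonly Random Rng = new Random();

    // Return a randomly chosen user agent for the next request.
    public static string Next()
    {
        return Pool[Rng.Next(Pool.Length)];
    }
}
```

    With WebClient you would then set it per request via `client.Headers[HttpRequestHeader.UserAgent] = UserAgents.Next();`.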