Sunday, May 3, 2009
Online Business & Custom Search Engines Session for CuttingEdge (29th April, 2009)
Monday, September 29, 2008
Introduction to Web Crawling
Why is this post so important?
I'm not going to explain why you should learn web crawling, or how much can be done and how much money can be made with a good crawler. If you are here, you already need to crawl. And your being here doesn't mean I'm one of the best bloggers around; it means that THERE IS NO HELP ON THIS TOPIC. That is why this blog exists: people need web crawling and can't do it because there is no help. I started searching for material, just like you, and found almost nothing, so I started working it out on my own, and here I am publishing what I know for everyone.
As I said, there aren't many resources. Here are the few I found and used:
1-Book: "HTTP Programming Recipes for C# Bots" by Jeff Heaton
This may be the only book that covers web crawling from a developer's point of view, but I believe it doesn't go deep enough to prepare you for real work. I read it several times, and the crawlers I wrote could always access the author's website, but they couldn't access other websites because, as I said, the book doesn't go deep enough.
2-Web Article: Tools for access site .NET
This may be the only post I found that talks about real crawler development. It lists great tools for crawler development and describes one crawler that logs in to Yahoo Address (source code included), but only superficially: it doesn't explain how to use the listed tools and includes no code snippets. I use most of the tools listed in that post and they are all very useful. On this blog I'll try to cover each tool and how to work with it in more detail, to get you on the road quickly.
3-Web Crawler, spider, ant, bot... how to make one?
Another interesting article that gives a complete crawler example in VB.NET. Again, I don't think the article will help you write your own crawler for your own purpose, but it is still worth reading since, as I said before, there aren't many resources.
4-My own experience
I work as a Software Engineer at ITWorx and I'm also a freelancer. I have developed several crawlers for many websites; here is a list of the recent ones:
1-Yahoo Answers Crawler
2-People search crawler that gathers information from
- Yahoo People
- Lycos People Search
- Peopledata.com
- Superpages.com
and lots of other crawlers (you can check my recent freelancing projects here). I believe that sharing my experience in this field will be very beneficial to you; that's why I'm writing this blog.
What are the tools needed to write your own crawler?
I recommend all the tools listed in "Tools for access site .NET", plus a few others.
Here they are for completeness:
7. Wireshark (another network analysis tool; I'll explain later why I need two network analysis programs)
8. Visual Studio 2005 or later (I'll primarily use C# on .NET 2.0 to build my crawlers, and I may later add a Java version of the crawlers I build)
Finally
Please try to read all the resources and get all the tools listed above so you are ready for my later posts.
Comments are very welcome on this post or any later one. I'll try to start writing my next posts as soon as I can.