Sunday, May 3, 2009

Online Business & Custom Search Engines Session for CuttingEdge (29th April, 2009)

I gave a session about "Online Business" and "Custom Search Engines" for the CuttingEdge club. The attendees were really interactive during the whole session. Unfortunately, I didn't have the opportunity to get their feedback, so hopefully, if you attended the session and are viewing my blog now, you'll leave me a note :)

Here is the material:



Online Business




Custom Search Engines



Monday, September 29, 2008

Introduction to Web Crawling


Why is this post so important?





Actually, I'm not going to explain why we need to learn web crawling, or how much could be done and how much money could be made with a good crawler. I believe that if you are here, you need to crawl. And your being here doesn't mean that I'm one of the best bloggers around; it means that THERE IS NO HELP ON THIS TOPIC. That's why this blog exists: people need web crawling, and they can't do it because there is no help. I started searching for material, just like you, and found almost nothing, so I started doing it on my own, and here I am publishing what I know for everyone.

What are my resources?

As I said, there aren't many resources. However, here are the few I found and used:

1-Book: "HTTP Programming Recipes for C# Bots" by Jeff Heaton

This is maybe the only book that talks about web crawling from a developer's point of view. However, I believe it doesn't go deep enough to prepare you for real work. I read the book several times and was always able to access the author's website with crawlers I wrote, but I wasn't able to access other websites because, as I said, the book doesn't go deep enough.


2-Web Article: Tools for access site .NET

This is maybe the only post I found that talks about developing real crawlers. It lists some great tools for crawler development and describes one crawler that logs in to Yahoo Address (with source code included), but only in a very superficial way: there is no description of how to use the listed tools, and no code snippets. I use most of the tools listed in that post and they are all very beneficial. On my blog I'll try to cover each tool and how to deal with it in more detail, in order to get you on the road very fast.


3-Web Crawler, spider, ant, bot... how to make one?

Another interesting article that gives a complete crawler example in VB.NET. Again, I don't think the article can help you write your own crawler for your own purpose, but reading it is still beneficial, as there aren't a lot of resources, like I said before.


4-My own experience

I work as a Software Engineer at ITWorx, besides being a freelancer. I have developed several crawlers for many websites, and here is a list of the recent ones:

1- Yahoo Answers crawler
2- People search crawler that gathers information from:

  • Yahoo People
  • Lycos People Search
  • Peopledata.com
  • Superpages.com
3- Script injection finder, in which the crawler scans a list of websites for injected scripts -Cross Site Scripting-

I have built lots of other crawlers as well -you can check my recent freelancing projects here-. I believe that sharing my experience in this field will be very beneficial to you; that's why I'm writing this blog.

What are the tools needed to write your own crawler?

Actually, I recommend all the tools listed in "Tools for access site .NET", plus a few other tools.
Here they are, for completeness:

1. Mozilla's Firefox -the browser that your bot will simulate-
2. Microsoft's Fiddler -network analysis tool-
3. RAD Software's Regular Expression Designer -for extracting important data from web pages-
4. Piriform's CCleaner -for clearing out your cookies (it's not that important)-
5. Mozilla's Firefox Addon: Web Developer -will help you in analyzing the pages-
6. Mozilla's Firefox Addon: Firebug -actually I didn't use it during my work-
Besides these tools, my suggestions:
7. Wireshark -another network analysis tool. I'll explain later why I need two network analysis programs-
8. Visual Studio 2005 or later -I'll primarily use C# .NET 2.0 to build my crawlers, and I may later add a Java version of the crawlers I build-
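
To give a feeling for how two of these tools fit together, here is a minimal sketch of the two basic steps: looking like a browser, and pulling data out of a page with a regular expression. It is only an illustration and not the code of any crawler mentioned above. It is written in Python to keep it short, while the crawlers on this blog are planned in C#. The User-Agent string, the link pattern, and the function names are all assumptions made up for this example.

```python
import re
import urllib.request

# Example User-Agent only (an assumption): copy the exact string your own
# browser sends, which a tool like Fiddler will show you.
FIREFOX_USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) "
    "Gecko/2008092417 Firefox/3.0.3"
)

# Illustrative pattern: captures the href value of anchor tags.
# Real pages are messier, so a pattern like this usually needs tuning per site.
LINK_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def build_request(url):
    """Prepare (but do not send) a request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": FIREFOX_USER_AGENT})


def extract_links(html):
    """Return every href value found in anchor tags, in document order."""
    return LINK_PATTERN.findall(html)


if __name__ == "__main__":
    # Works on a fixed string, so nothing is downloaded here.
    sample = '<p><a href="/page1">One</a> <A class="x" HREF=\'/page2\'>Two</A></p>'
    print(extract_links(sample))
```

Passing the object returned by build_request to urllib.request.urlopen would actually fetch the page. The sketch stops before that so it can be tried without touching any website.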

Finally

Please try to read all the resources and get all the tools I listed above, in order to be ready for my later posts.

Comments are very welcome on this post or any later one. I'll try to start writing my next posts as soon as I can.