<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4135307179538838010</id><updated>2011-11-27T16:05:49.623-08:00</updated><title type='text'>Crawl The Web</title><subtitle type='html'>This blog is mainly for developers who are searching for material to help them with web crawling. I'll try to list everything needed for that.

Examples and illustrations of the tools will be provided.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://crawltheweb.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4135307179538838010/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://crawltheweb.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Mostafa Siraj</name><uri>http://www.blogger.com/profile/18276443600709227885</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://1.bp.blogspot.com/_x4LAhW8QREc/SN_4UUBiAlI/AAAAAAAAAXY/KR_LaYV0UNE/S220/Image(416).jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>2</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4135307179538838010.post-8521716323097264563</id><published>2009-05-03T01:05:00.000-07:00</published><updated>2009-05-03T01:16:38.790-07:00</updated><title type='text'>Online Business &amp; Custom Search Engines Session for CuttingEdge (29th April, 2009)</title><content type='html'>I gave a session on "Online Business" and "Custom Search Engines" for the CuttingEdge club. The attendees were really interactive throughout the session. Unfortunately, I didn't get the chance to hear their feedback, so if you attended the session and are reading my blog now, I hope you'll leave me a note :)&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is the material:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;h3&gt;Online Business&lt;/h3&gt;&lt;br /&gt;&lt;iframe src="http://docs.google.com/EmbedSlideshow?docid=dcn76rt7_943dfv7v3fw" frameborder="0" width="410" 
height="342"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Custom Search Engines&lt;/h3&gt;&lt;br /&gt;&lt;iframe src='http://docs.google.com/EmbedSlideshow?docid=dcn76rt7_957fd5dg4gt' frameborder='0' width='410' height='342'&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4135307179538838010-8521716323097264563?l=crawltheweb.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://crawltheweb.blogspot.com/feeds/8521716323097264563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4135307179538838010&amp;postID=8521716323097264563' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4135307179538838010/posts/default/8521716323097264563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4135307179538838010/posts/default/8521716323097264563'/><link rel='alternate' type='text/html' href='http://crawltheweb.blogspot.com/2009/05/online-business-custom-search-engines.html' title='Online Business &amp; Custom Search Engines Session for CuttingEdge (29th April, 2009)'/><author><name>Mostafa Siraj</name><uri>http://www.blogger.com/profile/18276443600709227885</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://1.bp.blogspot.com/_x4LAhW8QREc/SN_4UUBiAlI/AAAAAAAAAXY/KR_LaYV0UNE/S220/Image(416).jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4135307179538838010.post-64733397591913899</id><published>2008-09-29T05:53:00.000-07:00</published><updated>2008-10-22T06:24:25.797-07:00</updated><title type='text'>Introduction to Web Crawling</title><content type='html'>&lt;span style="font-weight: bold; color: rgb(51, 102, 255);font-size:180%;" 
&gt;&lt;br /&gt;Why is this post so important?&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;I'm not going to explain why you should learn web crawling, or how much can be done and how much money can be made with a good crawler, because I assume that since you are here, you already need to crawl. Your being here doesn't mean that I'm one of the best bloggers around; it means that THERE IS NO HELP ON THIS TOPIC. That's why this blog exists: people need web crawling and can't do it because there is no help. 
I started searching for material, just like you, and found almost nothing, so I started working it out on my own, and here I am publishing what I know for everyone.&lt;/p&gt;&lt;span style="font-weight: bold; color: rgb(51, 102, 255);font-size:180%;" &gt;What are my resources&lt;/span&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold; color: rgb(51, 102, 255);"&gt;?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As I said, there aren't many resources, but here are the few I found and used.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;span style="font-weight: bold; color: rgb(0, 0, 0);font-size:130%;" &gt;1-Book: "&lt;a style="color: rgb(0, 0, 0);" href="http://www.heatonresearch.com/book/http-programming-csharp.html"&gt;HTTP Programming Recipes for C# Bots&lt;/a&gt;" by Jeff Heaton&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is perhaps the only book that covers web crawling from a developer's point of view, but I believe it doesn't go deep enough to prepare you for real work. I read the book several times and could always access the author's website with crawlers I wrote, but I couldn't access other websites because, as I said, the book doesn't go deep enough.&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;a style="color: rgb(0, 0, 0);" href="http://www.ideabubbling.com/Article1.aspx"&gt;&lt;span style="color: rgb(0, 0, 0);font-size:130%;" &gt;&lt;span style="font-weight: bold;"&gt;2-Web Article: Tools for access site .NET&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is perhaps the only post I found that talks about real crawler development. It lists great tools for crawler development and describes one crawler that logs in to Yahoo Address (source code included), though only superficially: there is no description of how to use the listed tools, and there are no code snippets. 
I use most of the tools listed in that post, and they are all very useful. On this blog I'll try to cover each tool and how to use it in more detail, to get you on the road quickly.&lt;/p&gt;&lt;br /&gt;&lt;a style="color: rgb(0, 0, 0);" href="http://www.beansoftware.com/NET-Tutorials/Web-Crawler.aspx"&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;3-Web Crawler, spider, ant, bot... how to make one?&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;p&gt;Another interesting article, which gives a complete crawler example in VB.NET. Again, I don't think it can help you write your own crawler for your own purposes, but it is still worth reading since, as I said, there aren't many resources.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 0, 0);font-size:130%;" &gt;&lt;a href="http://www.getafreelancer.com/users/feedback_354604.html"&gt;4-My own experience&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I work as a Software Engineer at &lt;a href="http://www.itworx.com/"&gt;ITWorx&lt;/a&gt; and I am also a &lt;a href="http://www.getafreelancer.com/users/354604.html"&gt;freelancer&lt;/a&gt;. I have developed several crawlers for many websites; here is a list of the recent ones:&lt;br /&gt;&lt;br /&gt;1-Yahoo Answers Crawler&lt;br /&gt;2-People search crawler that gathers information from&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Yahoo People&lt;/li&gt;&lt;li&gt;Lycos People Search&lt;/li&gt;&lt;li&gt;Peopledata.com&lt;/li&gt;&lt;li&gt;Superpages.com&lt;/li&gt;&lt;/ul&gt;3- &lt;a href="http://www.getafreelancer.com/projects/PHP-ASP/Find-Script-Injections.html"&gt;Script Injection finder&lt;/a&gt;, in which the crawler scans a list of websites for injected scripts (Cross-Site Scripting)&lt;br /&gt;&lt;br /&gt;and lots of other crawlers that I built (you can check my recent freelancing projects &lt;a href="http://www.getafreelancer.com/users/feedback_354604.html"&gt;here&lt;/a&gt;). I believe 
that sharing my experience in this field will be very beneficial to you, which is why I'm writing this blog.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;span style="font-weight: bold; color: rgb(51, 102, 255);font-size:180%;" &gt;What are the tools needed to write your own crawler?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I recommend all the tools listed in "&lt;a href="http://www.ideabubbling.com/Article1.aspx"&gt;Tools for access site .NET&lt;/a&gt;", plus a few others.&lt;br /&gt;Here they are, for completeness:&lt;br /&gt;&lt;br /&gt;&lt;div&gt;1. &lt;a href="http://www.mozilla.com/" target="_blank"&gt;Mozilla's Firefox&lt;/a&gt; (the browser that your bot will simulate)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2. &lt;a href="http://www.fiddlertool.com/" target="_blank"&gt;Microsoft's Fiddler&lt;/a&gt; (network analysis tool)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3. &lt;a href="http://www.radsoftware.com.au/"&gt;RAD Software's Regular Expression Designer&lt;/a&gt; (for extracting important data from web pages)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4. &lt;a href="http://www.ccleaner.com/" target="_blank"&gt;Piriform's CCleaner&lt;/a&gt; (for clearing out your cookies; it's not that important)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;5. &lt;a href="https://addons.mozilla.org/en-US/firefox/addon/60" target="_blank"&gt;Mozilla's Firefox Addon: Web Developer&lt;/a&gt; (helps you analyze the pages)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;6. 
&lt;a href="https://addons.mozilla.org/en-US/firefox/addon/1843" target="_blank"&gt;Mozilla's Firefox Addon: Firebug&lt;/a&gt; (I didn't actually use it in my work)&lt;br /&gt;&lt;/div&gt;In addition to these tools, my own suggestions:&lt;br /&gt;7. &lt;a href="http://www.wireshark.org/"&gt;Wireshark&lt;/a&gt; (another network analysis tool; I'll explain later why I need two network analysis programs)&lt;br /&gt;8. &lt;a href="http://msdn.microsoft.com/en-us/vstudio/products/aa700831.aspx"&gt;Visual Studio 2005 or later&lt;/a&gt; (I'll primarily use C# with .NET 2.0 to build my crawlers, and I may later add a Java version of them)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(51, 102, 255);font-size:180%;" &gt;Finally&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Please try to read all the resources and get all the tools listed above so that you are ready for my later posts.&lt;br /&gt;&lt;br /&gt;Comments are very welcome on this post or any later one. I'll try to start writing my next posts as soon as I can.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4135307179538838010-64733397591913899?l=crawltheweb.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://crawltheweb.blogspot.com/feeds/64733397591913899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4135307179538838010&amp;postID=64733397591913899' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4135307179538838010/posts/default/64733397591913899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4135307179538838010/posts/default/64733397591913899'/><link rel='alternate' type='text/html' href='http://crawltheweb.blogspot.com/2008/09/introduction-to-web-crawling.html' 
title='Introduction to Web Crawling'/><author><name>Mostafa Siraj</name><uri>http://www.blogger.com/profile/18276443600709227885</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://1.bp.blogspot.com/_x4LAhW8QREc/SN_4UUBiAlI/AAAAAAAAAXY/KR_LaYV0UNE/S220/Image(416).jpg'/></author><thr:total>4</thr:total></entry></feed>
