I'm working on some projects to help webmasters maintain their sites. The tools include a link checker, a Del.icio.us auto submitter, a W3C HTML validator auto submitter, and many more. One component I need to build these tools is a web crawler (spider). It builds a page index, which the higher-level components then use to perform their tasks.
I've searched the Web for such a system, but I couldn't find one that suits my needs. Many of the systems I found crawl only a single page, while I need a crawler that covers the entire site and offers extra options, such as directories to include or skip and file types to include or skip.
So I decided to write my own web crawler. The system is designed to crawl all pages of a given site and parse the title, keywords and description tags from each page. Each page's info and its URL are then saved in a database for future use.
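To make the include/skip options concrete, here is a minimal sketch of a URL filter. The function name `should_crawl` and the option keys (`skip_dirs`, `include_dirs`, `skip_types`, `include_types`) are my own assumptions for illustration, not part of any existing library.

```php
<?php
// Hypothetical helper: decide whether a URL should be crawled, based on
// directory and file-type rules. All names here are illustrative.
function should_crawl(string $url, array $options): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/'; // no path component: treat as the site root
    }

    // Reject any path that starts with a skipped directory.
    foreach ($options['skip_dirs'] ?? [] as $dir) {
        if (strpos($path, $dir) === 0) {
            return false;
        }
    }

    // If an include list is given, the path must start with one of its entries.
    $includeDirs = $options['include_dirs'] ?? [];
    if ($includeDirs) {
        $matched = false;
        foreach ($includeDirs as $dir) {
            if (strpos($path, $dir) === 0) {
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            return false;
        }
    }

    // Paths without an extension (for example "/about/") are treated as pages.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext === '') {
        return true;
    }
    if (in_array($ext, $options['skip_types'] ?? [], true)) {
        return false;
    }
    $includeTypes = $options['include_types'] ?? [];
    return !$includeTypes || in_array($ext, $includeTypes, true);
}
```

For example, with `['skip_dirs' => ['/admin/'], 'skip_types' => ['png', 'pdf']]`, a URL such as `http://example.com/admin/login.php` would be rejected, while `http://example.com/blog/post.php` would be accepted.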
The main function of the system fetches a page and parses its tags. The code is shown below.
Listing 1: listing-1.php
The function crawls and parses a single page. To crawl the entire site, it is called inside a loop that follows every link found. The output of the function is an array that contains the page's URL, title, keywords, description, MD5 hash and links. This info is then saved in a database.
A web crawler is the backbone of many useful applications. For example, using only the URL, title, keywords and description, you can build these interesting systems:
Any suggestions and comments for improvement are welcome.
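As a rough illustration of the parse-then-follow idea described above, here is a minimal sketch. It is not the code from Listing 1: the names `parse_page` and `crawl_site` are hypothetical, the MD5 is assumed to be a hash of the raw HTML, and the regular expressions are a simplification that will miss some real-world markup. Only same-host absolute links and root-relative links are followed; other link forms are ignored here.

```php
<?php
// Illustrative sketch only; names are hypothetical, not the original Listing 1.

// Parse one page's HTML into url, title, keywords, description, md5 and links.
function parse_page(string $url, string $html): array
{
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $title = trim($m[1]);
    }

    // Collect the keywords and description meta tags, if present.
    $meta = ['keywords' => '', 'description' => ''];
    preg_match_all('/<meta\s[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (preg_match('/\sname\s*=\s*(["\'])(.*?)\1/is', $tag, $n)
            && preg_match('/\scontent\s*=\s*(["\'])(.*?)\1/is', $tag, $c)) {
            $name = strtolower($n[2]);
            if (isset($meta[$name])) {
                $meta[$name] = trim($c[2]);
            }
        }
    }

    // Collect href values from anchor tags, dropping empty and fragment-only ones.
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $a);
    $links = [];
    foreach ($a[2] as $href) {
        $href = trim($href);
        if ($href !== '' && $href[0] !== '#') {
            $links[] = $href;
        }
    }

    return [
        'url'         => $url,
        'title'       => $title,
        'keywords'    => $meta['keywords'],
        'description' => $meta['description'],
        'md5'         => md5($html), // assumption: hash of the raw page content
        'links'       => array_values(array_unique($links)),
    ];
}

// Breadth-first crawl of one site. $fetch is any callable that takes a URL
// and returns the HTML as a string, or a non-string value on failure.
function crawl_site(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $scheme = parse_url($startUrl, PHP_URL_SCHEME);
    $host   = parse_url($startUrl, PHP_URL_HOST);
    $queue  = [$startUrl];
    $seen   = [$startUrl => true];
    $pages  = [];

    while ($queue && count($pages) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // fetch failed, move on to the next URL
        }
        $page    = parse_page($url, $html);
        $pages[] = $page;

        foreach ($page['links'] as $href) {
            if (strpos($href, '//') === 0) {
                continue; // protocol-relative links are ignored in this sketch
            }
            if ($href[0] === '/') {
                $next = "$scheme://$host$href"; // root-relative link
            } elseif (parse_url($href, PHP_URL_HOST) === $host) {
                $next = $href; // absolute link on the same host
            } else {
                continue; // external, relative-path, mailto:, etc.
            }
            $next = explode('#', $next, 2)[0]; // drop any fragment
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $pages;
}
```

Passing the fetcher as a callable keeps the loop independent of how pages are downloaded. One possible fetcher is `function ($url) { return @file_get_contents($url); }`, which needs `allow_url_fopen` enabled; a cURL-based one would work as well. Each element of the returned array can then be written to the database.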