Page Indexing - Get The Title and Meta Tags From All of Your Pages

Nov 7, 2008 | Tags: PHP, HTTP, Regex

I'm working on some projects to help webmasters maintain their sites. The tools include a link checker, a del.icio.us auto submitter, a W3C HTML validator auto submitter, and many more. One component I need for all of them is a web crawler (spider), which builds a page index that the higher-level components use to perform their tasks.

I've searched the Web for such a system but could not find one that suits my needs. Many of the systems I found crawl only a single page, while I need a crawler that covers the entire site and offers extra options, such as directories to include or skip and file types to include or skip.
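As a rough illustration of the include/skip options mentioned above, a filter along these lines could decide whether a URL gets crawled. The function name should_crawl() and the option keys are made up for this sketch; they are not part of the crawler presented in this post.

```php
<?php
/**
 * Decide whether a URL should be crawled (illustrative sketch).
 *
 * $options keys (hypothetical names, all optional):
 *   'include_dirs' => path prefixes to crawl, e.g. array('/blog/')
 *   'skip_dirs'    => path prefixes to ignore, e.g. array('/admin/')
 *   'skip_ext'     => lowercase file extensions to ignore, e.g. array('jpg', 'pdf')
 */
function should_crawl($url, array $options = array())
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/';
    }

    // Skip unwanted file types; the extension is lowercased before comparing.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext !== '' && !empty($options['skip_ext'])
        && in_array($ext, $options['skip_ext'], true)) {
        return false;
    }

    // Skip excluded directories.
    if (!empty($options['skip_dirs'])) {
        foreach ($options['skip_dirs'] as $dir) {
            if (strpos($path, $dir) === 0) {
                return false;
            }
        }
    }

    // If an include list is given, the path must fall under one of its entries.
    if (!empty($options['include_dirs'])) {
        foreach ($options['include_dirs'] as $dir) {
            if (strpos($path, $dir) === 0) {
                return true;
            }
        }
        return false;
    }

    return true;
}
```

Skip rules are checked before include rules here, so a skipped subdirectory inside an included directory stays excluded. That ordering is a design choice of this sketch.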

So I decided to write my own web crawler. The system is designed to crawl all pages of a given site and parse the title, keywords and description tags from each page. Each page's info and its URL are then saved in a database for future use.
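One possible way to store the index is sketched here, under the assumption of SQLite accessed through PDO. The table name pages, its columns, and the helper names open_index() and save_page() are hypothetical. They simply mirror the fields the crawler collects, plus the MD5 hash of the page content that the listing in this post also records.

```php
<?php
/**
 * Open (or create) the page index. Assumption: SQLite via PDO;
 * the default DSN uses an in-memory database for demonstration only.
 */
function open_index($dsn = 'sqlite::memory:')
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec(
        'CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            keywords    TEXT,
            description TEXT,
            md5         TEXT
        )'
    );
    return $db;
}

/**
 * Insert or update one page record. $info is expected to carry the keys
 * url, title, keywords, description and md5.
 */
function save_page(PDO $db, array $info)
{
    // REPLACE overwrites the existing row when the same URL is crawled again.
    $stmt = $db->prepare(
        'REPLACE INTO pages (url, title, keywords, description, md5)
         VALUES (:url, :title, :keywords, :description, :md5)'
    );
    $stmt->execute(array(
        ':url'         => $info['url'],
        ':title'       => $info['title'],
        ':keywords'    => $info['keywords'],
        ':description' => $info['description'],
        ':md5'         => $info['md5'],
    ));
}
```

Keeping the MD5 hash next to each URL makes it possible to tell on a later crawl whether a page has changed since it was last indexed.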

The main function of the system fetches a page and parses its tags. The code is shown below.

Listing 1: listing-1.php

    <?php
    /**
     * Fetch a page and extract its title, meta keywords, meta description
     * and links. Returns false if the page cannot be fetched.
     */
    function fetch_and_parse_page($url)
    {
        $html = @file_get_contents($url);
        if ($html === false) {
            return false;
        }

        /* get page's title (empty string if the tag is missing) */
        $title = preg_match("/<title>(.+)<\/title>/siU", $html, $matches)
            ? trim($matches[1]) : "";

        /* get page's keywords */
        $re = "<meta\s+name=['\"]??keywords['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
        $keywords = preg_match("/$re/siU", $html, $matches) ? $matches[1] : "";

        /* get page's description */
        $re = "<meta\s+name=['\"]??description['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
        $desc = preg_match("/$re/siU", $html, $matches) ? $matches[1] : "";

        /* parse links: group 2 holds the href value of every anchor */
        $re = "<a\s[^>]*href\s*=\s*(['\"]??)([^'\">]*?)\\1[^>]*>(.*)<\/a>";
        preg_match_all("/$re/siU", $html, $matches);
        $links = $matches[2];

        return array(
            "url"         => $url,
            "title"       => $title,
            "keywords"    => $keywords,
            "description" => $desc,
            "md5"         => md5($html),
            "links"       => array_unique($links)
        );
    }
    ?>

The function crawls and parses a single page. To crawl the entire site, it is called inside a loop that follows every link it finds. Its output is an array containing the URL, title, keywords, description, MD5 hash of the page, and links. This info is then saved in a database.
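The loop described above could look roughly like the following. This is only a sketch: the names resolve_link() and crawl_site(), the $max_pages limit, and the rule of staying on the starting host are assumptions of mine. The fetch function is passed in as a callable, so the loop can be tried with a stub instead of real HTTP requests.

```php
<?php
/**
 * Turn a link found on page $base into an absolute URL (simplified:
 * no "../" handling). Returns null for links that should not be followed.
 */
function resolve_link($base, $link)
{
    $link = preg_replace('/#.*$/s', '', trim($link));   // drop #fragment
    if ($link === '') {
        return null;
    }
    // Links that carry a scheme: keep http/https, ignore mailto:, javascript:, etc.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $link)) {
        return preg_match('/^https?:\/\//i', $link) ? $link : null;
    }
    if (strpos($link, '//') === 0) {
        return null;   // protocol-relative links are not handled in this sketch
    }
    $parts = parse_url($base);
    if (empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($link[0] === '/') {
        return $root . $link;   // root-relative link
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;   // relative to the current directory
}

/**
 * Breadth-first crawl starting at $start_url. $fetch is any callable that
 * returns an array with a "links" key (or a non-array on failure).
 */
function crawl_site($start_url, $fetch, $max_pages = 100)
{
    $host  = parse_url($start_url, PHP_URL_HOST);
    $queue = array($start_url);
    $seen  = array($start_url => true);   // every URL ever queued
    $index = array();

    while ($queue && count($index) < $max_pages) {
        $url  = array_shift($queue);
        $info = call_user_func($fetch, $url);
        if (!is_array($info)) {
            continue;   // fetch failed, move on to the next URL
        }
        $index[$url] = $info;

        foreach ($info['links'] as $link) {
            $next = resolve_link($url, $link);
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue;   // stay on the same site
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }
    return $index;
}
```

With Listing 1 loaded, the starting URL would go in a call such as crawl_site('http://www.example.com/', 'fetch_and_parse_page'), where the address is a placeholder for your own site.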

A web crawler is the backbone of many useful applications. For example, using only the URL, title, keywords and description, you can build these interesting systems:

  • Generate RSS feed from online pages.
  • Validate the pages using W3C HTML Validator.
  • Auto submitter to del.icio.us.
  • Check for broken links.
  • And many more.
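To give an idea of the first item in the list, here is a minimal sketch that turns page records into an RSS 2.0 feed. The function name build_rss() and its parameters are hypothetical; the records are assumed to have the url, title and description keys produced by the crawler.

```php
<?php
/**
 * Build an RSS 2.0 document from page records (illustrative sketch).
 * The channel title, link and description are supplied by the caller.
 */
function build_rss(array $pages, $channel_title, $channel_link, $channel_desc)
{
    // Escape text for use inside XML elements.
    $esc = function ($text) {
        return htmlspecialchars((string) $text, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0">' . "\n<channel>\n";
    $xml .= '<title>' . $esc($channel_title) . "</title>\n";
    $xml .= '<link>' . $esc($channel_link) . "</link>\n";
    $xml .= '<description>' . $esc($channel_desc) . "</description>\n";

    foreach ($pages as $page) {
        // Fall back to the URL when a page has no title.
        $title = $page['title'] !== '' ? $page['title'] : $page['url'];
        $xml .= "<item>\n";
        $xml .= '<title>' . $esc($title) . "</title>\n";
        $xml .= '<link>' . $esc($page['url']) . "</link>\n";
        $xml .= '<description>' . $esc($page['description']) . "</description>\n";
        $xml .= "</item>\n";
    }

    $xml .= "</channel>\n</rss>\n";
    return $xml;
}
```

The output is built with plain string concatenation to keep the sketch short; an XML library would be a reasonable alternative for anything larger.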

Any suggestions and comments for improvement are welcome.

6 Comments

Amit Cohen on Dec 14, 2008:

Ok, nice, I can see the benefits. But since I'm not a PHP expert, my Q is: where do you define/put the URL for the site/page that you want to fetch info from? Copying and pasting the code only gives you a blank page. Amit

Payne on Feb 24, 2009:

Nice, Very nice and useful. Thanks

Rabin on Mar 20, 2009:

good work

Hemachandran on Jun 2, 2009:

Great work.... Thanks

منتد on Jul 11, 2009:

coooooooooool man

it's really nice

thank you

Laura on Oct 29, 2009:

Nice post on web crawlers, simple and to the point. For simple stuff I use Python to web crawl, but for larger projects I used extractingdata.com (http://www.extractingdata.com/website%20crawler.htm), which worked great; they build custom web crawlers and data extracting programs.

About the Author

Cool PHP programmer writing cool PHP scripts. Feel free to contact:
Tel. +62 31 8662872
+62 856 338 6017
ICQ 489571630
Skype dede_bl4ckheart
Yahoo dede_bl4ckheart
Google nashruddin.amin