Create a PHP web crawler or scraper in 5 minutes

19 Jan

Using PHP, we show you how to build an easily extensible web crawler in under 5 minutes that collects images and links from a page.

The Crawler Framework

First we need to create the crawler class as follows:

<?php
class Crawler {

}

?>

Next we add methods to fetch a web page's markup and to parse it for the data we want to collect. The only public methods are getMarkup() and get(). The parsing methods are meant for internal use by the crawler, but their visibility is set to protected rather than private, since you never know who will want to extend its functionality.

<?php
class Crawler {

    // Raw markup of the fetched page.
    protected $markup = '';

    public function __construct($uri) {

    }

    public function getMarkup($uri) {

    }

    public function get($type) {

    }

    protected function _get_images() {

    }

    protected function _get_links() {

    }
}

?>

Fetching Site Markup

The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca'); it then sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup.

<?php
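public function __construct($uri) {
    // Fetch the page once and keep its markup for the parsing methods.
    $this->markup = $this->getMarkup($uri);
}

public function getMarkup($uri) {
    // Returns the page contents as a string, or FALSE on failure.
    return file_get_contents($uri);
}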
?>
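
file_get_contents() returns FALSE and emits a warning when a request fails, so the constructor above can end up with an empty $markup without saying why. As a minimal sketch, the fetch might be wrapped as follows. The function name fetchMarkup(), the 10 second timeout, and the user-agent string are illustrative assumptions and not part of the original class.

```php
<?php
/**
 * Hypothetical helper (not part of the original Crawler class):
 * fetch a URI and return its contents, or FALSE on failure.
 */
function fetchMarkup($uri, $timeout = 10) {
    // The 'http' options only apply to http:// and https:// URIs;
    // other wrappers (such as local files) ignore them.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeout,
            'user_agent' => 'ExampleCrawler/0.1', // assumed value
        ),
    ));

    // Suppress the warning and rely on the return value instead.
    $markup = @file_get_contents($uri, false, $context);

    // Treat an empty response the same as a failed request.
    return ($markup === false || $markup === '') ? false : $markup;
}
```

Inside the class, getMarkup() could delegate to a helper like this, and the parsing methods would then see FALSE instead of a partially fetched page.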

Crawling The Markup For Data

Our get() method accepts a $type string, which is used to invoke the method that does the actual processing. As you can see below, we construct the method name as a string and check that it exists before calling it, so developers can use the crawler simply by invoking $crawl->get('images');

We set visibility for _get_images() and _get_links() to protected so that developers will use our public get() method rather than getting confused and trying to invoke them directly.

Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to return all tags within the markup that match our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Each method returns the text captured by the first group, that is, the attribute string of every matched tag. Note that the image pattern only matches self-closing tags of the form <img ... />. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression

<?php
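public function get($type) {
    // Build the method name, e.g. 'images' becomes '_get_images'.
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
        // call_user_method() was removed in PHP 7; call the method directly.
        return $this->$method();
    }
}

protected function _get_images() {
    if (!empty($this->markup)) {
        // Capture the attribute string of every self-closing <img ... /> tag.
        preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
        return !empty($images[1]) ? $images[1] : FALSE;
    }
}

protected function _get_links() {
    if (!empty($this->markup)) {
        // Capture the attribute string of every <a ...>...</a> element.
        preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
        return !empty($links[1]) ? $links[1] : FALSE;
    }
}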
?>
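
To make it concrete what these patterns capture, here is a small sketch that runs them against a hard-coded string. The sample markup and the extractAttribute() helper are assumptions made up for this illustration; the two patterns are the ones used above.

```php
<?php
// Sample markup, invented for this illustration.
$markup = '<p><a href="/about" class="nav">About</a> '
        . '<img src="/logo.png" alt="Logo" /> '
        . '<img src="/plain.png"></p>';

// Same patterns as in the article.
preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

/**
 * Hypothetical helper: pull one double-quoted attribute value out of
 * a captured attribute string, or return NULL if it is absent.
 */
function extractAttribute($attributes, $name) {
    $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*"([^"]*)"/i';
    return preg_match($pattern, $attributes, $m) ? $m[1] : null;
}

// $images[1] holds the attribute text of each matched tag.
foreach ($images[1] as $attributes) {
    echo 'image: ', extractAttribute($attributes, 'src'), "\n";
}

// $links[1] holds the attributes, $links[2] the link text.
foreach ($links[1] as $i => $attributes) {
    echo 'link: ', extractAttribute($attributes, 'href'),
         ' (', $links[2][$i], ")\n";
}
```

Only the first image is reported, because the second <img> tag is not self-closing and so does not match the image pattern. The captured values are raw attribute strings, which is why a second step is needed to get at the src or href itself.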

Final PHP Web Crawler Code And Usage

<?php
class Crawler {

    // Raw markup of the fetched page.
    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        // Build the method name, e.g. 'images' becomes '_get_images'.
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            // call_user_method() was removed in PHP 7; call the method directly.
            return $this->$method();
        }
    }

    protected function _get_images() {
        if (!empty($this->markup)) {
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)) {
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}

// Usage: fetch a page, then collect its images and links.
$crawl = new Crawler('http://vision-media.ca');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
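
Because get() resolves the "_get_{$type}" method name at runtime, a subclass can add a new data type just by defining another protected method. The sketch below shows one way this might look. HeadingCrawler, _get_headings(), and the canned markup are hypothetical names and data chosen for illustration, and a condensed copy of the Crawler class is included only so the example runs on its own.

```php
<?php
// Condensed copy of the article's class so this sketch is self-contained.
class Crawler {
    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            return $this->$method();
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)) {
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}

// Hypothetical subclass: adds a "headings" type without touching get().
class HeadingCrawler extends Crawler {
    // Overridden only to keep the example offline; returns canned markup.
    public function getMarkup($uri) {
        return '<h1>Welcome</h1><a href="/docs">Docs</a><h2>News</h2>';
    }

    protected function _get_headings() {
        if (!empty($this->markup)) {
            preg_match_all('/<h[1-6][^>]*>(.*?)<\/h[1-6]>/i', $this->markup, $headings);
            return !empty($headings[1]) ? $headings[1] : FALSE;
        }
    }
}

$crawl = new HeadingCrawler('http://example.com/'); // URI is ignored by the stub
print_r($crawl->get('headings')); // heading texts: Welcome, News
print_r($crawl->get('links'));    // the inherited type still works
var_dump($crawl->get('videos'));  // unknown type: get() returns NULL
```

An unknown type falls through get() and yields NULL, while a known type with no matches yields FALSE, so callers that care about the difference should compare with === rather than a loose truthiness check.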


3 Responses to “Create a PHP web crawler or scraper in 5 minutes”

  1. moneygiay February 10, 2009 at 9:15 am #

    Thank you sharing!

  2. Mehul May 16, 2009 at 9:33 am #

    nice code

  3. jhon December 7, 2011 at 10:50 am #

    Tht was great read, i have came across something similar.. just sharing

    http://www.linuxreaders.com/2011/12/07/php-simple-email-crawler/
