
Create a PHP web crawler or scraper in 5 minutes

19 Jan

Using PHP, we show you how to create an easily extendable web crawler in under 5 minutes that collects images and links.

The Crawler Framework

First we need to create the crawler class as follows:

<?php
class Crawler {

}

?>

We will then create methods to fetch the web page's markup and to parse it for the data we want to collect. The only public methods will be getMarkup() and get(). The parsing methods are meant for internal use by the crawler, but their visibility is set to protected rather than private, since you never know who will want to extend its functionality.

<?php
class Crawler {

  // Raw markup of the fetched page.
  protected $markup = '';

  public function __construct($uri) {

  }

  public function getMarkup($uri) {

  }

  public function get($type) {

  }

  protected function _get_images() {

  }

  protected function _get_links() {

  }
}

?>

Fetching Site Markup

The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca'); it then sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup.

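<?php
public function __construct($uri) {
  // Fetch the page once and keep its markup for the parsing methods.
  $this->markup = $this->getMarkup($uri);
}

public function getMarkup($uri) {
  // Returns the markup as a string, or FALSE if the fetch fails.
  return file_get_contents($uri);
}
?>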
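file_get_contents() returns FALSE and raises a warning when the request fails, so it can help to wrap the fetch. The following is only a sketch: the fetch_markup() helper, its $timeout parameter, and the user agent string are assumptions made for illustration and are not part of the crawler class.

```php
<?php
/**
 * Hypothetical helper (not part of the original class): fetch markup with a
 * timeout and return FALSE quietly instead of emitting a warning on failure.
 */
function fetch_markup($uri, $timeout = 10) {
  // The 'http' options only take effect for http:// and https:// URIs.
  $context = stream_context_create(array(
    'http' => array(
      'timeout'    => $timeout,             // seconds
      'user_agent' => 'ExampleCrawler/0.1', // made-up name for illustration
    ),
  ));

  // @ suppresses the warning; the FALSE return value still signals failure.
  return @file_get_contents($uri, false, $context);
}
```

Calling fetch_markup('http://vision-media.ca') would then return either the page markup or FALSE, which the caller can test with === FALSE before parsing.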

Crawling The Markup For Data

Our get() method accepts a $type string, which is used to invoke the method that does the actual processing. As you can see below, we construct the method name as a string and check that the method exists before calling it, so developers can simply invoke $crawl->get('images');

We set visibility for _get_images() and _get_links() to protected so that developers will use our public get() method rather than getting confused and trying to invoke them directly.

Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to return all tags within the markup that match our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Note that the image pattern only matches self-closing tags written as <img ... />, and that each method returns the captured attribute strings rather than parsed URLs. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression

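<?php
public function get($type) {
  // Build the method name from the requested type, e.g. "_get_images".
  $method = "_get_{$type}";
  if (method_exists($this, $method)) {
    // call_user_method() was removed in PHP 7; call the method directly.
    return $this->$method();
  }
}

protected function _get_images() {
  if (!empty($this->markup)) {
    // $images[1] holds the attribute string of every matched <img ... /> tag.
    preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
    return !empty($images[1]) ? $images[1] : FALSE;
  }
}

protected function _get_links() {
  if (!empty($this->markup)) {
    // $links[1] holds the attribute string of every matched <a ...> tag.
    preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
    return !empty($links[1]) ? $links[1] : FALSE;
  }
}
?>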
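To see what the two patterns actually capture, here is a small sketch that runs them against a hand-written sample string. The sample markup and the variable names are made up for illustration; only the patterns come from the crawler.

```php
<?php
// Sample markup, invented for this illustration.
$markup = '<p><img src="a.png" alt="A" /> <a href="/about">About</a> <img src="b.png"></p>';

// The same patterns the crawler methods use.
preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

// Index 1 is the first capture group: the raw attribute string of each tag.
$imageAttributes = array_map('trim', $images[1]);
$linkAttributes  = array_map('trim', $links[1]);

// Index 2 of the link match is the second capture group: the link text.
$linkTexts = $links[2];

// Only the self-closing image is captured; <img src="b.png"> is skipped.
print_r($imageAttributes);
print_r($linkAttributes);
print_r($linkTexts);
```

The second image is missed because the image pattern requires the tag to end in />, so markup that does not close its img tags that way would need a looser pattern.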

Final PHP Web Crawler Code And Usage

<?php
class Crawler {

  // Raw markup of the fetched page.
  protected $markup = '';

  public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
  }

  public function getMarkup($uri) {
    // Returns the markup as a string, or FALSE if the fetch fails.
    return file_get_contents($uri);
  }

  public function get($type) {
    // Build the method name from the requested type, e.g. "_get_images".
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
      // call_user_method() was removed in PHP 7; call the method directly.
      return $this->$method();
    }
  }

  protected function _get_images() {
    if (!empty($this->markup)) {
      preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
      return !empty($images[1]) ? $images[1] : FALSE;
    }
  }

  protected function _get_links() {
    if (!empty($this->markup)) {
      preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
      return !empty($links[1]) ? $links[1] : FALSE;
    }
  }
}

$crawl = new Crawler('http://vision-media.ca');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
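Because get() looks up its worker method by name, a subclass can add a new type without touching the base class. The sketch below is one possible extension, not part of the original code: the HrefCrawler class, its _get_hrefs() method, and the href pattern are all hypothetical. A trimmed copy of Crawler is repeated only so the example runs on its own.

```php
<?php
// Trimmed copy of the Crawler class, repeated so this sketch is self-contained.
class Crawler {
  protected $markup = '';

  public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
  }

  public function getMarkup($uri) {
    return file_get_contents($uri);
  }

  public function get($type) {
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
      return $this->$method();
    }
  }
}

// Hypothetical subclass: adds an "hrefs" type that returns link URLs.
class HrefCrawler extends Crawler {
  protected function _get_hrefs() {
    if (!empty($this->markup)) {
      // Capture the quoted href value of each <a> tag (a simple sketch,
      // not a full HTML parser; unquoted values are not handled).
      preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\']/i', $this->markup, $hrefs);
      return !empty($hrefs[1]) ? $hrefs[1] : FALSE;
    }
  }
}
```

With this in place, creating $crawl = new HrefCrawler('http://vision-media.ca'); and calling $crawl->get('hrefs') would return an array of href values, or FALSE if none were found.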
