How to write a simple scraper in PHP without Regex
howto, parsing, Util June 15th, 2010
Web scrapers are simple programs that extract specific data from web pages. Usually the structure of the pages is known in advance, so scrapers are less complex than general-purpose parsers and crawlers.
In this tutorial we are going to create a simple scraper that extracts the title and favicon from an HTML page.
Scrapers are usually based on regular expressions, but we are going to avoid them because they are difficult to maintain and sometimes produce unexpected results. We are going to use simple PHP string functions instead.
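As a minimal sketch of the idea, assuming a hard-coded `$html` sample instead of a downloaded page, a title can be cut out with nothing more than `strpos` and `substr`:

```php
<?php
// Sample markup standing in for a downloaded page (an assumption for this sketch).
$html = '<html><title>Page Title | Site Name</title><body><h1>Title</h1></body></html>';

$open  = '<title>';
$close = '</title>';

$title = null;
// Find where the opening tag starts.
$start = strpos($html, $open);
if ($start !== false) {
    // The content begins right after the opening tag.
    $contentStart = $start + strlen($open);
    // Find the closing tag, searching only after the opening one.
    $end = strpos($html, $close, $contentStart);
    if ($end !== false) {
        $title = substr($html, $contentStart, $end - $contentStart);
    }
}

echo $title, "\n"; // prints: Page Title | Site Name
```

Note the strict `!== false` comparisons: `strpos` may legitimately return position 0.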
Let’s assume all the pages we are scraping have the following structure.
<html>
<title>Page Title | Site Name</title>
<link rel="shortcut icon" href="http://www.site.com/favicon.ico"/>
...
<body>
...
<h1>Title</h1>
...
</body>
</html>
We are going to retrieve the title and the favicon from the head of the page, and if the title is not populated we are going to look for it in the first h1 inside the body. These operations can be covered using only 2 simple functions:
function strInBetween($text,
                      $separatorStart,
                      $separatorEnd,
                      &$positionOut = null,
                      $after = null,
                      $afterIndex = 0)
{
    // Treat a missing page or block as an empty string.
    $text = (string) $text;
    // If $after is given and found, start the search at its position.
    if ($after !== null)
    {
        $pos = strpos($text, $after);
        if ($pos !== false)
            $afterIndex = $pos;
    }
    // Locate the opening delimiter.
    $start = strpos($text, $separatorStart, $afterIndex);
    if ($start === false)
    {
        $positionOut = -1;
        return null;
    }
    // Locate the closing delimiter, searching only after the opening one.
    $contentStart = $start + strlen($separatorStart);
    $end = strpos($text, $separatorEnd, $contentStart);
    if ($end === false)
    {
        $positionOut = -1;
        return null;
    }
    $positionOut = $start;
    return substr($text, $contentStart, $end - $contentStart);
}
The previous function returns the first substring of the input $text that lies between 2 string delimiters, $separatorStart and $separatorEnd, or null if there is no such substring. Optionally, another substring or an index can be specified so that the search starts only after it:
- $text - the text to search
- $separatorStart - the string delimiter on the left
- $separatorEnd - the string delimiter on the right
- &$positionOut - optional output parameter; receives the index of the opening delimiter, or -1 if nothing was found
- $after - optional parameter; if specified, the result is searched only after $after is encountered
- $afterIndex - optional parameter used if $after is null; it indicates the index where the search should start
function strInBetweenContaining($text,
                                $separatorStart,
                                $containing,
                                $separatorEnd,
                                &$pos = null,
                                $after = null,
                                $afterIndex = 0)
{
    // Safety cap on the number of candidate substrings examined.
    $CYCLE_LIMIT = 500;
    $current = strInBetween($text, $separatorStart, $separatorEnd, $pos, $after, $afterIndex);
    $i = 0;
    while ($current !== null && strpos($current, $containing) === false && $i < $CYCLE_LIMIT)
    {
        // Resume right after the candidate that did not match.
        $afterIndex = $pos + strlen($separatorStart) + strlen($current);
        $current = strInBetween($text, $separatorStart, $separatorEnd, $pos, null, $afterIndex);
        $i++;
    }
    // Compare with false explicitly: a match at position 0 is still a match.
    if ($current !== null && strpos($current, $containing) !== false)
        return $current;
    return null;
}
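strInBetweenContaining is based on the first function. It returns the delimited substring described above only if that substring contains the specified string $containing; otherwise it moves on to the next delimited substring, giving up after 500 candidates.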
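One detail worth keeping in mind for the "containing" check: `strpos` returns the integer 0 when the needle sits at the very start of the string, and 0 is falsy in PHP. A small sketch (the helper name `containsText` and the sample string are made up for this illustration) shows why the result should be compared with `!== false`:

```php
<?php
// Hypothetical helper for this illustration only: true when $needle occurs in $haystack.
function containsText($haystack, $needle)
{
    // strpos returns an int position (possibly 0) or false, so compare strictly.
    return strpos($haystack, $needle) !== false;
}

$block = 'rel="shortcut icon" href="/favicon.ico"';

// The needle is at position 0, so a loose truthiness test misses it.
$loose  = strpos($block, 'rel="shortcut icon"') ? 'found' : 'missed';
$strict = containsText($block, 'rel="shortcut icon"') ? 'found' : 'missed';

echo "loose: $loose, strict: $strict\n"; // prints: loose: missed, strict: found
```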
Here is the scraping section:
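require_once('util/getpage.php');
require_once('util/stringutils.php');

$url = $_GET['u'];
$page = getpage($url);

function getTitle($page) { return strInBetween($page, '<title>', '</title>'); }

// '<h1' is left open so that h1 tags with attributes are matched too.
function getH1($page) { return strInBetween($page, '<h1', '</h1>'); }

function getLink($page)
{
    $block = strInBetweenContaining($page, '<link', 'rel="shortcut icon"', '/>');
    if ($block === null)
        return null;
    return strInBetween($block, 'href="', '"');
}

echo getTitle($page) . '<br>';
// Re-add the '<h1' prefix so that strip_tags removes the rest of the opening tag.
echo strip_tags('<h1' . getH1($page)) . '<br>';

$favicon = getLink($page);
if ($favicon == null)
{
    // Default location by convention: favicon.ico in the server root.
    $parts = parse_url($url);
    $favicon = $parts['scheme'] . '://' . $parts['host'] . '/favicon.ico';
}
// Escape the URL before placing it inside the attribute.
echo '<img src="' . htmlspecialchars($favicon) . '" />';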
The code is pretty much self-explanatory. The title and the first h1 tag are extracted, then the favicon. If no favicon is found, the code falls back to the default one (by convention, "favicon.ico" in the server root).
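The fallback to the server root can be sketched as follows. The helper name `defaultFaviconUrl` and the sample URL are assumptions for illustration; the sketch relies on PHP's `parse_url` to keep only the scheme, host and optional port, so the result points at the server root even when the page URL contains a path or a query string:

```php
<?php
// Hypothetical helper: builds the conventional /favicon.ico URL for a page URL.
function defaultFaviconUrl($url)
{
    $parts = parse_url($url);
    // Without a scheme and a host there is no server root to point at.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root . '/favicon.ico';
}

echo defaultFaviconUrl('http://www.site.com/blog/post.html?id=3'), "\n";
// prints: http://www.site.com/favicon.ico
```

Returning null for URLs without a scheme or host lets the caller decide what to show when no sensible default exists.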
This example demonstrates how to retrieve a couple of simple values using a simple algorithm. It can be reused and extended to scrape other data from web pages. A regular expression parser would be a more flexible solution, but it requires good regex knowledge.
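As one possible extension, the same delimiter search can be repeated to collect every match instead of only the first one. The function name `strAllInBetween` and the sample markup are assumptions for this sketch, not part of the utilities above:

```php
<?php
// Hypothetical helper: returns every substring found between the two delimiters.
function strAllInBetween($text, $separatorStart, $separatorEnd)
{
    $results = array();
    $offset = 0;
    while (($start = strpos($text, $separatorStart, $offset)) !== false) {
        $contentStart = $start + strlen($separatorStart);
        $end = strpos($text, $separatorEnd, $contentStart);
        if ($end === false) {
            break; // unterminated block: stop searching
        }
        $results[] = substr($text, $contentStart, $end - $contentStart);
        // Continue after the closing delimiter of this match.
        $offset = $end + strlen($separatorEnd);
    }
    return $results;
}

$html = '<ul><li>First</li><li>Second</li><li>Third</li></ul>';
print_r(strAllInBetween($html, '<li>', '</li>'));
// the array holds First, Second and Third, in that order
```

Like the single-match version, this only works when the delimiters appear exactly as written in the page, so it suits pages whose structure is known in advance.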
If you have questions, requests or new ideas, just use the comments section to submit them.




June 16th, 2010 at 2:59 am
It is true that regex (regular expressions) are difficult. But once you understand them, they save you most of the work. You can learn PHP regular expressions from the basics here: http://www.99tutes.com/content/learn-reular-expressions.html
June 17th, 2010 at 6:56 am
Nice explanation. But still the strpos matching is considerably more work than a simple regex.
June 17th, 2010 at 1:21 pm
Seems to me that you’re just doing the same thing a regex could do, but with more work and less efficiently.
preg_match(‘/(.*)/imU’, $html, $result);
June 17th, 2010 at 1:22 pm
damn thing stripped my code, why can’t the blog post system use simple things like htmlspecialchars!? ah well.
June 17th, 2010 at 3:00 pm
Or you could just use the PHP Simple HTML DOM Parser instead. Much easier: http://simplehtmldom.sourceforge.net/