How to write a simple scraper in PHP without Regex

Web scrapers are simple programs used to extract specific data from web pages. The structure of the pages is usually known in advance, so scrapers are less complex than general-purpose parsers and crawlers.

In this tutorial we are going to create a simple scraper that extracts the title and the favicon from any HTML page.

Scrapers are usually based on regular expressions, but we are going to avoid them because they are difficult to maintain and sometimes produce unexpected results. We are going to use simple PHP string functions instead.

Let’s assume all the pages we are scraping have the following structure:

<html>
<title>Page Title | Site Name</title>
<link rel="shortcut icon" href="http://www.site.com/favicon.ico"/> 
...
<body>
...
<h1>Title</h1>
...
</body>

</html>

We are going to retrieve the title and the favicon from the top of the page, and in case the title is not populated we are also going to look for a title inside the body (the first h1 tag). These operations can be covered using only 2 simple functions:

function strInBetween($text,
                               $separatorStart,
                               $separatorEnd,
                               &$positionOut = null,
                               $after = null,
                               $afterIndex = 0)
{
	// if $after is given, start the search at its first occurrence
	if ($after !== null)
	{
		$pos = strpos($text, $after);
		if ($pos !== false)
			$afterIndex = $pos;
	}

	// -1 signals "not found" until a complete match is located
	$positionOut = -1;

	// strpos() rejects an offset that lies past the end of the text
	if ($afterIndex > strlen($text))
		return null;

	$start = strpos($text, $separatorStart, $afterIndex);
	if ($start === false)
		return null;

	$contentStart = $start + strlen($separatorStart);
	$end = strpos($text, $separatorEnd, $contentStart);
	if ($end === false)
		return null;

	// report where the start delimiter was found
	$positionOut = $start;

	return substr($text, $contentStart, $end - $contentStart);
}

The function above returns the first substring of the input $text that is enclosed by 2 string delimiters: $separatorStart and $separatorEnd. If no such substring exists it returns null. Optionally, another substring or an index can be specified so that the search starts only after it:

  • $text - the text in which to search for the substring
  • $separatorStart - the string delimiter on the left
  • $separatorEnd - the string delimiter on the right
  • &$positionOut - optional output parameter; it receives the position where the match was found (-1 when nothing is found)
  • $after - optional parameter; if specified, the result is searched for only after $after is encountered
  • $afterIndex - optional parameter used if $after is null. It indicates the index where the search should start.
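
The three ways of choosing where the search starts can be illustrated with a short sketch. To keep the snippet runnable on its own it defines between(), a hypothetical compact stand-in written from the parameter list above; the name and the sample markup are assumptions made for this example only.

```php
<?php
// Hypothetical compact stand-in for strInBetween(), written from the
// parameter list above so that this snippet runs on its own.
function between($text, $startSep, $endSep, &$positionOut = null, $after = null, $afterIndex = 0)
{
    $positionOut = -1;
    if ($after !== null) {
        $found = strpos($text, $after);
        if ($found !== false) {
            $afterIndex = $found;
        }
    }
    if ($afterIndex > strlen($text)) {
        return null;
    }
    $start = strpos($text, $startSep, $afterIndex);
    if ($start === false) {
        return null;
    }
    $contentStart = $start + strlen($startSep);
    $end = strpos($text, $endSep, $contentStart);
    if ($end === false) {
        return null;
    }
    $positionOut = $start;
    return substr($text, $contentStart, $end - $contentStart);
}

// Sample markup, invented for this sketch.
$html = '<p>intro</p><div id="main"><p>first</p><p>second</p></div>';

// 1. Plain search: the first <p>...</p> in the text.
$a = between($html, '<p>', '</p>', $posA);

// 2. Search only after a marker string.
$b = between($html, '<p>', '</p>', $posB, '<div id="main">');

// 3. Search from a numeric index: continue just past the previous match.
$c = between($html, '<p>', '</p>', $posC, null, $posB + 1);

echo "$a\n$b\n$c\n"; // prints intro, first, second on separate lines
```

Feeding the reported position back in as $afterIndex, as in the third call, is one way to walk through every match in turn.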

function strInBetweenContaining($text,
                               $separatorStart,
                               $containing,
                               $separatorEnd,
                               &$pos = null,
                               $after = null,
                               $afterIndex = 0)
{
	// safety cap on the number of blocks inspected
	$CYCLE_LIMIT = 500;

	$current = strInBetween($text, $separatorStart, $separatorEnd, $pos, $after, $afterIndex);
	$i = 0;
	while ($current !== null && strpos($current, $containing) === false && $i < $CYCLE_LIMIT)
	{
		// block rejected: resume just past its start delimiter
		$afterIndex = $pos + strlen($separatorStart);
		$current = strInBetween($text, $separatorStart, $separatorEnd, $pos, null, $afterIndex);
		$i++;
	}

	// compare with false explicitly: a match at index 0 is still a match
	if ($current !== null && strpos($current, $containing) !== false)
		return $current;

	$pos = -1;
	return null;
}

strInBetweenContaining is based on the first function. It returns the enclosed string in the same way, but only if that string contains the specified $containing string; blocks that do not contain it are skipped.
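
The "skip the blocks that do not match" idea can be sketched with a plain loop over strpos(). The helper firstBlockContaining() below is a hypothetical name written for this example, not the function from this tutorial, and the sample markup is invented.

```php
<?php
// Hypothetical helper written for this sketch.
// Returns the first block between $startSep and $endSep that contains $needle.
function firstBlockContaining($text, $startSep, $needle, $endSep)
{
    $offset = 0;
    while (($start = strpos($text, $startSep, $offset)) !== false) {
        $contentStart = $start + strlen($startSep);
        $end = strpos($text, $endSep, $contentStart);
        if ($end === false) {
            return null; // no closing delimiter left, so no further blocks
        }
        $block = substr($text, $contentStart, $end - $contentStart);
        if (strpos($block, $needle) !== false) {
            return $block;
        }
        $offset = $contentStart; // rejected: resume past this start delimiter
    }
    return null;
}

// Sample head with two <link> tags; the markup is invented for this sketch.
$head = '<link rel="stylesheet" href="/style.css"/>'
      . '<link rel="shortcut icon" href="http://www.site.com/favicon.ico"/>';

// The stylesheet link is skipped; the favicon link is returned.
$block = firstBlockContaining($head, '<link', 'rel="shortcut icon"', '/>');
echo $block, "\n";
```

Resuming right after the rejected start delimiter, rather than after the end delimiter, means a later block is still found even when an earlier tag was not closed with the expected delimiter.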

Here is the scraping section:

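require_once('util/getpage.php');     // expected to define getpage($url)
require_once('util/stringutils.php'); // the 2 functions shown above

$url = isset($_GET['u']) ? $_GET['u'] : '';
$page = (string) getpage($url);

function getTitle($page) { return strInBetween($page, '<title>', '</title>'); }
// '<h1' without '>' also matches h1 tags that carry attributes
function getH1($page) { return strInBetween($page, '<h1', '</h1>'); }
function getLink($page) 
{ 
	$block = strInBetweenContaining($page, '<link', 'rel="shortcut icon"', '/>'); 
	if ($block === null)
		return null;
	return strInBetween($block, 'href="', '"');
}

// escape scraped text before echoing it into our own page
echo htmlspecialchars((string) getTitle($page), ENT_QUOTES, 'UTF-8', false) . '<br>';
// '<h1' is re-attached so strip_tags() removes the rest of the opening tag
echo htmlspecialchars(strip_tags('<h1' . getH1($page)), ENT_QUOTES, 'UTF-8', false) . '<br>';

$favicon = getLink($page);
if ($favicon == null)
{
	// default to favicon.ico in the server root
	$parts = parse_url($url);
	if (isset($parts['scheme'], $parts['host']))
		$favicon = $parts['scheme'] . '://' . $parts['host'] . '/favicon.ico';
}

if ($favicon != null)
	echo '<img src="' . htmlspecialchars($favicon, ENT_QUOTES) . '" />';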

The code is pretty much self-explanatory. The title and the first h1 tag are extracted, then the favicon. If no favicon is found the code falls back to the default one (by convention, the "favicon.ico" file in the server root).

This example demonstrates how to retrieve a couple of simple values using a simple algorithm. It can be reused and extended to scrape other data from web pages. A regular expression parser would be a more flexible solution, but it requires good knowledge of regex.

If you have questions, requests or new ideas, just use the comments section to submit them.
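
The scraping code relies on a getpage() helper that is not listed in this tutorial. A minimal sketch of what such a helper might look like is given below, assuming that allow_url_fopen is enabled; the function body, the timeout and the user agent string are assumptions made for illustration, not the original implementation.

```php
<?php
// Hypothetical stand-in for getpage(); the tutorial does not show its own version.
// Returns the page HTML as a string, or null when the page cannot be fetched.
function getpage($url)
{
    // Accept only http(s) URLs so that local files cannot be read through ?u=...
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!in_array(strtolower((string) $scheme), array('http', 'https'), true)) {
        return null;
    }

    $context = stream_context_create(array(
        'http' => array(
            'timeout'         => 10,                  // seconds
            'follow_location' => 1,                   // follow redirects
            'user_agent'      => 'SimpleScraper/0.1', // made-up identifier
        ),
    ));

    // file_get_contents() returns false on failure; @ silences the warning.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

Since the URL comes straight from the query string, rejecting anything that is not http or https before fetching seems a sensible precaution in a helper like this.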


10 Comments

  1. Seems to me that you’re just doing the same as what a regex could do. But with more work and less efficient.

    preg_match(‘/(.*)/imU’, $html, $result);

  2. damn thing stripped my code, why can’t blog post system not use simple things like htmlspecialchars!? ah well.

  3. Fatal error: Call to undefined function getpage() in C:\Program Files (x86)\EasyPHP-5.3.9\www\search\index.php on line 6

    I get this error.

  4. I don’t understand the recursive call in the strInBetween function, line 12. I can sorta see where you’re going with it, but with the arguments used in either of the functions that call it, wouldn’t it endlessly call itself?

    When I made recursive functions, it seemed to me that where ever in the function it calls itself, it must complete the second call before continuing with the first which is why I would structure them such that the call was inside a loop or if statement.

    Perhaps this works differently when you’re technically assigning the result to a variable?

  5. Best solution will be use regular expression to extract using preg match and pull title using curl a working demo will be available at openplus.in/seo/seochecklist.php
