PHP HTML CSS Tutorials

Tutorials, Resources and Snippets

How to write a simple scraper in PHP without Regex

Web scrapers are simple programs used to extract specific data from web pages. The structure of the pages is usually known in advance, so scrapers are less complex than general-purpose parsers and crawlers.

In this tutorial we are going to create a simple scraper that extracts the title and favicon from any HTML page.

Scrapers are usually based on regular expressions, but we are going to avoid them because they are difficult to maintain and sometimes produce unexpected results. We are going to use simple PHP string functions instead.
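As a minimal sketch of that idea (the sample string and variable names are illustrative assumptions, not part of the tutorial's code), strpos locates a delimiter and substr cuts out the text that follows it:

```php
<?php
// Illustrative input; any HTML string would do.
$html = '<html><title>Page Title | Site Name</title></html>';

$open  = '<title>';
$close = '</title>';

// strpos() returns the index of the first match, or false when there is none.
$start = strpos($html, $open);
$title = null;
if ($start !== false) {
    // Skip past the opening delimiter itself.
    $start += strlen($open);
    // Look for the closing delimiter only after the opening one.
    $end = strpos($html, $close, $start);
    if ($end !== false) {
        $title = substr($html, $start, $end - $start);
    }
}

echo $title, "\n"; // prints "Page Title | Site Name"
```

Both lookups are compared against false with !==, because a match at index 0 is a valid result.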

Let’s assume all the pages we are scraping have the following structure.

<html>
<title>Page Title | Site Name</title>
<link rel="shortcut icon" href="http://www.site.com/favicon.ico"/> 
...
<body>
...
<h1>Title</h1>
...
</body>

</html>

We are going to retrieve the title and the favicon from the head section, and if the title is not populated we are going to look for it inside the body. Both operations can be covered using only two simple functions:

function strInBetween($text,
                               $separatorStart,
                               $separatorEnd,
                               &$positionOut = null,
                               $after = null,
                               $afterIndex = 0)
{
	// Treat a missing page (null) as an empty string.
	$text = (string) $text;

	// When $after is given, start searching from its first occurrence.
	if ($after !== null)
	{
		$pos = strpos($text, $after);
		if ($pos !== false)
			$afterIndex = $pos;
	}

	// Find the left delimiter, then the right one after it.
	$start = strpos($text, $separatorStart, $afterIndex);
	$end = false;
	if ($start !== false)
	{
		$start += strlen($separatorStart);
		$end = strpos($text, $separatorEnd, $start);
	}

	// Either delimiter is missing: report failure.
	if ($start === false || $end === false)
	{
		$positionOut = -1;
		return null;
	}

	// $positionOut is the index in $text where the returned substring begins.
	$positionOut = $start;
	return substr($text, $start, $end - $start);
}

The function above returns the first substring of $text that lies between two string delimiters, $separatorStart and $separatorEnd. Optionally, a substring or an index can be given so that the search starts only after it:

  • $text - the text to search in
  • $separatorStart - the string delimiter on the left
  • $separatorEnd - the string delimiter on the right
  • &$positionOut - optional output parameter; receives the position where the string was found
  • $after - optional parameter; if specified, the result is searched only after the first occurrence of $after
  • $afterIndex - optional parameter, used if $after is null; the index where the search should start

function strInBetweenContaining($text,
                               $separatorStart,
                               $containing,
                               $separatorEnd,
                               &$pos = null,
                               $after = null,
                               $afterIndex = 0)
{
	// Safety cap on the number of delimited blocks that are inspected.
	$CYCLE_LIMIT = 500;

	$current = strInBetween($text, $separatorStart, $separatorEnd, $pos, $after, $afterIndex);
	$i = 0;
	// Move on to the next delimited block until one contains $containing.
	while ($current !== null && strpos($current, $containing) === false && $i < $CYCLE_LIMIT)
	{
		// Resume right after the closing delimiter of the block just checked.
		$afterIndex = $pos + strlen($current) + strlen($separatorEnd);
		$current = strInBetween($text, $separatorStart, $separatorEnd, $pos, null, $afterIndex);
		$i++;
	}

	// Compare with !== false: a match at index 0 is still a match.
	if ($current !== null && strpos($current, $containing) !== false)
		return $current;

	$pos = -1;
	return null;
}

strInBetweenContaining is built on the first function. It returns the same kind of delimited substring, but only if that substring contains the specified string $containing; otherwise it moves on to the next delimited block.
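One detail worth spelling out when testing for containment with strpos: a match at the very start of the string yields index 0, which PHP treats as falsy. The sketch below uses a hypothetical helper, containsText (not part of the tutorial's code), to show the strict comparison that avoids this pitfall:

```php
<?php
// Hypothetical helper: true when $needle occurs anywhere in $haystack.
function containsText($haystack, $needle)
{
    // strpos() returns 0 for a match at the very start, and 0 is falsy,
    // so the result must be compared against false with !==.
    return strpos($haystack, $needle) !== false;
}

$block = 'rel="shortcut icon" href="/favicon.ico"';

var_dump((bool) strpos($block, 'rel='));            // bool(false), although 'rel=' is at index 0
var_dump(containsText($block, 'rel='));             // bool(true)
var_dump(containsText($block, 'rel="stylesheet"')); // bool(false)
```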

Here is the scraping section:

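require_once('util/getpage.php');
require_once('util/stringutils.php');

$url = isset($_GET['u']) ? $_GET['u'] : '';
$page = getpage($url);

function getTitle($page) { return strInBetween($page, '<title>', '</title>'); }
// '<h1' without '>' also matches h1 tags that carry attributes.
function getH1($page) { return strInBetween($page, '<h1', '</h1>'); }
function getLink($page)
{
	$block = strInBetweenContaining($page, '<link', 'rel="shortcut icon"', '/>');
	if ($block === null)
		return null;
	return strInBetween($block, 'href="', '"');
}

// Escape the scraped text before echoing it into our own page.
echo htmlspecialchars((string) getTitle($page)) . '<br>';
// Re-attach '<h1' so strip_tags() can drop the rest of the opening tag.
echo htmlspecialchars(strip_tags('<h1' . getH1($page))) . '<br>';

$favicon = getLink($page);
if ($favicon == null)
{
	// Default location by convention: favicon.ico in the server root.
	$root = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
	$favicon = $root . '/favicon.ico';
}

echo "<img src='" . htmlspecialchars($favicon, ENT_QUOTES) . "' />";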

The code is largely self-explanatory: it extracts the title and the first h1 tag, then the favicon. If no favicon is found, the code falls back to the default one (by convention, "favicon.ico" in the server root).
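The listing calls getpage() from util/getpage.php, which is not shown in this post. The following is only a sketch of what such a helper might look like, assuming allow_url_fopen is enabled; the timeout, the User-Agent value, and the choice to return null on failure are all assumptions:

```php
<?php
// Hypothetical stand-in for util/getpage.php, which the post does not show.
// Returns the response body as a string, or null on failure.
function getpage($url)
{
    // Accept only absolute http(s) URLs, since $url comes from user input.
    $scheme = parse_url((string) $url, PHP_URL_SCHEME);
    if (!in_array(strtolower((string) $scheme), array('http', 'https'), true)) {
        return null;
    }

    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => 10, // seconds (assumed value)
            'header'  => "User-Agent: simple-scraper\r\n",
        ),
    ));

    // Requires allow_url_fopen; @ suppresses the warning raised on failure.
    $body = @file_get_contents($url, false, $context);
    return ($body === false) ? null : $body;
}
```

Checking the scheme first keeps a request such as ?u=/etc/passwd from reading local files through file_get_contents.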

This example demonstrates how to retrieve a couple of simple values using a simple algorithm. It can be reused and extended to scrape other data from web pages. A regular expression parser would be a more flexible solution, but it requires good knowledge of regex.
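As one possible extension, the same strpos/substr approach can collect every match instead of only the first. The helper below, collectBetween, is a hypothetical name introduced for this sketch and is not part of the tutorial's stringutils.php:

```php
<?php
// Hypothetical helper: returns every substring found between $open and $close.
function collectBetween($text, $open, $close, $limit = 500)
{
    $results = array();
    $offset = 0;

    // $limit caps the number of matches, like $CYCLE_LIMIT in the tutorial.
    while (count($results) < $limit) {
        $start = strpos($text, $open, $offset);
        if ($start === false) {
            break;
        }
        $start += strlen($open);

        $end = strpos($text, $close, $start);
        if ($end === false) {
            break;
        }

        $results[] = substr($text, $start, $end - $start);
        // Continue after the closing delimiter so matches never overlap.
        $offset = $end + strlen($close);
    }

    return $results;
}

$html = '<a href="/one">1</a> <a href="/two">2</a>';
print_r(collectBetween($html, 'href="', '"')); // array with "/one" and "/two"
```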

If you have questions, requests or new ideas, just use the comments section to submit them.

Written by admin

June 15th, 2010 at 6:19 am

Posted in howto,parsing,Util

9 Responses to 'How to write a simple scraper in PHP without Regex'

  1. This is true that the regex ( regular expressions ) are difficult. But once you understand it it saves your most of the works. You can learn the PHP regular expressions from basics from http://www.99tutes.com/content/learn-reular-expressions.html here.

    shirish v

    16 Jun 10 at 2:59 am

  2. Nice explanation. But still the strpos matching is considerably more work than a simple regex.

    mario

    17 Jun 10 at 6:56 am

  3. Seems to me that you’re just doing the same as what a regex could do. But with more work and less efficient.

    preg_match(‘/(.*)/imU’, $html, $result);

    James Dempster

    17 Jun 10 at 1:21 pm

  4. damn thing stripped my code, why can’t blog post system not use simple things like htmlspecialchars!? ah well.

    James Dempster

    17 Jun 10 at 1:22 pm

  5. Or you could just use the PHP Simple HTML DOM Parser instead. Much easier http://simplehtmldom.sourceforge.net/

    Chris

    17 Jun 10 at 3:00 pm

  6. Fatal error: Call to undefined function getpage() in C:\Program Files (x86)\EasyPHP-5.3.9\www\search\index.php on line 6

    I get this error.

    Cristi

    3 Jun 12 at 4:48 am

  7. I don’t understand the recursive call in the strInBetween function, line 12. I can sorta see where you’re going with it, but with the arguments used in either of the functions that call it, wouldn’t it endlessly call itself?

    When I made recursive functions, it seemed to me that where ever in the function it calls itself, it must complete the second call before continuing with the first which is why I would structure them such that the call was inside a loop or if statement.

    Perhaps this works differently when you’re technically assigning the result to a variable?

    Devon

    4 Sep 12 at 12:41 am

  8. Best solution will be use regular expression to extract using preg match and pull title using curl a working demo will be available at openplus.in/seo/seochecklist.php

    Adroit Seo

    19 Oct 12 at 11:01 am

  9. How to take data if login required??

    Shojol80

    2 May 13 at 6:13 pm
