Web scrapers are simple programs used to extract specific data from the web. The structure of the target pages is usually known in advance, so scrapers are less complex than parsers and crawlers.
In this tutorial we are going to create a simple scraper that extracts the title and favicon from any HTML page.
Scrapers are usually based on regular expressions, but we are going to avoid them because they are difficult to maintain and sometimes produce unexpected results. We are going to use simple PHP string functions instead.
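As a rough illustration of the string-function approach, here is a minimal sketch. The helper names (text_between, attribute_value, extract_title, extract_favicon) are made up for this example, and it assumes reasonably well-formed markup with quoted attribute values, so treat it as a starting point rather than a complete scraper.

```php
<?php
// Return the text between $start and the next $end (case-insensitive),
// searching from $offset; null when either marker is missing.
function text_between(string $html, string $start, string $end, int $offset = 0): ?string
{
    $from = stripos($html, $start, $offset);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = stripos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Read a quoted attribute value (double or single quotes) from a single tag.
function attribute_value(string $tag, string $name): ?string
{
    foreach (['"', "'"] as $quote) {
        $value = text_between($tag, $name . '=' . $quote, $quote);
        if ($value !== null) {
            return $value;
        }
    }
    return null;
}

// Find "<title", skip to the end of the opening tag, read up to "</title>".
function extract_title(string $html): ?string
{
    $open = stripos($html, '<title');
    if ($open === false) {
        return null;
    }
    $title = text_between($html, '>', '</title>', $open);
    return $title === null ? null : trim(html_entity_decode($title));
}

// Walk over every <link> tag and return the href of the first one
// whose rel attribute mentions "icon".
function extract_favicon(string $html): ?string
{
    $offset = 0;
    while (($open = stripos($html, '<link', $offset)) !== false) {
        $close = strpos($html, '>', $open);
        if ($close === false) {
            break;
        }
        $tag = substr($html, $open, $close - $open + 1);
        $offset = $close + 1;
        $rel = attribute_value($tag, 'rel');
        if ($rel !== null && stripos($rel, 'icon') !== false) {
            return attribute_value($tag, 'href');
        }
    }
    return null;
}

// Sample markup used only for this example.
$sample = '<html><head><TITLE> My &amp; Page </TITLE>'
    . '<link rel="stylesheet" href="/s.css">'
    . '<link rel="shortcut icon" href="/favicon.ico"></head></html>';

$title = extract_title($sample);
$icon = extract_favicon($sample);
```

For a live page, the HTML string could come from file_get_contents($url), assuming the host allows URL wrappers. The returned favicon may be a relative path that still has to be resolved against the page URL.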
Caching is a programming concept used in a wide range of applications and for various purposes. A cache library can store database query results for later use, store rendered pages so they can be served again without regenerating them, or save indexed pages in a crawler application so that multiple modules can process them.
A cache mechanism is simpler than it might sound. It’s just a module that should implement two actions:
- store a value (identified by a key).
- retrieve a value if it has not expired.
- optionally, it can also offer a mechanism to invalidate a set of values or the entire cache.
In this tutorial we are going to create a disk cache script. It stores string values in files: each value is stored in its own file, accompanied by an additional file that stores the expiration date. Performance-wise this is not the best approach, but the script is designed this way for a clear purpose: the additional file can store other attributes besides the expiration date. Imagine an application that crawls pages with different modules. Each time a module crawls a page, it adds its result to the additional file.
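The design described above could be sketched roughly as follows. The class name DiskCache, the ".cache" and ".meta" suffixes, and the JSON format of the additional file are assumptions made for this example, not the tutorial's actual script.

```php
<?php
// Minimal disk cache sketch: each key maps to a value file plus a ".meta"
// file holding the expiration timestamp (and, potentially, other attributes).
class DiskCache
{
    private string $dir;

    public function __construct(string $dir)
    {
        if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
            throw new RuntimeException("Cannot create cache directory: $dir");
        }
        $this->dir = rtrim($dir, '/');
    }

    // Hash the key so that any string becomes a safe file name.
    private function path(string $key, string $suffix): string
    {
        return $this->dir . '/' . sha1($key) . $suffix;
    }

    // Store a value and write its expiration date to the additional file.
    public function set(string $key, string $value, int $ttl = 3600): void
    {
        file_put_contents($this->path($key, '.cache'), $value, LOCK_EX);
        $meta = ['expires' => time() + $ttl];
        file_put_contents($this->path($key, '.meta'), json_encode($meta), LOCK_EX);
    }

    // Return the value, or null when it is missing or expired.
    public function get(string $key): ?string
    {
        $valueFile = $this->path($key, '.cache');
        $metaFile = $this->path($key, '.meta');
        if (!is_file($valueFile) || !is_file($metaFile)) {
            return null;
        }
        $meta = json_decode(file_get_contents($metaFile), true);
        if (!is_array($meta) || ($meta['expires'] ?? 0) < time()) {
            $this->delete($key);
            return null;
        }
        $value = file_get_contents($valueFile);
        return $value === false ? null : $value;
    }

    // Invalidate a single key.
    public function delete(string $key): void
    {
        foreach (['.cache', '.meta'] as $suffix) {
            $file = $this->path($key, $suffix);
            if (is_file($file)) {
                unlink($file);
            }
        }
    }

    // Invalidate the entire cache.
    public function clear(): void
    {
        $files = array_merge(
            glob($this->dir . '/*.cache') ?: [],
            glob($this->dir . '/*.meta') ?: []
        );
        foreach ($files as $file) {
            unlink($file);
        }
    }
}

// Usage: the directory name is arbitrary and chosen for this example.
$cache = new DiskCache(sys_get_temp_dir() . '/disk_cache_demo_' . uniqid());
$cache->set('homepage', '<h1>Hello</h1>', 60);
$cached = $cache->get('homepage');
```

Because the metadata is a small JSON document, a crawler module could add its own keys next to "expires" without touching the cached value itself.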
URL handling is one of the tasks you have to do from time to time in PHP. Sometimes you do it because you want to record referring sites, other times because you want to write your own spider, or simply because you want to retrieve the current URL.
PHP is a language developed around the web, for web developers, and it contains all the functions you might need. There is a section in the PHP documentation that groups the URL functions. Along with a few rarely used encoding/decoding functions, it contains the functions you cannot live without:
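A short sketch of the kind of URL handling mentioned here, using parse_url, parse_str and http_build_query. Picking these three is an assumption about which functions are meant. The example URLs and the current_url helper are made up for illustration.

```php
<?php
// Example URL, invented for this sketch.
$url = 'https://www.example.com:8080/blog/post.php?id=42&tag=php#comments';

// Split the URL into scheme, host, port, path, query and fragment.
$parts = parse_url($url);

// Turn the query string into an associative array.
parse_str($parts['query'], $params);

// Change a parameter and build the query string again.
$params['tag'] = 'web scraping';
$query = http_build_query($params);

// Recording a referring site: only the host is usually needed.
$refHost = parse_url('https://news.example.org/story?id=7', PHP_URL_HOST);

// Rebuild the current URL from a $_SERVER-like array (hypothetical helper).
function current_url(array $server): string
{
    $https = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $host = $server['HTTP_HOST'] ?? 'localhost';
    $uri = $server['REQUEST_URI'] ?? '/';
    return ($https ? 'https' : 'http') . '://' . $host . $uri;
}
```

In a real request you would call current_url($_SERVER). Passing the array in explicitly just makes the helper easy to try from the command line.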
I often have to run shell commands in a hosting environment. I do it all the time via a simple PHP script. I have tested it on GoDaddy, DreamHost and other hosting environments, and it works fine.
Before starting the tutorial, note that this script can have undesired results if it is not handled carefully. A wrong rm command can delete all the files on your hosting account, so run commands with care.
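A minimal sketch of what such a script might look like, assuming the host has not disabled exec() through the disable_functions setting. The run_command helper is a name invented for this example and is not the script from the tutorial.

```php
<?php
// Run a shell command and return its exit code together with its output.
function run_command(string $command): array
{
    $lines = [];
    $exitCode = 0;
    // "2>&1" redirects stderr to stdout so error messages are captured too.
    exec($command . ' 2>&1', $lines, $exitCode);
    return ['exit_code' => $exitCode, 'output' => implode("\n", $lines)];
}

// escapeshellarg() quotes the argument so it is passed as a single string.
$result = run_command('echo ' . escapeshellarg('hello from the shell'));
echo $result['output'], "\n";
```

Checking exit_code before trusting the output is a cheap safeguard: a non-zero value means the command reported a failure.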
It’s not a common problem, but sometimes you have to check whether two texts are similar. If you have to aggregate data from multiple sources, you might know what I’m talking about.
The simplest thing you can try is to compare the two strings directly. A simple comparison will not help if one of the strings contains an extra space, so a more robust algorithm is needed in such cases. Fortunately, PHP provides several functions for this.
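To make the extra-space case concrete, here is a small sketch using two of PHP's built-in functions, levenshtein() and similar_text(). The is_similar helper and its 90% threshold are arbitrary choices for this example.

```php
<?php
$a = 'PHP string functions';
$b = 'PHP  string functions'; // same text with one extra space

// A plain comparison treats the two strings as different.
$identical = ($a === $b);

// levenshtein() counts the single-character edits needed to turn $a into $b.
$distance = levenshtein($a, $b);

// similar_text() returns the number of matching characters and
// writes a similarity percentage into its third argument.
$matching = similar_text($a, $b, $percent);

// Hypothetical helper: treat two strings as similar above a chosen threshold.
function is_similar(string $a, string $b, float $minPercent = 90.0): bool
{
    similar_text($a, $b, $percent);
    return $percent >= $minPercent;
}

var_dump($identical, $distance, is_similar($a, $b));
```

Here the edit distance is 1, since a single inserted space separates the two strings. The right threshold for is_similar depends on how noisy the aggregated data is.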