Web scrappers are simple programs that are used to extract certain data from the web. Usually the structure of the the pages is known so scrappers have reduced complexity compared to parsers and crawlers.
In this tutorial we are going to create a simple parser that extract the title and favicon from any html page.
Usually scrappers are based on regular expressions but we are going to avoid them because they are difficult to manage and sometimes they have unexpected results. We are going to use simple php string functions instead.
Cache is a programming concept that can be used in a various range of applications and for various purposes. A cache library can be used for storing database queries for later use, to store rendered pages to be served again without generating them again, or to save indexed pages in a crawler application to be processed by multiple modules.
A cache mechanism is more simple that it might sound. It’s just a simple module that should implement 2 actions:
- to store a value(identified by a key).
- to retrieve a value if it’s not expired.
- additionally it can offer a mechanism to invalidate a set of values or the entire cache.
In this tutorial we are going to create a disk cache script. It stores the string values in files, each value is stored in a file and it contains an additional file to store the expiration date. Performance wise, this is not the best approach, but the script is designed like that with a clear purpose: the additional file can be used to store additional attributes, beside the expiration date. Imagine an application that crawls pages, with different modules. Each time a module crawls the page, it adds it’s result to the additional file.
It happens pretty often for me to have to run shell commands in a hosting environment. I do it all the time via a simple php script. I tested it on godaddy and dreamhost and on other hostings environments and it works fine.
Before starting the tutorial you should note that if this script is not handled carefully it can have undesired results. A wrong rm command can delete all the files you have on your hosting, so run the commands with care.