Web scrappers are simple programs that are used to extract certain data from the web. Usually the structure of the the pages is known so scrappers have reduced complexity compared to parsers and crawlers.
In this tutorial we are going to create a simple parser that extract the title and favicon from any html page.
Usually scrappers are based on regular expressions but we are going to avoid them because they are difficult to manage and sometimes they have unexpected results. We are going to use simple php string functions instead.
URL handling is one of the tasks you have to do from time to time in PHP. Sometimes you have to do it because you want to record the referral sites, other times because you want to write your own spider or just because you want to retrieve your current URL.
PHP is a language developed around web for web developers and it contains all the functions you might need in your quests. There is a section in php documentation which groups the URL functions. Along with a few functions used to encode/decode which are rarely used the package contains the functions you can not live without:
It’s not a common problem but sometimes you have to check if 2 texts are similar. If you have to aggregate data from multiple sources you might know what I’m talking about.
The most simple thing you can try is to simply compare the 2 strings. A simple comparison will not help if one of the strings are contains an extra space. A more serious algorithm should be used for such cases. Fortunately php provides us several functions that can be used.