I’ve spent much of the last day building a little application in preparation for more work on my final year project. As I’ve mentioned before, my project is to build an IDE for the web programming language PHP, so one of the things I have to be able to do is autocomplete and autocorrect PHP function names, and give the user information on functions.
In order to be able to do this I need the information in the first place, after a search online for a machine readable list of functions, their parameters and their descriptions — perhaps in JSON or XML format — I was unable to find anything of use, therefore I decided I would have to scrape this information from the PHP.net website myself.
Scraping for functions was made somewhat easier than I thought it might be because the people at PHP.net maintain a quick reference page avaliable at http://www.php.net/quickref.php. After downloading this my program searches for anchor tags (links) in the unordered list element called ‘quickref_functions’, ensuring that what it is looking at is indeed a page about a function and not a page about a class, by ensuring the url starts with “/manual/en/function”. I then grab the hyperlink reference from each anchor tag.
Once I have parsed the entire page I am left with a list of URLs, each of these URLs points to a page with all the information about a particular function. For example, the first URL I always have in the list is http://www.php.net/manual/en/function.abs.php which gives all the information about the function abs();
I go through and download each of pages at these URLs and parse them, taking out the function signature, which looks like this:
number abs ( mixed $number )
Information about each parameter, the return type, the description and a link to the comments section. In the coming few days I hope to be able to parse out information allowing me to flag deprecated functions, provide popular comments about the function and show code examples amongst other things.
At the moment once I have downloaded and parsed all 4700+ function pages and parsed them I output them to an XML document, eventually I will insert all of them as records into a NoSQL database.
Because downloading and parsing all the information takes quite a while I will ship my IDE with a pre made database of PHP functions, but will allow uses to, via an advanced settings panel, attempt a redownload of all functions in order to download new ones, or changes to old ones. I think this will be a function that only the most advanced users use, if anyone does at all, but it is however an interesting unique selling point to my product.