Scrape Alexa rank with PHP

This tutorial shows how to scrape the Alexa rank of any website. I will use two different methods: the first uses regular expressions, and the second uses phpQuery, a PHP library modeled on jQuery.

Parse Alexa rank using regular expressions

The first thing to do is analyse the page and find the HTML source code where the rank is displayed.

<a href="/siteowners/certify?ax_atid=b2383bed-44cb-4edb-b1c5-dda06e67bfeb&site=htmlviewer.net">84,685</a>

From this chunk of html code we can construct the following regular expression:

<a href\=\"\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>

Where:

  • '[^>]+' - match one or more characters that are not a '>' sign, i.e. the rest of the opening tag
  • '([0-9\,]+)' - match and capture one or more digits and commas (the rank itself)
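As a quick check, the pattern can be tried against the sample line above without making any request. This is only a minimal sketch; the variable names are illustrative and not part of any library.

```php
<?php
// Sample anchor copied from the page source shown above.
$html = '<a href="/siteowners/certify?ax_atid=b2383bed-44cb-4edb-b1c5-dda06e67bfeb&site=htmlviewer.net">84,685</a>';

// The same pattern, wrapped in "/" delimiters with the case-insensitive "i" flag.
$regex = '/<a href\=\"\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>/i';

if (preg_match($regex, $html, $match)) {
    // $match[1] holds the captured group, here "84,685".
    $rank = (int) str_replace(',', '', $match[1]);
    echo $rank, "\n"; // 84685
}
```

The capture group keeps the thousands separators, so they are stripped before the value is cast to an integer.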

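function alexa_rank($domain){

    // Fetch the Alexa page for the domain.
    $data = file_get_contents( "http://www.alexa.com/siteinfo/" . $domain );
    if( $data === FALSE ){
        return false; // the request failed
    }

    // A ranked site: capture the digits and commas inside the link.
    $regex = "/<a href\=\"\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>/i";
    if( preg_match( $regex, $data, $match ) ){
        // Strip the thousands separators and return an integer.
        return (int) str_replace( ",", "", $match[1] );
    }

    // An unranked site: the link contains a dash instead of a number.
    $regex = "/<a href\=\"\/siteowners\/certify\?[^>]+><span[^>]+>\-<\/span><\/a>/i";
    if( preg_match( $regex, $data ) ){
        return 0;
    }

    // Neither pattern matched: the page layout has probably changed.
    return false;
}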

The second regular expression detects the case where Alexa has no rank for the specified domain. The function returns 0 in that case and false when neither pattern matches, so we can tell an unranked site apart from a parser that has stopped working.
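The three possible outcomes can be illustrated offline. The sketch below uses a hypothetical helper, parse_alexa_rank(), which applies the same two patterns to an HTML string instead of a fetched page. The sample snippets are assumptions, and the unranked markup in particular is inferred from the second pattern rather than copied from the real page.

```php
<?php
// Hypothetical helper: the same two patterns as alexa_rank(), applied to
// an HTML string so the outcomes can be checked without a network request.
function parse_alexa_rank($html) {
    $ranked   = '/<a href="\/siteowners\/certify\?[^>]+>([0-9,]+)<\/a>/i';
    $unranked = '/<a href="\/siteowners\/certify\?[^>]+><span[^>]+>-<\/span><\/a>/i';

    if (preg_match($ranked, $html, $match)) {
        return (int) str_replace(',', '', $match[1]);
    }
    if (preg_match($unranked, $html)) {
        return 0; // page recognised, but no rank available
    }
    return false; // markup not recognised: the parser probably needs updating
}

// Assumed sample snippets (not taken from the live site).
$ranked_html   = '<a href="/siteowners/certify?site=example.com">84,685</a>';
$unranked_html = '<a href="/siteowners/certify?site=example.com"><span class="x">-</span></a>';

var_dump(parse_alexa_rank($ranked_html));    // int(84685)
var_dump(parse_alexa_rank($unranked_html));  // int(0)
var_dump(parse_alexa_rank('<html></html>')); // bool(false)
```

Because 0 and false are loosely equal in PHP, the caller should compare the result with === to tell the two cases apart.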

Parse Alexa rank using the phpQuery library

As the documentation says: phpQuery is a CSS3 selector driven Document Object Model API based on jQuery JavaScript Library. So if you are good with jQuery, you will find this very handy.

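include('phpQuery.php');

function alexa_rank($domain){
    $data = file_get_contents("http://www.alexa.com/siteinfo/" . $domain);
    if( $data === FALSE ){
        return false;
    }
    // Load the page into phpQuery, then read the link text inside '.metricsUrl'.
    $doc = phpQuery::newDocument($data);
    $rank = str_replace( ",", "", trim( pq('.metricsUrl a')->text() ) );
    // Return the rank instead of echoing it; false if no number was found.
    return ctype_digit( $rank ) ? (int) $rank : false;
}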

Because I couldn't find a unique identifier for the 'a' tag itself, I used the 'metricsUrl' class name, which is set on the enclosing 'span' tag.
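To see what the '.metricsUrl a' selector picks out without installing phpQuery, here is a rough sketch using PHP's built-in DOM extension instead (DOMDocument and DOMXPath, not phpQuery). The sample markup is an assumption reconstructed from the description above, and the real page may differ.

```php
<?php
// Assumed markup: a span with the 'metricsUrl' class wrapping the rank link.
$html = '<div><span class="metricsUrl"><a href="/siteowners/certify?site=example.com">84,685</a></span></div>';

$doc = new DOMDocument();
// Suppress the warnings libxml emits for imperfect real-world HTML.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Rough XPath equivalent of the CSS selector ".metricsUrl a".
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " metricsUrl ")]//a');

$rank = false;
if ($nodes !== false && $nodes->length > 0) {
    // Take the text of the first matching link and strip the commas.
    $rank = (int) str_replace(',', '', trim($nodes->item(0)->textContent));
}
var_dump($rank); // int(84685)
```

The XPath expression matches any element whose class list contains 'metricsUrl' and then any 'a' descendant, which is what the CSS descendant selector expresses.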

Conclusion

The phpQuery method is considerably slower than the regular expression method, because it has to parse the whole document and load the DOM into memory. Another way to scrape and parse the Alexa rank is to use the 'PHP Simple HTML DOM Parser' library.

- Posted by Ana to Php