Filtering Google using the API

Have you ever tried to search the internet for reviews of any kind of gadget? I typically search for 'whatever review', and my results are usually a mess of fake review sites or sites that rely solely on user rants.

While trying various ways to filter these from my searches I turned to PHP. Read on for some information about using the Google API to filter your searches.

Initially I tried to limit my searches by adding "-whatever" to the search terms, but this ran into the 10-word search term limit quite quickly. There are other ways of getting the most out of 10 words, such as careful use of "*", but these are just not good enough.

To get going with the API you will need your own developer key. They are free but you need to register at http://api.google.com/. Google provides libraries for Java and .Net but you can use any language with SOAP bindings.

I am using PHP and the NuSOAP library. I will not go into the details of SOAP here; you can find a good tutorial on the Zend website.

It is very simple to query Google; the following snippet will do the work. For a full explanation of the various parameters, see the API Reference.

$ggle = new soapClient('http://api.google.com/search/beta2');

First we create a soapClient instance that points to the API.

$params = array('key' => $gglKey,
              'q' => $query,
              'start' => 0,
              'maxResults' => 10,
              'filter' => false,
              'restrict' => '',
              'safeSearch' => false,
              'lr' => 'en',
              'ie' => '',
              'oe' => '');

This is an array that contains our search parameters. The most important ones here are $gglKey and $query; you need to assign these values in your code.

$result = $ggle->call("doGoogleSearch", $params, "urn:GoogleSearch", "urn:GoogleSearch");

This is where it all happens: you are making a call to the Google web service and the result will end up in $result.

At this point you should have a structured array in $result that can contain warnings, errors or actual search results.
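Before reading any fields it helps to check which of those cases you got. The function below is only a rough sketch: describeResult is a made-up helper name, and the 'faultstring' key is an assumption about how your SOAP library reports a fault, so check it against what your library actually returns.

```php
<?php
// Classifies the value returned by the SOAP call.
// ASSUMPTION: a fault comes back as an array with a 'faultstring' entry.
// This helper is hypothetical and not part of NuSOAP or the Google API.
function describeResult($result) {
    if (!is_array($result)) {
        // The call failed before a structured response arrived.
        return 'error: no usable response';
    }
    if (isset($result['faultstring'])) {
        return 'error: ' . $result['faultstring'];
    }
    if (isset($result['resultElements']) && is_array($result['resultElements'])) {
        return 'ok: ' . count($result['resultElements']) . ' results';
    }
    // A valid response that simply has no hits.
    return 'ok: 0 results';
}
```

Calling describeResult($result) right after the API call gives you one string to log or show before you start walking the results.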

For a full list of entries in the result refer to the API Reference. A few simple ones to look at are:

$result['estimatedTotalResultsCount']
$result['searchTime']
$result['searchComments']
$result['searchTips']

The variable names are pretty self-explanatory; if in doubt, read the reference. Walking through the actual results is slightly more complicated; the code below will do it for you.

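// Only walk the results if the response actually contains some.
if (isset($result['resultElements']) && is_array($result['resultElements'])) {
   foreach($result['resultElements'] as $r) {
      // Linked title, then snippet with cached size, then the URL itself.
      print ("<a href='" . $r['URL'] . "'>" . $r['title'] . "</a>\n");
      print ($r['snippet'] . "(" . $r['cachedSize'] . ")\n");
      print ("<br>" . "<a href='" . $r['URL'] . "'>" . $r['URL'] . "</a>\n");
      print ("<p>\n");
   }
}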

Those are the basics. To remove the junk URLs from your search, simply keep a list of sites to filter and test each result's URL ($r['URL'] in the loop above) against it before showing the result.
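One way to do that test is to compare the host part of each URL with an ignore list. This is a minimal sketch and not necessarily how my own script does it: the function names isIgnored and filterResults and the example hosts are made up for illustration, and matching on the host (including subdomains) is just one reasonable choice.

```php
<?php
// Hypothetical ignore list; replace with the sites you want to drop.
$ignoreHosts = array('fake-reviews.example', 'rants.example');

// Returns true when the URL's host is an ignored host or a subdomain of one.
function isIgnored($url, array $ignoreHosts) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host could be parsed, so do not filter it
    }
    $host = strtolower($host);
    foreach ($ignoreHosts as $bad) {
        $bad = strtolower($bad);
        // Exact match, or the host ends in ".<ignored host>".
        if ($host === $bad || substr($host, -strlen($bad) - 1) === '.' . $bad) {
            return true;
        }
    }
    return false;
}

// Keeps only the result elements whose URL is not on the ignore list.
function filterResults(array $elements, array $ignoreHosts) {
    $kept = array();
    foreach ($elements as $r) {
        if (!isIgnored($r['URL'], $ignoreHosts)) {
            $kept[] = $r;
        }
    }
    return $kept;
}
```

You would pass $result['resultElements'] through filterResults() and print whatever comes back. Matching on the host rather than the whole URL avoids dropping a good page just because an ignored name appears somewhere in its path.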

One of the drawbacks of the Google API is that it returns a maximum of 10 results per query. When you filter searches, you may filter out so many results that too few remain to be useful. To get around this, simply build your search into a loop and run multiple queries until you have enough results.
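The loop can be sketched roughly like this. It is an illustration only: collectResults is a made-up name, $fetchPage stands for whatever code of yours runs one query with a given 'start' offset and returns its resultElements, $keep stands for your filter test, and the cap on API calls is my own assumption to stop a runaway loop.

```php
<?php
// Collects up to $wanted results that pass the filter, querying page by page.
// $fetchPage: hypothetical callable, takes a start offset and returns an
//             array of result elements (empty when there is nothing more).
// $keep:      hypothetical callable, returns true for results worth showing.
function collectResults($fetchPage, $keep, $wanted = 20, $maxCalls = 10, $pageSize = 10) {
    $kept = array();
    $ignored = 0;
    $calls = 0;
    for ($start = 0; count($kept) < $wanted && $calls < $maxCalls; $start += $pageSize) {
        $page = call_user_func($fetchPage, $start);
        $calls++;
        if (!is_array($page) || count($page) === 0) {
            break; // nothing more to fetch for this query
        }
        foreach ($page as $r) {
            if (!call_user_func($keep, $r)) {
                $ignored++;
            } elseif (count($kept) < $wanted) {
                $kept[] = $r; // good results beyond $wanted are dropped
            }
        }
    }
    return array('results' => $kept, 'ignored' => $ignored, 'calls' => $calls);
}
```

The returned 'ignored' and 'calls' counts let you report how many pages were filtered out and how many queries that took, which matters because each query counts against your developer key.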

I have a very simple implementation of the looping concept that you can play with here. For a good example search, type "apple ipod review" into the box, leave the ignore list at its default and hit search. You should see about 7 pages ignored, which means it had to do 3 API calls to show 20 hits. Compare that to a normal Google query for the same term. Much better :)

1 Comment

I'm very curious how you went about doing the looping sequence to do multiple queries. Your looping concept link is broken: http://www.devco.net/~rip/google2/
I'm getting a "404 Not Found" for that URL.
