User:Lucia Dossin/Protyping/Modules/Scraping

From XPUB & Lens-Based wiki

Let me Think

For this module, I compared 'manual' and 'automated' scraping as part of my self-directed research (exercise #1).

The exercise consists of a script that runs in Greasemonkey and is meant to be usable even by someone with little knowledge of code. More info can be found here.

I manually inspected the HTML/CSS of the page and built the plugin code accordingly. But the HTML/CSS may change in the future, and the identifiers used in the script would then no longer match. In that case, being able to use Firebug is an important skill: it is what allows the user to update the code if necessary.

For the SDR exercise I put up a tutorial website, but adding a Firebug tutorial on top of that seemed like too much information. So I built a simple form (FAQ, question #11) that scrapes a URL, searches for a few elements and returns their CSS identifiers. The user can then change the code as explained in the tutorial, using the results the form returns.


I started by writing a Python script to scrape a URL and search for a few elements, based on their text.

# -*- coding: utf-8 -*-
# The encoding line is needed in Python 2 because of the "£" further down.
from __future__ import print_function
import html5lib
from urlparse import urljoin, urldefrag
import urllib2
from urllib2 import urlopen
from xml.etree import ElementTree as ET
import time

#preffix = "http://www.amazon.com/gp/product/B0009Z3MQU"
preffix = "http://www.amazon.co.uk/gp/product/B004DJ51HQ"

f = urlopen(preffix)
parsed = html5lib.parse(f, namespaceHTMLElements=False)

button = parsed.findtext("Add to Basket") #Cart
if button:
    button_inst = button.attrib.get("class")
    button_p = ET.tostring(button_inst, method="text", encoding="utf-8")
    print (button_p) #will print the button's class
else:
    print ('Button not found')

product = parsed.findtext("Learning Resources Answer Buzzers") # Squawkin' Chicken
if product:
    print (product.attrib.get("id")) #will print the product name's id
else:
    print ('Product name not found') 

price = parsed.findtext("£13.49") #$7.22
if price:
    price_inst = price.attrib.get("id")
    price_p = ET.tostring(price_inst, method="text", encoding="utf-8")
    print (price_p) #will print the product price's id
else:
    print ('Product price not found')

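The code above returns no elements. The likely cause is that findtext() expects a tag name or path and returns the text of the first matching subelement; it does not look elements up by their text content, so each call returns None and the script falls through to the 'not found' branches.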
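One way to look elements up by their text with ElementTree is to walk the tree and compare each element's own text. The sketch below is a minimal Python 3 illustration, not the code used for the form. The sample markup, its ids and the helper find_by_text are made up for the example; a real page would still need to be fetched and parsed with html5lib first.

```python
from xml.etree import ElementTree as ET

# Made-up, well-formed stand-in for a parsed product page (assumption).
SAMPLE = """
<html><body>
  <h1><span id="product-title">Learning Resources Answer Buzzers</span></h1>
  <b id="price-value">£13.49</b>
  <span class="basket-button"><span>Add to Basket</span></span>
</body></html>
"""


def find_by_text(root, text):
    """Return the first element whose own text equals `text`, or None."""
    for element in root.iter():
        if (element.text or "").strip() == text:
            return element
    return None


root = ET.fromstring(SAMPLE)

title = find_by_text(root, "Learning Resources Answer Buzzers")
price = find_by_text(root, "£13.49")
missing = find_by_text(root, "Add to Cart")  # not present in the sample

# Compare against None explicitly: an element without children is falsy.
print(title.attrib.get("id") if title is not None else "Product name not found")
print(price.attrib.get("id") if price is not None else "Product price not found")
print("Button not found" if missing is None else missing.tag)
```

Note the explicit comparison with None: in ElementTree an element that has no child elements evaluates as false, so a plain "if element:" test can skip a match that was actually found.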

To get the form running, I looked for an equivalent solution in PHP and ended up using PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/

<?php
include_once('inc/simple_html_dom.php');

if($_POST){
    $website = $_POST['url'];
    $button = $_POST['button'];
    $price = $_POST['price'];
    $title = $_POST['title'];
    $message = "";    
    
    $parse = parse_url($website);
    $domain = $parse['host'];
    
    foreach ($_POST as $d){
        if(empty($d) || $d == '&nbsp;'){    
            echo 'Please fill in all fields.';
            exit;
        }
        $message .= $d . ' '; // .= concatenates strings; += would do arithmetic
    }

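    $html = file_get_html($website);
    if (!$html) {
        // file_get_html() returns false when the page cannot be loaded
        echo 'Could not load the given URL.';
        exit;
    }
    // Defaults, in case an element is not found on the page
    $btn_id = $pr_id = $pr_tag = $pr_name = $pr_name_tag = '';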
    
    foreach($html->find('span') as $element){            
            if($element->innertext == $button){
                $p1 = $element->parent;
                $p2 = $p1->parent;
                $p3 = $p2->parent;
                $btn_id = $p3->class;               
               break;
            }
    }
        
    foreach($html->find('span') as $element){
        if($element->innertext == $price){
            $pr_id = $element->id;
            $pr_tag = $element->tag;
            break;            
        }
    }
    
    foreach($html->find('span') as $element){
        if($element->innertext == $title){
            $pr_name = $element->id;
            $pr_name_tag = $element->tag;
           break;
        }
    } ?>
        <table class="identifiers">
        <thead>
        <tr><td>website</td><td>button</td><td>product name</td><td>product price</td></tr>
        </thead>
        <tr><td><?php echo htmlspecialchars($domain); ?></td><td>.<?php echo htmlspecialchars($btn_id); ?></td><td><?php echo htmlspecialchars($pr_name_tag.'#'.$pr_name); ?></td><td><?php echo htmlspecialchars($pr_tag.'#'.$pr_id); ?></td></tr>
        </table>
<?php
}else{
    echo '<h1>No, no, no...</h1><p>Please try again, using the proper interface. Thank you.</p><p><a href="./">Go back</a></p>';
}
// Due to the PHP5 circular references memory leak, the library asks for clear() to be called on the DOM object to free memory.
// $html only exists when the form was submitted, so guard the cleanup.
if (isset($html) && $html) {
    $html->clear();
    unset($html);
}
?>
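The PHP above can be read as three steps: find the span whose text matches, climb a fixed number of parents for the button, and print the result as a CSS identifier. A minimal Python 3 sketch of the same idea follows, as an illustration rather than a port of the form. The sample markup and the helpers find_span, ancestor and css_identifier are hypothetical, and the three-level climb simply mirrors $p1 to $p3.

```python
from xml.etree import ElementTree as ET

# Made-up markup standing in for a product page (assumption).
SAMPLE = """
<div class="buy-box">
  <form>
    <p><span>Add to Basket</span></p>
  </form>
  <span id="price-value">£13.49</span>
</div>
"""


def find_span(root, text):
    """Return the first <span> whose own text equals `text`, or None."""
    for span in root.iter("span"):
        if (span.text or "").strip() == text:
            return span
    return None


def ancestor(root, element, levels):
    """Climb `levels` parents up from `element`; None if the tree is too shallow."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for _ in range(levels):
        element = parent_of.get(element)
        if element is None:
            return None
    return element


def css_identifier(element):
    """Format as tag#id if there is an id, else .class, else the bare tag."""
    if element.get("id"):
        return "{}#{}".format(element.tag, element.get("id"))
    if element.get("class"):
        return "." + ".".join(element.get("class").split())
    return element.tag


root = ET.fromstring(SAMPLE)

button_span = find_span(root, "Add to Basket")
button_box = ancestor(root, button_span, 3)   # span -> p -> form -> div
price_span = find_span(root, "£13.49")

print(css_identifier(button_box))
print(css_identifier(price_span))
```

The fixed number of levels is the fragile part of this approach: if the site adds or removes a wrapper element around the button, the climb ends on a different element and the identifier changes.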