User:Tash/grad prototyping

From XPUB & Lens-Based wiki
Revision as of 08:59, 4 October 2018

Prototyping Session 1 & 2

Every Redaction, by James Bridle


Possible topics to explore:


Learning to use Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Documentation: https://docs.scrapy.org/en/latest/index.html

Scraping headlines from an Indonesian news site:
Screen Shot Scrapynews1.png

Using a spider to extract heading elements (h5) from: http://www.thejakartapost.com/news/index

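import scrapy


class TitlesSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl titles
    name = "titles"

    def start_requests(self):
        # Pages to fetch; each response is passed to parse()
        urls = [
            'http://www.thejakartapost.com/news/index',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield one item per <h5> element, holding its direct text nodes
        for title in response.css('h5'):
            yield {
                'text': title.css('h5::text').extract()
            }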

Crawling and saving to a json file:

scrapy crawl titles -o titles.json
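
Reading the export back in: the crawl writes a JSON array with one object per yielded item, and each 'text' value is a list because the spider calls .extract(). A minimal sketch for flattening that into a plain list of headlines, assuming the file was produced by the spider above. The helper name load_titles is made up for this example and is not part of Scrapy.

```python
import json


def load_titles(path):
    """Return a flat list of non-empty headline strings from the JSON export."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # a list of dicts, one per item the spider yielded

    titles = []
    for item in items:
        # 'text' is a list of strings; it can be empty if the h5 had no direct text
        for fragment in item.get("text", []):
            cleaned = " ".join(fragment.split())  # collapse newlines and extra spaces
            if cleaned:
                titles.append(cleaned)
    return titles
```

After a crawl, load_titles('titles.json') should give something that can be looped over and printed, counted or searched.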



To explore
  • NewsDiffs – as a way to expose the historiography of an article
  • how about looking at comments? what can you scrape (and analyse) from social media?
  • how far can you go without using an API?
  • self-censorship: can you track the things people write but then retract?
  • An Anthem to Open Borders
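
A possible starting point for the NewsDiffs and retraction ideas: if two versions of the same text have been scraped at different times, the standard library's difflib can list the lines that were in the earlier version but are gone or changed in the later one. This is only a sketch and not how NewsDiffs itself works. The function name retracted_lines and the two sample texts are invented for illustration.

```python
import difflib


def retracted_lines(old_text, new_text):
    """Return lines of old_text that were removed or changed in new_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes lines that only exist in the old version with "- "
    return [line[2:] for line in diff if line.startswith("- ")]


# Hypothetical example: two scrapes of the same article
old = (
    "Minister denies report\n"
    "The minister called the claims baseless.\n"
    "Protesters gathered outside."
)
new = (
    "Minister responds to report\n"
    "Protesters gathered outside."
)

for line in retracted_lines(old, new):
    print("removed or changed:", line)
```

Comparing line by line is crude; splitting into sentences first would probably give more readable results for article text.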

Pyratechnic1

Using html5lib and elementtree to scrape news sites!

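import html5lib
from xml.etree import ElementTree as ET  # not needed for parsing; handy for ET.tostring() when inspecting elements
from urllib.request import urlopen

# Fetch the page and parse it into an ElementTree (html5lib's default tree builder)
with urlopen('https://www.dailymail.co.uk') as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

# Find specific words in text content, skipping <script> elements
for x in t.iter():
    if x.text is not None and 'trump' in x.text.lower() and x.tag != 'script':
        print(x.text)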
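
One thing to watch with ElementTree: x.text only holds the text before an element's first child. Text that comes after a child element's closing tag is stored on that child as .tail, so a loop over x.text alone can miss part of a paragraph. A small sketch of a reusable search that checks both, assuming it is given a tree like the t returned by html5lib.parse above. The function name find_mentions and the sample markup are made up for the example, which uses ET.fromstring so it runs without fetching anything.

```python
from xml.etree import ElementTree as ET


def find_mentions(root, keyword, skip_tags=("script", "style")):
    """Return text fragments under root that contain keyword (case-insensitive)."""
    keyword = keyword.lower()
    matches = []
    for el in root.iter():
        # .text is the text directly inside el; skip JavaScript and CSS
        if el.tag not in skip_tags and el.text and keyword in el.text.lower():
            matches.append(el.text.strip())
        # .tail is the text after el's closing tag; it belongs to el's parent
        if el.tail and keyword in el.tail.lower():
            matches.append(el.tail.strip())
    return matches


# Hypothetical sample standing in for a parsed news page
sample = ET.fromstring(
    "<html><body>"
    "<h2>Trump visits</h2>"
    "<p>Earlier, <b>officials</b> said Trump would attend.</p>"
    "<script>var trump = 1;</script>"
    "</body></html>"
)
print(find_mentions(sample, "trump"))
```

With the scraped page, the call would be find_mentions(t, 'trump').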

See workshop pad here: https://pad.xpub.nl/p/pyratechnic1