User:Tash/grad prototyping: Difference between revisions

From XPUB & Lens-Based wiki
Revision as of 13:56, 30 September 2018

Prototyping Session 1 & 2

Every Redaction, by James Bridle


Possible topics to explore:


Learning to use Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Documentation: https://docs.scrapy.org/en/latest/index.html

Scraping headlines from an Indonesian news site:

Using a spider to extract header elements (H5) from: http://www.thejakartapost.com/news/index

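import scrapy


class TitlesSpider(scrapy.Spider):
    # Spider name used on the command line: scrapy crawl titles
    name = "titles"

    def start_requests(self):
        # Pages to start crawling from
        urls = [
            'http://www.thejakartapost.com/news/index',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield one item per H5 element found on the page
        for title in response.css('h5'):
            yield {
                # extract() returns a list of the text nodes directly inside this H5
                'text': title.css('h5::text').extract()
            }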

Running the crawl and saving the results to a JSON file:

scrapy crawl titles -o titles.json
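
Reading the results back in: since the spider yields one item per H5 with a 'text' list, titles.json should come out as a JSON array of objects in that shape. Below is a minimal sketch for turning that into a flat list of headlines, using only the standard library. The function names (clean_headlines, load_headlines) and the sample items are made up for illustration and are not part of Scrapy or of the actual scrape.

```python
import json


def clean_headlines(items):
    """Turn a list of {'text': [...]} items into a flat list of headline strings."""
    headlines = []
    for item in items:
        # Join the text fragments of one H5 and collapse runs of whitespace
        text = " ".join(" ".join(item.get("text", [])).split())
        if text:  # skip H5 elements that had no direct text
            headlines.append(text)
    return headlines


def load_headlines(path):
    """Read a JSON feed file (assumed: a JSON array of items) and clean it."""
    with open(path, encoding="utf-8") as f:
        return clean_headlines(json.load(f))


if __name__ == "__main__":
    # Made-up sample in the shape the spider yields, not real scraped data
    sample = [
        {"text": ["\n  First headline "]},
        {"text": []},
        {"text": ["Second", " headline"]},
    ]
    print(clean_headlines(sample))
```

With the real output, load_headlines('titles.json') would be the entry point. Empty items are dropped here because an H5 that only wraps a link has no direct text nodes, so its 'text' list comes back empty.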