INPUTS: Getting at text in various forms of (un)structure

Text in various forms of (un)structure + PYTHON basics

Reading from a text file

import sys
for line in sys.stdin:
    print(line, end="")  # each line already ends with a newline

stdin, stdout, stderr
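
As a rough sketch of how the three standard streams divide the work: input arrives on stdin, results go to stdout, and messages about the process go to stderr so they do not end up mixed into the results. The function name `copy_lines` is made up for this illustration.

```python
import sys

def copy_lines(source, out, err):
    """Copy every line from source to out and report the line count on err."""
    count = 0
    for line in source:
        out.write(line)  # lines keep their trailing newline
        count += 1
    err.write("%d lines read\n" % count)  # diagnostics stay out of the results
    return count
```

Calling `copy_lines(sys.stdin, sys.stdout, sys.stderr)` at the end of a script and running it as `python script.py < input.txt > output.txt` would write the copied text to output.txt, while the line count still appears in the terminal.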

Reading from a spreadsheet

CSV is short for Comma Separated Values and is a simple spreadsheet format. Excel files can be converted to CSV.

import sys
from csv import DictReader

for row in DictReader(sys.stdin):
    print(row)  # each row maps a column header to that row's cell value
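
To see what DictReader produces without a spreadsheet at hand, here is a small sketch that reads from an in-memory string instead of stdin. The sample data and its column names (`title`, `year`) are made up for the illustration.

```python
import io
from csv import DictReader

# Sample data standing in for a CSV file piped to stdin (illustrative only).
sample = io.StringIO("title,year\nExercices de style,1947\nLa Disparition,1969\n")

rows = list(DictReader(sample))
for row in rows:
    # Values are always strings; convert explicitly when a number is needed.
    print(row["title"], int(row["year"]))
```

The first line of the CSV supplies the keys, so individual cells can be picked out by column name rather than by position.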

Scraping a web page

HTML parsing used to be quite complex. Luckily, these days the html5lib library deals with most idiosyncrasies of web pages "in the wild" and gives access to the parsed page through the "ElementTree" interface.

from html5lib import parse
from urllib.request import urlopen

f = urlopen("http://www.gutenberg.org/files/56372/56372-h/56372-h.htm")
tree = parse(f, namespaceHTMLElements=False)
for x in tree.findall(".//h2"):
    print("".join(x.itertext()))  # x.text alone misses text inside child tags

Maybe another example using the CIA World Factbook

Using a Feed

See, for instance, http://www.bbc.co.uk/news/10628494 for a selection of feeds from the BBC News website.

from feedparser import parse
feed = parse("http://feeds.bbci.co.uk/news/rss.xml")
for item in feed.entries:
    print(item.title)

Reading JSON (mediawiki API example... is the dynamic sandbox working on pzi?)
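
Until the API example is in place, here is a minimal sketch of reading JSON with the standard json module. The response text is a simplified, hypothetical stand-in; a real API reply may be shaped differently.

```python
import json

# Hypothetical, simplified response text (not fetched from a real server).
text = '{"query": {"pages": {"1": {"title": "Oulipo"}, "2": {"title": "Georges Perec"}}}}'

data = json.loads(text)  # JSON objects become dicts, JSON arrays become lists
titles = [page["title"] for page in data["query"]["pages"].values()]
for title in titles:
    print(title)
```

Once parsed, the data is made of ordinary Python dicts and lists, so getting at a value is a matter of following the keys down the structure.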

TRANSFORMATIONS: Explore NLTK and search for interesting things

Working from the NLTK book, find algorithms/functions/transformations that seem potentially of interest from an Oulipian perspective.

OUTPUTS: Ways to make a page

HTML output

HTML to print

PDFs with Reportlab

Using RML

See http://www.reportlab.com/docs/rml-for-idiots.pdf

CSV output
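
One possible sketch, mirroring the DictReader input example: csv.DictWriter writes a list of dicts back out as CSV. The helper name `write_rows` and the sample rows are made up for the illustration.

```python
import sys
from csv import DictWriter

def write_rows(rows, fieldnames, out):
    """Write a header line, then one CSV line per dict in rows."""
    writer = DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Illustrative data: the keys of each dict must match the fieldnames.
rows = [{"word": "text", "count": 3}, {"word": "page", "count": 1}]
write_rows(rows, ["word", "count"], sys.stdout)
```

Because the output goes to stdout, the result can be redirected to a file and opened in a spreadsheet program.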

ElementTree output

ElementTree is also useful for modifying an existing web page structure for later output.

from xml.etree import ElementTree as ET
# tree is the element parsed in the scraping example above
print(ET.tostring(tree, method="html", encoding="unicode"))
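
As a sketch of the modifying step, the following numbers the headings of a page and gives each one an id before writing the page back out. The tiny page is a hypothetical stand-in for a tree parsed from a real web page, and the `section-N` naming is an arbitrary choice.

```python
from xml.etree import ElementTree as ET

# A tiny well-formed page standing in for a parsed web page (hypothetical content).
page = ET.fromstring("<html><body><h2>One</h2><p>Text</p><h2>Two</h2></body></html>")

for number, heading in enumerate(page.iter("h2"), start=1):
    heading.set("id", "section-%d" % number)          # add an attribute
    heading.text = "%d. %s" % (number, heading.text)  # rewrite the heading text

html = ET.tostring(page, method="html", encoding="unicode")
print(html)
```

The same loop should work on a tree returned by html5lib, since both expose the ElementTree interface.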

"Raw text"...

Prototyping/2018-01-17
