Prototyping/2018-01-17
INPUTS: Getting at text in various forms of (un)structure
Text in various forms of (un)structure + PYTHON basics
Reading from a text file
import sys
for line in sys.stdin:
    # each line keeps its trailing newline, so print should not add another
    print(line, end="")
stdin, stdout, stderr
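The three standard streams can be sketched roughly as follows: text comes in on stdin, results go out on stdout, and messages about the run go to stderr so they do not end up mixed into the output. The function name `copy_lines` and the in-memory demo streams are made up for illustration; in a real script you would pass `sys.stdin`.

```python
import io
import sys

def copy_lines(source, out=sys.stdout, err=sys.stderr):
    """Copy every line from source to out, then report the count on err."""
    count = 0
    for line in source:
        # Lines keep their trailing newline, so suppress print's own.
        print(line, end="", file=out)
        count += 1
    print(f"{count} lines read", file=err)
    return count

# Demo with in-memory streams standing in for the real ones.
# In a script: copy_lines(sys.stdin)
demo_out, demo_err = io.StringIO(), io.StringIO()
copy_lines(io.StringIO("first\nsecond\n"), demo_out, demo_err)
```

Because the count goes to stderr, redirecting the script's output to a file (`python script.py < in.txt > out.txt`) still shows the message in the terminal.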
Reading from a spreadsheet
CSV is short for Comma Separated Values and is a simple spreadsheet format. Excel files can be converted to CSV.
import sys
from csv import DictReader
for row in DictReader(sys.stdin):
    print(row)
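To see what DictReader hands back without needing a file, here is a small sketch that reads from an in-memory string instead of stdin. The column names and the two rows are sample data invented for this example.

```python
import io
from csv import DictReader

# Sample data standing in for a CSV file piped to stdin (made up for this sketch).
sample = io.StringIO("title,year\nExercices de style,1947\nLa Disparition,1969\n")

rows = list(DictReader(sample))
for row in rows:
    # The first line of the CSV supplies the keys; every value is a string.
    print(row["title"], row["year"])
```

Note that numbers arrive as strings too (`"1947"`), so convert them with `int()` if you need to calculate with them.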
Scraping a web page
HTML parsing used to be quite complex. Luckily, these days the html5lib library deals with most idiosyncrasies of web pages "in the wild" and gives access to the parsed page through the "ElementTree" interface.
from html5lib import parse
from urllib.request import urlopen
f = urlopen("http://www.gutenberg.org/files/56372/56372-h/56372-h.htm")
tree = parse(f, namespaceHTMLElements=False)
for x in tree.findall(".//h2"):
    print(x.text)
Maybe another example using the CIA World Factbook
Using a Feed
See http://www.bbc.co.uk/news/10628494, for instance, for a selection of feeds from the BBC News website.
from feedparser import parse
feed = parse("http://feeds.bbci.co.uk/news/rss.xml")
for item in feed.entries:
    print(item.title)
Reading JSON (mediawiki API example... is the dynamic sandbox working on pzi?)
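Until a live API example is in place, a minimal sketch of the JSON side using only the standard library: `json.loads` turns JSON text into nested Python dicts and lists. The sample text below is hand-written, and its shape (a "query" object holding a "recentchanges" list) is an assumption about what a MediaWiki API response might look like, not output fetched from a real wiki.

```python
import json

# Hand-written sample; the structure is assumed, not fetched from a real wiki.
text = '{"query": {"recentchanges": [{"title": "Prototyping"}, {"title": "Main Page"}]}}'

data = json.loads(text)
# JSON objects become dicts and JSON arrays become lists.
titles = [change["title"] for change in data["query"]["recentchanges"]]
for title in titles:
    print(title)
```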
TRANSFORMATIONS: Explore NLTK and search for interesting things
Working from the NLTK book, find algorithms/functions/transformations that seem potentially of interest from an Oulipodlian perspective.
OUTPUTS: Ways to make a page
HTML output
HTML to print
PDFs with Reportlab
Using RML
See http://www.reportlab.com/docs/rml-for-idiots.pdf
CSV output
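A possible starting point, mirroring the DictReader example: DictWriter from the same csv module writes a list of dicts back out as CSV. The rows and column names are made up for this sketch, and it writes to an in-memory buffer; pass `sys.stdout` instead to print directly.

```python
import io
from csv import DictWriter

# Made-up rows; in practice these would come from an earlier processing step.
rows = [{"word": "text", "count": 3}, {"word": "page", "count": 1}]

out = io.StringIO()  # swap in sys.stdout to print directly
writer = DictWriter(out, fieldnames=["word", "count"])
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dict
csv_text = out.getvalue()
print(csv_text, end="")
```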
ElementTree output
ElementTree is also useful for modifying an existing web page structure for later output.
from xml.etree import ElementTree as ET
print (ET.tostring(tree, method="html", encoding="unicode"))
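As a sketch of the "modify, then output" idea: the snippet below changes every h2 in a small tree and serializes the result. The HTML string is a stand-in written for this example (in the scraping example the tree would come from html5lib's parse), and the "shout" class name is arbitrary.

```python
from xml.etree import ElementTree as ET

# Stand-in for a parsed page; made up for this sketch.
tree = ET.fromstring("<html><body><h2>Old title</h2><p>Text</p></body></html>")

for heading in tree.findall(".//h2"):
    # Change the text and add an attribute in place.
    heading.text = heading.text.upper()
    heading.set("class", "shout")

html = ET.tostring(tree, method="html", encoding="unicode")
print(html)
```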
"Raw text"...