User:Inge Hoonte/workshop TOS: Difference between revisions
Revision as of 16:50, 16 March 2011
Team Timeline
Team Members: Fako, Lieven, Inge
Team Lunch: Peanutbutter, jelly, bread
What do you agree to when you sign up for a service? With all the services, email accounts, newsletters, and mailing lists we sign up for, it is well known that fewer and fewer people actually read, or are even aware of, the Terms of Service (TOS) they agree to.
We started from Goodiff's archived changes in the Google Video TOS, an example of the terms we commonly click "agree" to when we use, view, link or upload material.
According to Goodiff, the Google Video TOS has undergone one of the highest numbers of changes over the past four years.
Assuming Goodiff has a valid reason to monitor this document, we set out to visualize the differences it records between updates of the document. Rather than highlighting the changes, our initial proposal was to white out whatever was added or changed. Conceptually, theoretically, and ideally, the document would slowly disappear over time.
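The white-out idea can be sketched in a few lines. This is only an illustration under our own assumptions, not the code used in the workshop: it compares two versions word by word with Python's standard `difflib` module, and the function name `white_out` and the `blank` parameter are hypothetical.

```python
import difflib

def white_out(old_text, new_text, blank=" "):
    """Return new_text with every added or changed word blanked out."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" words survive; inserted or replaced words are whited out.
        # Deleted words have no counterpart in new_text, so nothing is emitted.
        for word in new_words[j1:j2]:
            if tag == "equal":
                out.append(word)
            else:
                out.append(blank * len(word))
    return " ".join(out)

# Hypothetical example sentences, not taken from the actual TOS.
print(white_out("You agree to the terms", "You agree to the updated terms", blank="_"))
```

Applying the function to each new revision against the first one would make more and more of the text vanish as changes accumulate.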
The original:
Our idealized, theoretical visualization:
THE ACTUAL COURSE OF INVESTIGATION
We compared each individual paragraph to the same paragraph in the later versions of the TOS to find the differences. Working in the terminal, the idea was to display the original on the screen, and then print each subsequent changed version of that paragraph underneath it.
This is the code we worked with:
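import html5lib, lxml, lxml.cssselect
from git.repository import Repository

# html5lib parser that builds an lxml tree without XHTML namespaces,
# so that the plain CSS selector "p" matches.
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)

repo = Repository("./dataset")
path = "google/video.google.com/support/bin/answer.py?answer=31704"
path = path.split("/")

def treewalk(tree, path):
    # Follow the path components down the git tree.
    # Returns None when a component is missing in this revision.
    if not path:
        return tree
    if path[0] in tree:
        child = tree[path[0]]
        return treewalk(child, path[1:])

seen = {}        # blob names that were already processed
count = 0        # number of distinct versions of the page seen so far
lasttext = None  # last printed version of the paragraph
selector = lxml.cssselect.CSSSelector("p")

for rev in repo.rev_list():
    b = treewalk(rev.tree, path)
    if b and b.name not in seen:
        page = htmlparser.parse(b.contents)
        for p in selector(page):
            text = "".join(p.itertext())
            # Only follow the paragraph that starts with "7",
            # and print it only when it differs from the last printed version.
            if text.startswith("7") and text != lasttext:
                lasttext = text
                print("%s %d" % ("-" * 20, count + 1))
                print(text)
        seen[b.name] = True
        count += 1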
However, not much happened.
For a while, that is! In the 42nd revision, all indentations were taken out of the subject headers!!
And at some point an entire new paragraph was added!!!
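The paragraph tracking described above can also be tried without the git archive. The following is a minimal sketch under our own assumptions: the revisions are given as a plain list of HTML strings, and the standard library's `html.parser` stands in for html5lib and lxml. The names `ParagraphCollector`, `extract_paragraphs` and `track_paragraph` are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> also ends a previous, unclosed one.
            self._flush()
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def _flush(self):
        if self._inside:
            self.paragraphs.append("".join(self._chunks))
        self._chunks = []
        self._inside = False

def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up a trailing <p> that was never closed
    return collector.paragraphs

def track_paragraph(revisions, prefix):
    """Return (revision number, text) for each changed version of the
    paragraph whose text starts with prefix."""
    versions = []
    last = None
    for number, html in enumerate(revisions, start=1):
        for text in extract_paragraphs(html):
            if text.startswith(prefix) and text != last:
                versions.append((number, text))
                last = text
    return versions
```

Calling `track_paragraph(revisions, "7")` on a list of page versions mirrors the `startswith("7")` filter in the code above: unchanged revisions are skipped, and only the versions in which the paragraph differs are reported.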
CHANGING THE TIDE
We decided to steer in a different direction. Up until now, we had left out of our analysis all the boundary text, the paratext: the text framing the actual TOS. Our research showed that most of the changes recorded on Goodiff actually take place there, in the paratext.