User:Inge Hoonte/workshop TOS

Revision as of 16:50, 16 March 2011

Team Timeline

Team Members: Fako, Lieven, Inge

Team Lunch: Peanutbutter, jelly, bread

What do you agree to when you sign up for a service? With all the services, email accounts, newsletters, mailing lists, and so on that we sign up for, it is well known that fewer and fewer people actually read, or are even aware of, the Terms of Service they agree to.

We took Goodiff's archived changes to the Google Video TOS as our starting point, an example of the terms we commonly click "agree" to when we use, view, link to, or upload material.

According to Goodiff, Google Video is among the services that made the most changes over the past four years.

Assuming Goodiff has a valid reason to monitor this document, we set out to visualize the differences it records between successive versions of a document. Rather than highlighting the changes, our initial proposal was to white out whatever was added or changed. Conceptually, theoretically, and ideally, the document would slowly disappear over time.
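The white-out idea can be sketched in a few lines. This is only an illustration, not the code we ran: it assumes two versions of a paragraph are available as plain strings, compares them word by word with Python's standard difflib module, and replaces every added or changed word with blanks of the same width. The function name white_out and its blank parameter are made up for this example.

```python
import difflib


def white_out(old, new, blank=" "):
    """Return `new` with every added or changed word replaced by blanks."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "delete" opcodes cover no words of `new`, so they add nothing here
        for word in new_words[j1:j2]:
            if tag == "equal":
                out.append(word)
            else:
                # "insert" or "replace": keep the width, drop the ink
                out.append(blank * len(word))
    return " ".join(out)


if __name__ == "__main__":
    before = "You agree to the terms"
    after = "You hereby agree to the new terms"
    # Underscores make the whited-out words visible in a terminal
    print(white_out(before, after, blank="_"))
```

Feeding each result back in as the next "old" version is one possible way to let the blanks accumulate across revisions; whether that matches the idealized image is a matter of taste.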

The original:

Our idealized, theoretical visualization:

THE ACTUAL COURSE OF INVESTIGATION

We compared each paragraph with the same paragraph in the following version of the TOS to find the differences. Working in the terminal, the idea was to display the original on the screen and then print each changed version of that paragraph underneath it.

This is the code we worked with:

import html5lib, lxml, lxml.cssselect
from git.repository import Repository

# Parse with html5lib, building an lxml tree so CSS selectors can be used on it
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
selector = lxml.cssselect.CSSSelector("p")

repo = Repository("./dataset")
path = "google/video.google.com/support/bin/answer.py?answer=31704"
path = path.split("/")

def treewalk(tree, path):
    # Follow the path components down the git tree; None if the file is missing
    if not path:
        return tree
    if path[0] in tree:
        child = tree[path[0]]
        return treewalk(child, path[1:])
    return None

seen = {}
count = 0
lasttext = None
# The loop variable no longer shadows the repository object
for rev in repo.rev_list():
    b = treewalk(rev.tree, path)
    # Skip revisions without the file, and blobs that were already processed
    if b and b.name not in seen:
        page = htmlparser.parse(b.contents)
        for p in selector(page):
            text = "".join(p.itertext())
            # Follow only paragraphs starting with "7"; print when the text changes
            if text.startswith("7") and text != lasttext:
                lasttext = text
                print("%s %d" % ("-" * 20, count + 1))
                print(text)

        seen[b.name] = True
        count += 1
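The paragraph-by-paragraph comparison described above can also be shown without the git and HTML machinery. The following is a minimal sketch, assuming each version of the TOS has already been reduced to a list of paragraph strings; the helper names changed_paragraphs and print_changes are invented for this illustration and are not part of the script we used.

```python
from itertools import zip_longest


def changed_paragraphs(old_paragraphs, new_paragraphs):
    """Yield (position, old, new) for every paragraph position whose text differs."""
    # Pad the shorter version with empty strings so trailing additions show up
    pairs = zip_longest(old_paragraphs, new_paragraphs, fillvalue="")
    for position, (old, new) in enumerate(pairs, start=1):
        if old != new:
            yield position, old, new


def print_changes(old_paragraphs, new_paragraphs):
    """Print the original paragraph with its changed version underneath."""
    for position, old, new in changed_paragraphs(old_paragraphs, new_paragraphs):
        print("-" * 20, position)
        print(old)
        print(new)


if __name__ == "__main__":
    print_changes(["1. Intro", "7. Content"], ["1. Intro", "7. Your content", "8. New"])
```

Comparing by position is the simplest choice, but it has a known weakness: a paragraph inserted in the middle shifts every later paragraph by one, so all of them are reported as changed.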

However, not much happened.

For a while, that is! In the 42nd revision, all indentations were taken out of the subject headers!!

And at some point an entire new paragraph was added!!!
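Changes such as the vanished indentation are easy to confuse with changes to the wording. A rough sketch of how the two could be told apart, assuming both versions of a paragraph are plain strings; classify_change and its three labels are hypothetical names chosen for this example.

```python
def normalize_whitespace(text):
    """Strip indentation and collapse every run of whitespace to one space."""
    return " ".join(text.split())


def classify_change(old, new):
    """Label a pair of paragraphs as 'same', 'whitespace' or 'text'."""
    if old == new:
        return "same"
    if normalize_whitespace(old) == normalize_whitespace(new):
        # Only spacing or indentation differs; the words are identical
        return "whitespace"
    return "text"


if __name__ == "__main__":
    print(classify_change("    7. Content", "7. Content"))
    print(classify_change("7. Content", "7. Your content"))
```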

CHANGING THE TIDE

We decided to steer in a different direction. Up until now, we had disregarded the boundary text, the paratext: the text framing the actual TOS. Our research showed that most of the changes recorded on Goodiff actually take place there, in the paratext.

The original:

The conceptual, theoretical, gimped result: