User:Inge Hoonte/workshop TOS

From XPUB & Lens-Based wiki

Team Timeline

Team Members: Fako, Lieven, Inge

Team Lunch: Peanutbutter, jelly, bread

What do you agree to when you sign up for a service? With all the services, email accounts, newsletters, mailing lists (etc.) we sign up to, it's well known that fewer and fewer people actually read, or are even aware of, the Terms of Service they agree to.

I accept.png

We started from Goodiff's archived changes in the Google Video TOS, an example of the terms we commonly click "agree" to when we use, view, link or upload material.

According to Goodiff, the Google Video TOS has gone through one of the highest numbers of changes over the past four years.

Revisions.png

Assuming Goodiff has a valid reason to monitor this document, we set out to visualize the differences it records between successive versions of the document. Rather than highlighting the changes, our initial proposal was to white out whatever was added or changed. Conceptually, theoretically, and ideally, the document slowly disappears over time.
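The whiting-out could be sketched roughly as follows. This is not the code we ran in the workshop: it is a minimal illustration using Python's standard difflib, and the function name white_out and the two sample sentences are made up for the example.

```python
import difflib

def white_out(old, new):
    """Return `new` with every word added or changed since `old` blanked out."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for word in new_words[j1:j2]:
            # Unchanged words are kept; inserted or replaced words
            # become a run of spaces of the same width.
            out.append(word if tag == "equal" else " " * len(word))
    return " ".join(out)

# Hypothetical sample text, not taken from the Google Video TOS.
old = "You agree to use the service"
new = "You agree to never use the paid service"
print(repr(white_out(old, new)))
```

Applied revision after revision, each update leaves more blank space on the page, which is the "disappearing" effect we had in mind. Words that were deleted simply vanish, since only the newer text is printed.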

The original:

Thestart.png

Our idealized, theoretical visualization:

Thevision.png


THE ACTUAL COURSE OF INVESTIGATION

We compared each individual paragraph to the same paragraph in the later versions of the TOS to find the differences. Working in the terminal, the idea was to display the original on the screen, and then print each later version of the paragraph underneath it whenever it had changed.

This is the code we worked with:

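from __future__ import print_function  # print() then works on Python 2 and 3

import html5lib
import lxml.cssselect
from git.repository import Repository

# Parse HTML into an lxml tree, without XHTML namespaces, so plain CSS selectors match.
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)

repo = Repository("./dataset")
path = "google/video.google.com/support/bin/answer.py?answer=31704"
path = path.split("/")

def treewalk(tree, path):
    """Follow the path components down a git tree; return None if the path is missing."""
    if not path:
        return tree
    if path[0] in tree:
        child = tree[path[0]]
        return treewalk(child, path[1:])
    return None

selector = lxml.cssselect.CSSSelector("p")
seen = {}        # blob names already processed, so unchanged files are skipped
count = 0        # number of distinct versions of the page seen so far
lasttext = None  # last paragraph printed, used to suppress repeats

# The loop variable used to be called r as well, which shadowed the repository.
for rev in repo.rev_list():
    b = treewalk(rev.tree, path)
    if b and b.name not in seen:
        page = htmlparser.parse(b.contents)
        for p in selector(page):
            text = "".join(p.itertext())
            # Follow only paragraphs starting with "7"; print one only when it
            # differs from the last one printed.
            if text.startswith("7") and text != lasttext:
                lasttext = text
                print("-" * 20, count + 1)
                print(text)

        seen[b.name] = True
        count += 1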
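The script above only prints a paragraph again when it changes; it does not yet say what changed. As a rough sketch of that missing comparison step, independent of the git and HTML libraries and using only Python's standard difflib, something like the following could work. The helper names and the sample revisions are hypothetical.

```python
import difflib

def changed_versions(revisions):
    """Yield (index, text) for each revision that differs from the one before it."""
    last = None
    for index, text in enumerate(revisions, start=1):
        if text != last:
            last = text
            yield index, text

def word_diff(old, new):
    """List the words removed ('- ') and added ('+ ') between two versions."""
    lines = difflib.ndiff(old.split(), new.split())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in lines if line[0] in "+-"]

# Hypothetical revisions of one paragraph, not real TOS text.
revisions = [
    "7. Payment is free.",
    "7. Payment is free.",
    "7. Payment is required.",
]
versions = list(changed_versions(revisions))
for (_, old), (index, new) in zip(versions, versions[1:]):
    print("-" * 20, index)
    print(word_diff(old, new))
```

Comparing word by word rather than character by character keeps the output readable in a terminal, at the cost of flagging a whole word when only its punctuation changed.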

However, not much happened.

NotthingHappens.png

For a while, that is! In the 42nd revision, all indentations were taken out of the subject headers!!

SomethingHappend.png

And at some point an entire new paragraph was added!!!

Payment.png

CHANGING THE TIDE

We decided to steer in a different direction. Up until now, we had disregarded the boundary text, the paratext: the text framing the actual TOS. Our research showed that most of the changes recorded on Goodiff actually take place there, in the paratext.

The original:
Theoriginal.png

The conceptual, theoretical, gimped result:
Thechange.png

Ideas

Having a "disappearing" text after many updates is an interesting idea. The "paratext" is indeed a key point: it could be compared across the other documents to find the paratext they have in common, leaving only the real content visible. Testing "merging" strategies (e.g. octopus, classical diff) between documents could also give a nice view of the "legal" structure shared by all TOS documents.
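A very crude first pass at finding common paratext could look like the sketch below. It assumes each document has already been reduced to a list of text lines, and it treats a line as paratext only if it appears verbatim in every document. It is a plain set intersection, not an octopus merge or a real diff, and the function name and sample documents are invented for illustration.

```python
def split_paratext(documents):
    """Split documents into shared paratext and document-specific content.

    `documents` maps a name to a list of lines and must not be empty.
    """
    line_sets = [set(lines) for lines in documents.values()]
    common = set.intersection(*line_sets)
    content = {
        name: [line for line in lines if line not in common]
        for name, lines in documents.items()
    }
    return common, content

# Hypothetical documents, not real TOS pages.
docs = {
    "video": ["Help Center", "1. You agree to the terms.", "Privacy Policy"],
    "mail": ["Help Center", "1. You may send mail.", "Privacy Policy"],
}
common, content = split_paratext(docs)
print(sorted(common))
print(content)
```

Exact matching will miss paratext that differs slightly between pages (a date, a product name), so a fuzzier comparison would probably be needed on real data.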