User:Inge Hoonte/workshop TOS
Latest revision as of 13:50, 19 April 2011
Team Timeline
Team Members: Fako, Lieven, Inge
Team Lunch: Peanutbutter, jelly, bread
What do you Agree to when you sign up for a service? With all the services, email accounts, newsletters, mailing lists (etc.) we sign up for, it is well known that fewer and fewer people actually read, or are even aware of, the Terms of Service they agree to.
We started from Goodiff's archived changes in the Google Video TOS, an example of the terms we commonly click "agree" to when we use, view, link or upload material.
According to Goodiff, the Google Video TOS has seen one of the highest numbers of changes over the past four years.
Assuming Goodiff has a valid reason to monitor this document, we set out to visualize the difference it records between updates of documents. Rather than highlighting the changes, our initial proposal was to white out what was added or changed. Conceptually, theoretically, and ideally, the document will slowly disappear over time.
The original:
Our idealized, theoretical visualization:
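The white-out idea can be sketched with Python's standard difflib module. This is only an illustration of the concept, not the code used in the workshop: the function name white_out and the sample sentences are hypothetical, and the comparison is done word by word, which is an assumption.

```python
import difflib


def white_out(old_text, new_text):
    """Return new_text with every added or changed word replaced by spaces."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for word in new_words[j1:j2]:
            if tag == "equal":
                # Unchanged word: keep it visible
                result.append(word)
            else:
                # Inserted or replaced word: blank it out, keeping its width
                result.append(" " * len(word))
    return " ".join(result)


# Hypothetical example sentences, not taken from the actual TOS
old = "You agree to the terms"
new = "You agree to the updated terms"
print(white_out(old, new))
```

Applied to each successive revision, with the blanked words carried forward, more and more of the text would turn white.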
THE ACTUAL COURSE OF INVESTIGATION
We compared each individual paragraph to the same paragraph in the later TOS to find the differences. Working in the terminal, the idea was to display the original on the screen, and then print each subsequent changed version of that paragraph underneath it.
This is the code we worked with:
import html5lib, lxml, lxml.cssselect
from git.repository import Repository

# html5lib parser that builds an lxml tree without XHTML namespaces
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)

repo = Repository("./dataset")
path = "google/video.google.com/support/bin/answer.py?answer=31704"
path = path.split("/")

def treewalk(tree, path):
    # Follow the path components down the git tree.
    # Returns None when a component is missing.
    if not path:
        return tree
    if path[0] in tree:
        child = tree[path[0]]
        return treewalk(child, path[1:])

seen = {}        # blob names that were already processed
count = 0        # number of distinct versions seen so far
lasttext = None  # last printed version of the paragraph
selector = lxml.cssselect.CSSSelector("p")

for rev in repo.rev_list():
    b = treewalk(rev.tree, path)
    if b and b.name not in seen:
        page = htmlparser.parse(b.contents)
        for p in selector(page):
            text = "".join(p.itertext())
            # Only follow the paragraph starting with "7"; print it when it changes
            if text.startswith("7") and text != lasttext:
                lasttext = text
                print("-" * 20 + " " + str(count + 1))
                print(text)
        # Mark this version of the file as processed
        seen[b.name] = True
        count += 1
However, not much happened.
For a while, that is! In the 42nd revision, all indentations were taken out of the subject headers!!
And at some point an entire new paragraph was added!!!
CHANGING THE TIDE
We decided to steer in a different direction. Up until now, we had left out of our analysis all the boundary text, the paratext, the text framing the actual TOS. From our research, it turned out that most of the changes recorded on Goodiff actually take place there, in the paratext.
The conceptual, theoretical, gimped result:
Ideas
Having a "disappearing" text after many updates is an interesting idea. The "paratext" is indeed a key point: it could be compared across the other documents to find the paratext they have in common and so reveal the real content. Testing "merging" strategies (e.g. octopus, classical diff) between documents could also give a nice view of the "legal" structure shared by all TOS documents.
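One possible way to approach the common paratext, sketched under strong assumptions: each document is treated as a list of text lines, and any line that appears verbatim in every document is counted as paratext. The function name split_paratext and the sample documents are hypothetical, and real pages would likely need fuzzier matching than exact line equality.

```python
def split_paratext(documents):
    """Split documents into shared paratext and per-document content.

    documents maps a document name to its list of text lines.
    Returns (common, content): the set of lines found in every document,
    and a dict with the remaining lines of each document.
    """
    line_sets = [
        set(line.strip() for line in lines if line.strip())
        for lines in documents.values()
    ]
    # Lines present in all documents are assumed to be paratext
    common = set.intersection(*line_sets) if line_sets else set()
    content = {
        name: [
            line.strip()
            for line in lines
            if line.strip() and line.strip() not in common
        ]
        for name, lines in documents.items()
    }
    return common, content


# Hypothetical sample pages, not real TOS text
pages = {
    "service_a": ["Home | Help", "You may upload videos.", "(c) Example"],
    "service_b": ["Home | Help", "You may not upload videos.", "(c) Example"],
}
common, content = split_paratext(pages)
print(sorted(common))
print(content)
```

The same split could be run over successive revisions of a single TOS instead of over different services, to separate the stable frame from the text that actually changes.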