ACHDI what has been done so far
Found These:
https://github.com/w3c
https://dev.w3.org/cvsweb/
Latest revision as of 11:32, 11 March 2025
Initial Idea
The idea came from our discussions on A Dao of Web Design (no thanks to the author).
Starting from the idiom "Give a man a fish and he is fed for a day; teach him how to fish and he is fed for life," we decided to ask the question "But who makes the fishhook?"
Through this special issue we delved into the materiality of the web and focused on different CSS properties. As a result, we got curious about who decides on the properties, how and when they do so, and which companies they are affiliated with. (And trust me, after spending enough time browsing through the specs you keep seeing the same names over and over again.)
So we decided to scrape all the information we can get on who/when/from where etc. proposed and _____ the different CSS properties. Initially the aim was to turn all the information we scrape into a plug-in (still a viable option, needs further discussions) that shows who came up with the properties used for the website you are currently on.
In any case, the code is shared on GitHub, and our hope is that at least it can be used by anyone who wants to get information from the W3C website and its significantly difficult-to-navigate archives in an organized manner.
We would appreciate any ideas/suggestions/questions on how we can present it, if we decide not to go with the plug-in idea.
What has been done so far
Scrapper.ts
Disclaimer: this was our first trial, and afterwards we changed approach. The explanation of this part is here mostly for archival purposes.
The idea was to write TypeScript code to scrape all the data on CSS properties from their spec sheets. We decided to follow each property. To reach the specs we used the shorthand property list found on developer.mozilla. We first checked whether the property in question exists in the doc (from the index), then scraped the authors (it took some time to separate editors from co-editors), the company the authors are from, the date, and the previous versions, and then wrote a loop to keep scraping the same data from the previous versions.
As more properties and more spec sheets were scraped, we started running into an issue. By then we were also collecting the doc type, to understand whether a spec sheet was a working draft, editor's draft etc. Although most of the initial specs with the same doc type had the same HTML structure, issues started arising as the previous versions got older. This led us to realize how many different functions (or if/else if statements (?)) would be needed to scrape all the necessary information for a single property (especially considering that some properties exist in multiple docs that are all written in different ways).
This resulted in Fred going "Wait a second, I bet we can just scrape all the spec sheets for the properties instead" (or something like that, I wasn't really there). So he started coding right away, which leads our journey to....
getSpecs.ts
So this starts by getting all the spec sheets from W3C, from the working/editor's drafts and docs at https://www.w3.org/TR/?filter-tr-name=Css, and there is apparently also https://drafts.csswg.org/ (all the working drafts). It scrapes all the spec sheets we can get our hands on, and their previous versions if they are linked, then scrapes the information above.
[
"https://www.w3.org/TR/css-grid-3/",
"https://www.w3.org/TR/2025/WD-css-grid-3-20250207/",
"https://www.w3.org/TR/2024/WD-css-grid-3-20241003/",
"https://www.w3.org/TR/2024/WD-css-grid-3-20240919/",
"https://www.w3.org/TR/2020/CRD-css-grid-1-20201218/",
"https://www.w3.org/TR/2020/CRD-css-grid-1-20201021/",
... 1138
]
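The dated URLs in this list already carry some information on their own. As a rough sketch (not taken from getSpecs.ts; the helper name `parseDatedSpecUrl` and the `DatedSpecUrl` type are made up for illustration), a URL of the form /TR/year/STATUS-shortname-YYYYMMDD/ could be split into its parts like this:

```typescript
// Hypothetical helper: split a dated W3C /TR/ URL into its parts.
interface DatedSpecUrl {
  year: number;
  status: string; // e.g. "WD" or "CRD"
  shortname: string; // e.g. "css-grid-3"
  date: string; // ISO form, e.g. "2025-02-07"
}

function parseDatedSpecUrl(url: string): DatedSpecUrl | null {
  // Matches .../TR/<year>/<STATUS>-<shortname>-<YYYYMMDD>/
  const match = url.match(
    /\/TR\/(\d{4})\/([A-Za-z]+)-(.+)-(\d{4})(\d{2})(\d{2})\/?$/,
  );
  if (!match) return null; // undated "latest version" URLs end up here
  const [, year, status, shortname, y, m, d] = match;
  return { year: Number(year), status, shortname, date: `${y}-${m}-${d}` };
}
```

For the second entry of the list this gives status "WD", shortname "css-grid-3" and date "2025-02-07"; the undated first entry returns null, so the date for it still has to come from the page itself.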
getSpecInfo.ts
So this function can be run on any list of spec sheets to scrape the data out of them.
We are currently running the functions iteratively and fixing problems as they fail, since all the spec sheets are formatted differently.
So far we have: date, this spec URL, this doc name, and abstract (most of the time).
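Since the spec sheets are formatted differently, one way to keep each getter manageable is to give it an ordered list of strategies and take the first one that finds something. This is only a sketch of that idea, not the code in getSpecInfo.ts: it works on the raw HTML string rather than a parsed page, the names `Strategy`, `firstMatch` and `getDateSketch` are invented here, and the `<time datetime="...">` markup is an assumption about how some specs mark up their date.

```typescript
// A strategy looks at the page HTML and URL and returns a value,
// or null if it cannot find one.
type Strategy = (html: string, url: string) => string | null;

// Assumption: some specs mark the date up as <time datetime="YYYY-MM-DD">.
const dateFromTimeTag: Strategy = (html) => {
  const m = html.match(/<time[^>]*\bdatetime="(\d{4}-\d{2}-\d{2})"/i);
  return m ? m[1] : null;
};

// Fallback: dated URLs end in -YYYYMMDD/.
const dateFromUrl: Strategy = (_html, url) => {
  const m = url.match(/-(\d{4})(\d{2})(\d{2})\/?$/);
  return m ? `${m[1]}-${m[2]}-${m[3]}` : null;
};

// Try each strategy in order and return the first non-null result.
function firstMatch(
  strategies: Strategy[],
  html: string,
  url: string,
): string | null {
  for (const strategy of strategies) {
    const value = strategy(html, url);
    if (value !== null) return value;
  }
  return null;
}

function getDateSketch(html: string, url: string): string | null {
  return firstMatch([dateFromTimeTag, dateFromUrl], html, url);
}
```

When an older spec turns up with yet another layout, the fix is one more small strategy added to the list instead of another branch inside a long if/else chain.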
To run it we can say:
deno run -A getSpecInfo.ts test 10
This will run our functions on 10 random spec sheets from our list of spec sheets.
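How the 10 spec sheets are picked is not written down here, so the following is just one plausible way to do it: a partial Fisher-Yates shuffle that returns distinct random items and leaves the full list untouched. The function name `sampleRandom` is hypothetical.

```typescript
// Return `count` distinct items picked at random, leaving the input untouched.
function sampleRandom<T>(items: readonly T[], count: number): T[] {
  const pool = [...items];
  // Never ask for more items than exist, and never for a negative amount.
  const n = Math.max(0, Math.min(count, pool.length));
  // Partial Fisher-Yates shuffle: only the first n slots need to be settled.
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```

Because each pick swaps an item out of the remaining pool, the same spec sheet cannot be chosen twice in one test run.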
let thisSpecsInfo: SpecSheet = {
  authors: await getAuthors($specSheet, specSheet),
  editors: await getEditors($specSheet, specSheet),
  date: await getDate($specSheet, specSheet),
  thisSpecUrl: specSheet,
  thisDocName: await getDocName($specSheet, specSheet),
  type: await getType($specSheet, specSheet),
  properties: await getProps($specSheet, specSheet),
  abstract: await getAbstract($specSheet, specSheet),
};
Fred wrote a few tools so that we can run the function on a random number of the specs (very useful considering we have reached over a thousand so far and it takes time), and also a tool to only run the function on the specs that had an error the last time they were scraped. This saves a lot of time and makes it easier to go back to the functions and edit them, so that we can keep fixing the issues without needing to take 20 min long ciggie breaks all the time.
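To make the "only rerun what failed" idea concrete, here is a small sketch. It assumes each failure from the previous run was recorded with the spec URL and the field that failed; the `ScrapeError` shape and both function names are invented for illustration and may not match what the real tools store.

```typescript
// Hypothetical shape of one entry in the error log from the previous run.
interface ScrapeError {
  specUrl: string;
  field: string; // which getter failed, e.g. "date" or "abstract"
  message: string;
}

// Group errors by the field that failed, so the most common problems stand out.
function groupByField(errors: ScrapeError[]): Map<string, ScrapeError[]> {
  const groups = new Map<string, ScrapeError[]>();
  for (const error of errors) {
    const bucket = groups.get(error.field) ?? [];
    bucket.push(error);
    groups.set(error.field, bucket);
  }
  return groups;
}

// The unique spec URLs with at least one error: the list a rerun would revisit.
function brokenSpecs(errors: ScrapeError[]): string[] {
  return Array.from(new Set(errors.map((e) => e.specUrl)));
}
```

Feeding the result of `brokenSpecs` back into the scraper means a fix to one getter can be checked against only the pages that tripped it up, instead of all thousand-plus spec sheets.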
Fun features:
- Script Collects and Categorises errors, so we can find issues quickly
- Our CLI takes arguments e.g.
- test + optional number > runs our script on randomly selected spec sheets
- spec + url > runs script on one named spec sheet
- broken > runs just the spec sheets that had errors last time
- focus > only scrapes for certain params e.g. date, abstract etc
- we have a progress bar
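The CLI modes in the list above could be dispatched along these lines. This is a sketch, not the actual argument handling: the `Command` type and `parseCommand` are made-up names, the default of 10 for test is an assumption, and so is treating focus as a mode of its own that takes field names.

```typescript
// Hypothetical result type for the four modes described in the list.
type Command =
  | { mode: "test"; count: number }
  | { mode: "spec"; url: string }
  | { mode: "broken" }
  | { mode: "focus"; fields: string[] };

function parseCommand(args: string[]): Command {
  const [mode, ...rest] = args;
  switch (mode) {
    case "test": {
      // The number is optional; the default of 10 is an assumption.
      const count = rest.length > 0 ? Number(rest[0]) : 10;
      if (!Number.isInteger(count) || count < 1) {
        throw new Error(`Invalid count: ${rest[0]}`);
      }
      return { mode: "test", count };
    }
    case "spec":
      if (rest.length === 0) throw new Error("spec needs a URL");
      return { mode: "spec", url: rest[0] };
    case "broken":
      return { mode: "broken" };
    case "focus":
      if (rest.length === 0) throw new Error("focus needs at least one field");
      return { mode: "focus", fields: rest };
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}
```

In Deno the arguments after the script name are available as `Deno.args`, so `parseCommand(Deno.args)` would turn `test 10` into a test command with a count of 10.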
What needs to be done
- Finish the code
- Authors
- Editors (contributors?)
- type
- company the authors and editors are tied to
- Figure out how to present it
- Could it be a timeline of companies/editors over time?
Glossary specifically for İmre because what is going on???
Puppeteer: Puppeteer is a JavaScript library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default.
TypeScript: TypeScript adds additional syntax to JavaScript to support a tighter integration with your editor. Catch errors early in your editor.
DOM: The Document Object Model (DOM) is a programming interface for web documents. It represents the page so that programs can change the document structure, style, and content. The DOM represents the document as nodes and objects; that way, programming languages can interact with the page.
A web page is a document that can be either displayed in the browser window or as the HTML source. In both cases, it is the same document but the Document Object Model (DOM) representation allows it to be manipulated. As an object-oriented representation of the web page, it can be modified with a scripting language such as JavaScript.
Cheerio: Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript which is common for a SPA (single page application). This makes Cheerio much, much faster than other solutions. If your use case requires any of this functionality, you should consider browser automation software like Puppeteer and Playwright or DOM emulation projects like JSDom.
async: The async function declaration creates a binding of a new async function to a given name. The await keyword is permitted within the function body, enabling asynchronous, promise-based behavior to be written in a cleaner style and avoiding the need to explicitly configure promise chains.
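A tiny side-by-side may help with this one. The helper `fetchTitle` below is a made-up stand-in for something slow like a network request; the two functions after it do exactly the same thing, once with a promise chain and once with async/await.

```typescript
// Stand-in for a slow operation such as a network request (hypothetical helper).
function fetchTitle(url: string): Promise<string> {
  return new Promise((resolve) =>
    setTimeout(() => resolve(`Title of ${url}`), 10)
  );
}

// Promise-chain style: the next step goes inside .then().
function titleLengthChained(url: string): Promise<number> {
  return fetchTitle(url).then((title) => title.length);
}

// The same logic with async/await: it reads top to bottom like ordinary code.
async function titleLengthAwaited(url: string): Promise<number> {
  const title = await fetchTitle(url);
  return title.length;
}
```

Both versions return a promise, so the caller still has to await the result (or use .then()); async/await only changes how the function body is written.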