Digital zines I: PDF
PDF zining
Today: digital zine making with PDF (using the commandline and bash scripting)
A Zine Protocol?
In a recent blog post titled Non-realtime publishing for censorship resistance, the author, Björn Edström, (perhaps ironically) proposes zines as a novel technical protocol (or “class of services”), an alternative to traditional online publishing, which he sees as fragile and too easily censored.
The main idea is that you have one or more owners. Think of an owner as a public/private keypair. The owner manipulates (offline) a collection of files (a “zine”). This collection is encrypted, signed, and published periodically. It doesn’t matter where it is published: it could be anywhere on the Internet, in an actual magazine, or maybe on a Tor hidden service.
The users can perform actions on the zine by relaying (somehow) requests to the owners. The owners may then choose to apply the action, thus manipulating/updating the collection, and then publish a new collection/zine for consumption by the users. In practice they will probably have a little application for manipulating the zine -- a command line tool, text editor, or something -- that handles the formatting/marshalling.
Note: Marshalling, a term with a military origin, in a technical sense refers to packaging digital material so that it can be transmitted and shared. source
Post-digital
In the post, Edström addresses a technical audience, warning of the “centralizing” tendencies and fragility of many technical solutions, despite the fact that online publishing is often portrayed as inherently decentralized. You could see this post as part of a larger contemporary backlash against many of the promises of the Web and the Internet (what could be called “Web 2.0”): here a programmer realizes that, despite the conventional wisdom (for instance, that the Internet is inherently decentralized and thus somehow impervious to censorship), earlier technologies, media, and publishing practices may in fact already have done a better job. This movement could be considered part of what’s called the “post-digital”. What’s not clear in the blog post, however, is whether Edström is interested in the actual history of zine publishing, and whether he thinks specific lessons might be taken from this already established practice.
DIY: The rise of Lo-Fi Culture, Amy Spencer
Amy Spencer’s DIY: The rise of Lo-Fi Culture (pdf) is a telling of the history of zines, which she defines as “non-commercial, small-circulation publications which are produced and distributed by their creators”.
Tom Jennings
Tom Jennings is an interesting figure at the intersection of two histories. First, he is one of the key authors of homocore, a queer zine described by Spencer. Second, Jennings is the author of the fidonet protocol and plays an important role in the history of early bulletin board systems. Jennings is featured in the BBS documentary, a project of Jason Scott of Archive Team.
- bbs: baud
- bbs: sysops
- bbs: fidonet
Commandline
In addition to a certain DIY aesthetic, doing things on the commandline is quite powerful: whatever can be accomplished once, no matter how obscure the commands or how much effort it took to get a script working, can then become part of a larger program (say, in a loop) and form the basis of novel publishing workflows.
BASH Loops
BASH is the so called "Bourne again" shell -- aka the command line -- and it's a scripting language.
In programming there are only 3 important cases: zero, one, and more than one.
When you have “more than one” thing – think loop. In BASH there are loops modelled on how they work in C.
```bash
#!/bin/bash
for (( c=1; c<=5; c++ ))
do
    echo "Welcome $c times"
done
```
You can also nest loops. It may be useful to think of a clock when nesting loops -- the outer loop is the "slower" of the two (hour hand) compared to the faster inner loop (minute hand).
```bash
for (( h=0; h<=23; h++ ))
do
    for (( m=0; m<=59; m++ ))
    do
        echo "The time is: $h:$m"
    done
done
```
Concrete example: Downloading an image from online
A useful commandline tool for downloading things that are online is wget. (See also: curl.)
Let’s say we want to download all the images from homocore issue #1. They are (happily, but precariously) still available via links held in the “wayback” machine hosted by archive.org.
Downloading the images:
https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1.html
Following the link:
<https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1/1.JPG>
try with wget:
wget https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1/1.JPG
Exercise: Adapt the loop code to call wget multiple times, using the variable in the command to download each page of the PDF.
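One way to start the exercise: rather than downloading directly, a loop can print the wget command for each page, which you can inspect and then pipe into bash to actually run. A minimal sketch, reusing the wayback URL from above (the page count of 5 here is an arbitrary placeholder):

```shell
# print one wget command per page of homocore issue #1
base=https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1
for (( p=1; p<=5; p++ ))
do
    echo "wget $base/$p.JPG"
done
```

Saving this as download.sh, `bash download.sh` shows the commands, and `bash download.sh | bash` runs them.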
Zero padding
Zero padding (adding extra 0's to the start of a numeric filename) is good practice, as it allows numerically named files to be lexically sorted (i.e. alphabetically) and end up in the same order as if they were numerically sorted. Without zero padding, the default (alphabetical) ordering of files places "10.jpg" before "2.jpg", while "02.jpg" and "10.jpg" sort as expected.
printf %04d 5
outputs
0005
Command substitution
There are two forms:
$(command)
and
`command`
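Command substitution lets you use the output of one command (like the printf above) inside another, for example to build a zero-padded filename. A small sketch showing both forms:

```shell
# build a zero-padded filename with command substitution
n=5
name="page_$(printf %04d $n).jpg"    # $() form
name2=page_`printf %04d $n`.jpg      # backtick form, equivalent
echo $name
```

The `$()` form is generally preferred because it nests cleanly, but the looping downloader below uses backticks, which work just as well in simple cases.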
Zero padding files after the fact
If you end up with unpadded filenames anyway, you can fix them after the fact. The following command uses the rename utility and a Perl regular expression:
rename 's/\d+/sprintf("%04d",$&)/e' foo*
Source: https://stackoverflow.com/questions/55754/how-to-zero-pad-numbers-in-file-names-in-bash#4561801
stdout/stderr/stdin
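Every command has three standard streams: stdin (input), stdout (normal output), and stderr (error messages). stdout and stderr can be redirected independently, which is how the downloader script later silences wget's chatter with `2> /dev/null`. A quick sketch:

```shell
# stdout and stderr are separate streams and can be redirected independently
echo "a normal message"          # goes to stdout
echo "an error message" >&2      # goes to stderr
# 2> redirects stderr; /dev/null discards whatever is written to it
ls /no/such/path 2> /dev/null || true
```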
Imagemagick!
See: ImageMagick
BASH tests
https://ss64.com/bash/test.html
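The `[ ... ]` tests combine with if to make decisions in a script. The downloader below uses `-s`, which is true when a file exists and is not empty, to detect when wget has saved an empty file (i.e. there are no more pages). A small sketch:

```shell
# -s tests whether a file exists and is non-empty
tmp=$(mktemp)                 # mktemp creates a new, empty temporary file
if [ -s "$tmp" ]
then
    echo "file has content"
else
    echo "file is empty"
fi
echo "some content" > "$tmp"
[ -s "$tmp" ] && echo "now it has content"
rm "$tmp"
```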
A looping downloader with zero-padded filenames
```bash
#!/bin/bash
for (( i=1; i<=7; i++ ))
do
    echo "Starting to download issue $i..."
    mkdir -p `printf %02d $i`
    for (( p=1; p<=100; p++ ))
    do
        outname=`printf %02d $i`/`printf %02d $i`_`printf %02d $p`.jpg
        echo "Downloading $outname"
        wget https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/$i/$p.JPG -O $outname 2> /dev/null
        if [ ! -s $outname ]
        then
            echo "No more pages..."
            break
        fi
    done
done
```
Resize all the images to be smaller
NB: the mogrify command changes images in place (i.e. it destroys the originals), so make a copy of your images first:
mkdir originals
cp *.JPG originals
then you could:
mogrify -resize 120x *.JPG
Imagemagick can do many things
See: http://www.imagemagick.org/Usage/
Make a single PDF of all the images
You can “bind” all the images together into a single PDF file with imagemagick:
convert *.JPG zine.pdf
Exercise:
Create a pdf with 100 numbered pages.
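A sketch of one approach, combining the loop, zero padding, and command substitution from above: print one ImageMagick command per page (using `label:` to render the page number as an image) plus a final command binding the pages into a PDF, then pipe the output into bash to run it. This assumes ImageMagick is installed; the 595x842 size (roughly A4 at 72dpi) is an arbitrary choice.

```shell
# emit the commands to create 100 numbered page images, then bind them to a PDF
for (( p=1; p<=100; p++ ))
do
    echo "convert -size 595x842 -gravity center label:$p $(printf page_%03d.png $p)"
done
echo "convert page_*.png numbered.pdf"
```

Run with `bash makepages.sh | bash`. Generating the command list first makes it easy to check before anything is actually created.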
Running pdfsandwich….
pdfsandwich zine.pdf
Creates zine_ocr.pdf
NB the (sub) commands it uses:
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmp019949/pdfsandwich_inputfilee1562e.pdf[13]"
Using the -verbose option gives much more output!
- Run it on two images…
See pdfsandwich.log
The following is the result of running the command:
pdfsandwich -verbose test.pdf
pdfsandwich version 0.1.7
Checking for convert:
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Checking for unpaper:
unpaper -V
6.1
Checking for tesseract:
tesseract -v
tesseract 4.0.0
leptonica-1.76.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Checking for gs:
gs -v
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
Checking for pdfinfo:
pdfinfo -v
Checking for pdfunite:
pdfunite -v
Input file: "zine.pdf"
Output file: "zine_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.
Parallel processing with 2 threads started.
Processing page order may differ from original page order.
Processing page 2.
Processing page 1.
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]"
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]"
convert -units PixelsPerInch -type Bilevel -density 300x300 "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm
convert -units PixelsPerInch -type Bilevel -density 300x300 "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62 -l eng pdf
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich70ba74 -l eng pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf
OCR done. Writing "zine_ocr.pdf"
pdfunite /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwich_outputc93999.pdf
zine_ocr.pdf generated.
Done.
Unpick each command, read the man pages:
```bash
unpaper --help
man unpaper
```
Picking apart the commands, we can find the tesseract command:
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62 -l eng pdf
What does this do??
- 1: Document each step of the process
- 2: Intervene to treat some pages differently…
- 3: Rewrite the process to allow editing of the files before producing the final pdf
Possibly use a makefile? (could be nice)
Tools
Other sources of scanned materials to choose from
- Factsheet 5
- https://archive.org/search.php?query=factsheet%20five
- https://archive.leftove.rs/documents/CLP
Platforms
- The Wayback Machine
- Project Gutenberg
- Wikisource … community portal and Introduction to editing
- Distributed proofreaders video
hocr2pdf
Reminder of what standard input (stdin) is… example pipeline. (TODO… maybe with pdftotext and grep!)
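In a pipeline, each command reads the previous command's stdout on its stdin. A minimal sketch (with an OCR'd zine you could do the same kind of thing with, say, `pdftotext zine_ocr.pdf - | grep -i punk`, where "punk" is an arbitrary search term):

```shell
# pipe text into grep via stdin and count the lines containing "zine"
printf 'zine one\npage two\nzine three\n' | grep -c zine
```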
and what about this one: https://github.com/eloops/hocr2pdf and this one: https://github.com/concordusapps/python-hocr
In a python3 venv:
- pip install lxml (first: sudo apt install libxml2-dev libxslt-dev)
Try adding other pages .. reordering pages…
(Eventually maybe look into pdftk)
ugh: hocr2pdf seems not to really work consistently
tools for deconstructing / editing pdf ???
todo ?! (instead of hocr) https://www.binpress.com/manipulate-pdf-python/
More bash commands, like ImageMagick, to create a title page… use pdftk or pdfunite to combine pages.
Use the scanner as input!!!!
Use scanimage (with brew ?!)
pdf2txt.py -t xml test_ocr.pdf