Digital zines I: PDF

From XPUB & Lens-Based wiki
Latest revision as of 15:52, 31 January 2020

PDF zining

Today: digital zine making with PDF (using the commandline and bash scripting)

A Zine Protocol?

In a blog post of January 2020, titled Non-realtime publishing for censorship resistance, Björn Edström proposes, perhaps ironically, zines as a novel technical protocol (or “class of services”) and an alternative to traditional online publishing, which he sees as fragile and too easily censored.

The main idea is that you have one or more owners. Think of an owner as a public/private keypair. The owner manipulate (offline) a collection of files (a “zine”). This collection is encrypted, signed and published periodically. It doesn’t matter where they are published, could be anywhere on the Internet, in an actual magazine, or maybe on a Tor hidden service. It doesn’t matter.

The users can perform actions on the zine by relaying (somehow) requests to the owners. The owners may then chose to apply the action and thus manipulate/update the collection, and then publish a new collection/zine for consumption by the users. In practice they will probably have a little application for manipulating the zine, like a command line tool, text editor, or something, that handles the formatting/marshalling.

Note: Marshalling, a term with a military origin, in a technical sense means packaging digital data in a form that can be transmitted and shared. (source)

Post-digital

In the post, Edström speaks to a technical audience, warning of the “centralizing” tendencies and fragility of many technical solutions, despite the fact that online publishing is often portrayed as inherently decentralized. You could see the post as part of a larger contemporary phenomenon: a backlash against many of the promises of the Web and the Internet (what could be called “Web 2.0”). In this case, a programmer realizes that the conventional wisdom (for instance, that the Internet is inherently decentralized and thus somehow impervious to censorship) does not hold, and recalls that earlier technologies, media and publishing practices may in fact already have done a better job. This movement could be seen as part of what’s called “post-digital”. What’s not clear in the end, however, is whether Edström is interested in the actual history of zine publishing, and whether he thinks specific lessons might be taken from this already established practice.

DIY: The rise of Lo-Fi Culture, Amy Spencer

Amy Spencer’s DIY: The rise of Lo-Fi Culture (pdf) tells the history of zines, which she defines as “non-commercial, small-circulation publications which are produced and distributed by their creators”.

Tom Jennings

Tom Jennings is an interesting figure at the intersection of two histories. First, he is one of the key authors of homocore, a queer zine described by Spencer. Second, Jennings is the author of the fidonet protocol and plays an important role in the history of early bulletin board systems. Jennings is featured in the “BBS documentary”, a project by Jason Scott of the Archive Team project.

bbs: baud bbs: sysops bbs: fidonet

Commandline

In addition to a certain DIY aesthetic, doing things on the commandline is quite powerful: whatever can be accomplished once, no matter how obscure the commands or how much effort it takes to get a script to work, can then become part of a larger program (say in a loop) and form the basis of novel publishing workflows.

Installing tools

For the recipes on this page, you'll need some tools, namely:

  • wget
  • convert & mogrify
  • tesseract
  • pdfunite

These can be installed with a package manager. On Debian or Ubuntu:

 sudo apt install wget imagemagick tesseract-ocr tesseract-ocr-eng poppler-utils

or, on macOS with Homebrew (where the formula is called tesseract rather than tesseract-ocr):

 brew install wget imagemagick tesseract poppler
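Before starting on the recipes, it can help to check which of the tools are actually available. The following is a minimal sketch; check_tools is a made-up helper name (not part of any package), and it only tests whether each command can be found on the PATH, not which version is installed.

```shell
#!/bin/bash
# check_tools is a hypothetical helper: it reports, for each name given,
# whether a command with that name can be found on the PATH.
check_tools() {
    local tool missing=0
    for tool in "$@"
    do
        if command -v "$tool" > /dev/null 2>&1
        then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# the commands used in the recipes on this page
check_tools wget convert mogrify tesseract pdfunite || echo "Install the missing tools before continuing."
```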

Getting help on a commandline program

Try running the program with either no options:

tesseract

or with "--help" or "-h"

tesseract --help

For more information check the man page:

man tesseract

BASH Loops

BASH is the so-called "Bourne again" shell -- aka the command line -- and it is also a scripting language.

In programming there are only 3 important cases: zero, one, and more than one.

When you have “more than one” thing – think loop. In BASH there are loops modelled on how they work in C.

#!/bin/bash
for (( c=1; c<=5; c++ ))
do
   echo "Welcome $c times"
done

You can also nest loops. It may be useful to think of a clock when nesting loops -- the outer loop is the "slower" of the two (hour hand) compared to the faster inner loop (minute hand).

for (( h=0; h<=23; h++ ))
do

	for (( m=0; m<=59; m++ ))
	do
	echo "The time is: $h:$m"
	done

done

Concrete example: Downloading an image from online

A useful commandline tool for downloading things that are online is wget. (See also: curl)

Let’s say we want to download all the images from homocore issue #1. They are (happily, but precariously) still available via links held in the “wayback” machine hosted by archive.org.

Downloading the images:

https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1.html

Following the link:

https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1/1.JPG

try with wget:

wget https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1/1.JPG

Exercise: Adapt the loop code to call wget multiple times, using the loop variable in the command to download each page of the issue.
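One possible way to start on this exercise is a "dry run" that only prints the wget commands, so you can check the URLs before downloading anything. This is a sketch under assumptions: page_url is a made-up helper name, and the page count of 5 is an arbitrary number chosen for illustration, not the real length of the issue.

```shell
#!/bin/bash
# address of issue #1 in the wayback machine (taken from the link above)
base=https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE/1

# page_url is a hypothetical helper: page number in, image URL out
page_url() {
    echo "$base/$1.JPG"
}

# 5 is an arbitrary page count for illustration
for (( p=1; p<=5; p++ ))
do
    # echo prints the command instead of running it
    echo wget "$(page_url "$p")"
done
```

Once the printed commands look right, removing the echo in front of wget turns the dry run into the real thing.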

Zero padding

Zero padding (or adding extra 0's to the start of a numeric filename) is a good practice as it allows numerically named files to be lexically sorted (ie alphabetically) and be in the same order as if they were numerically sorted. In other words, without zero padding, the default ordering of files (alphabetical) places "10.jpg" before "2.jpg", while "02.jpg" and "10.jpg" will sort (alphabetically) as expected.

   printf %04d 5

outputs

   0005
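As a small sketch of how this fits into a script: pad below is a made-up helper name. One thing to watch out for is that bash's printf reads numbers with a leading zero as octal, so a value such as 08 is rejected; writing 10# in front of the number inside $(( )) forces base 10.

```shell
#!/bin/bash
# pad is a hypothetical helper: "pad 5" prints 0005
pad() {
    # 10# tells bash to read the number as base 10,
    # so that an input like 08 is not mistaken for (invalid) octal
    printf '%04d\n' "$((10#$1))"
}

for n in 1 2 10
do
    echo "$(pad "$n").jpg"
done
```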

Command substitution

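Like variable substitution, but with a command: the command is run and its output is pasted into the command line in its place. You tend to see the `backticks` version more (it's less typing ;), although the $(command) form is easier to nest.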

There are two forms:

$(command)

and

`command`

Source: https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html

So, for example, instead of:

echo Hello $USER

you could use the *whoami* command for the same:

echo Hello `whoami`
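Command substitutions can also be placed inside one another, which is where the $(command) form is convenient because it needs no extra escaping. A minimal sketch, using a made-up file path for illustration:

```shell
#!/bin/bash
# a made-up path for illustration; the file does not need to exist
page=zines/01/01_02.jpg

# dirname strips the filename, basename then keeps the last folder name
issue=$(basename "$(dirname "$page")")

echo "Page $(basename "$page") belongs to issue $issue"
```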

Zero padding files after the fact

If you end up with poorly named files despite the above, you can fix them after the fact with the following command.

It uses the *rename* utility (the Perl version, which takes a regular expression) to replace each number in a filename with a zero-padded one:

rename 's/\d+/sprintf("%04d",$&)/e' foo*

Source: https://stackoverflow.com/questions/55754/how-to-zero-pad-numbers-in-file-names-in-bash#4561801
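If the Perl version of rename is not available, a plain bash loop can do a similar job. This is only a sketch under assumptions: pad_name is a made-up helper name, and the loop only handles files whose name is purely a number followed by .jpg (such as 7.jpg), skipping everything else.

```shell
#!/bin/bash
# pad_name is a hypothetical helper: "7.jpg" becomes "0007.jpg"
pad_name() {
    local name=$1
    local stem=${name%.*}      # the part before the last dot
    local ext=${name##*.}      # the part after the last dot
    # 10# forces base 10, so a stem like 08 is not read as octal
    printf '%04d.%s\n' "$((10#$stem))" "$ext"
}

for f in [0-9]*.jpg
do
    [ -e "$f" ] || continue                      # the pattern matched no files
    case ${f%.*} in *[!0-9]*) continue ;; esac   # skip names that are not purely numeric
    new=$(pad_name "$f")
    # mv -n refuses to overwrite an existing file
    [ "$f" = "$new" ] || mv -n -- "$f" "$new"
done
```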

stdout/stderr/stdin

In addition to shell scripts and command substitutions, creating Pipelines is another way to make novel workflows by composing multiple tools/commands. See also: stdout/stdin /stderr

Remember that:

  • > myfile ... means save stdout as myfile...
  • 2> myfile ... means save stderr as myfile...

Note that the 2 comes before the >. So using "2> /dev/null" means send stderr to "/dev/null" (in other words, throw it away) to hide the messages of a program (such as wget).
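To see the two output streams in action without downloading anything, here is a minimal sketch. noisy is a made-up function standing in for a chatty program such as wget; it writes one line to stdout and one to stderr.

```shell
#!/bin/bash
# noisy is a hypothetical stand-in for a program that writes to both streams
noisy() {
    echo "this goes to stdout"
    echo "this goes to stderr" >&2
}

noisy > /dev/null        # stdout is discarded, stderr is still shown
noisy 2> /dev/null       # stderr is discarded, stdout is still shown

# in a pipeline only stdout travels on to the next command
noisy 2> /dev/null | tr a-z A-Z
```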

Imagemagick!

See: ImageMagick

BASH tests

https://ss64.com/bash/test.html

A looping downloader with zero-padded filenames

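#!/bin/bash
# Download every page of issues 1 to 7 into zero-padded folders (01/01_01.jpg, ...)
base=https://web.archive.org/web/20041020175747/http://www.wps.com/archives/HOMOCORE
for (( i=1; i<=7; i++ ))
do
	issue=$(printf %02d $i)
	echo "Starting to download issue $i..."
	mkdir -p "$issue"
	for (( p=1; p<=100; p++ ))
	do
		outname="$issue/${issue}_$(printf %02d $p).jpg"
		echo "Downloading $outname"
		# 2> /dev/null hides wget's messages (stderr)
		wget "$base/$i/$p.JPG" -O "$outname" 2> /dev/null
		# -s tests that the file exists and is not empty;
		# an empty file means there was nothing to download
		if [ ! -s "$outname" ]
		then
			echo "No more pages..."
			rm -f "$outname"
			break
		fi
	done
done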

Resize all the images to be smaller

In the case that your images are too high resolution (which isn't actually the case with the images from the homocore zine), you can use imagemagick to resize them.

NB: the Mogrify command changes images in place (ie it destroys the originals). Just make a copy of your images first…

mkdir originals
cp *.JPG originals

then you could:

mogrify -resize 120x *.JPG

Alternatively, you could use convert in a loop:

mkdir -p icons
for i in *.JPG
do
	echo "making a thumbnail of $i"
	convert "$i" -resize 200x200 "icons/$i"
done

Make a single PDF of all the images

You can “bind” all the images together into a single (image only) PDF file with imagemagick:

convert *.JPG zine.pdf

Exercise:

Create a pdf with 100 numbered pages.

PDF

Use tesseract to convert a single jpeg into a pdf with searchable text

Tesseract has a particular way of being run:

 tesseract imagename outputbase [options...] [configfile...]

In particular, rather than giving an output filename, you need to give an output "base" name (the first part of the output file name), and then separately a "configuration" which basically defines what kind(s) of output you want to produce. So for instance to convert the single JPEG (01_01.jpg) into a PDF named (01_01.pdf), you would use the command:

 tesseract 01_01.jpg 01_01 pdf

NB: There is a SPACE, not a dot, between 01_01 and pdf. If you instead type:

 tesseract 01_01.jpg 01_01.pdf     # this is not right!

tesseract confusingly takes 01_01.pdf as the output "basename" and falls back to its default text output, producing a text file named:

01_01.pdf.txt

Joining pdfs (and preserving searchability)

Unfortunately, using imagemagick to join different pdf files together removes any non-image material (such as tesseract's OCR output). Luckily pdfunite (part of the poppler package) can join searchable pdfs while preserving this information!

To check pdfunite on just two particular files, you might try:

  pdfunite 01_01.pdf 01_02.pdf output.pdf

Then finally, you can use a wildcard to join all the single page pdfs together:

  pdfunite 01_*.pdf output.pdf

ocr script

Tesseract, as used here, processes one image at a time. This is not a bad thing: following the command-line aesthetic of "doing one thing well", it does its thing. It's then up to the intrepid shell scripter (that's you) to put tesseract commands into a loop to process a whole bunch of input files.

The following script combines the steps above in a loop, and finally joins the different pdfs into a single one (still with searchable text) using pdfunite:

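mkdir -p icons
for i in *.jpg
do
	echo "ocring $i..."
	# make a thumbnail in icons/ (as in the resize step above)
	convert "$i" -resize 200x200 "icons/$i"
	# the output base is the filename without .jpg; "pdf" selects PDF output
	tesseract "$i" "$(basename -s .jpg "$i")" pdf
done
# join the single page pdfs of issue 01 into one searchable pdf
pdfunite 01_*.pdf 01.pdf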
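A slightly more defensive variation of the same loop, as a sketch rather than the script used in the session: out_base is a made-up helper name that strips the .jpg ending with bash parameter expansion instead of calling basename, and the loop does nothing when there are no matching images (otherwise the unmatched pattern itself would be passed to tesseract).

```shell
#!/bin/bash
# out_base is a hypothetical helper: 01_01.jpg becomes 01_01
out_base() {
    echo "${1%.jpg}"
}

for i in 01_*.jpg
do
    [ -e "$i" ] || continue          # the pattern matched no files
    echo "ocring $i..."
    tesseract "$i" "$(out_base "$i")" pdf
done

# only join when at least one single page pdf exists
pages=(01_*.pdf)
if [ -e "${pages[0]}" ]
then
    pdfunite "${pages[@]}" 01.pdf
fi
```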

pdfsandwich

In the end we didn't use this, but pdfsandwich is a command that just calls a number of other commands to turn an input pdf (without text information) into a pdf with text (aka searchable). It makes use of:

  • imagemagick (convert): to extract images from a source PDF
  • unpaper: to "fix" / clean up a scanned image to work
  • tesseract: to do ocr and produce a single page pdf
  • pdfunite: to "rebind" the single page pdfs back into a multi-page pdf.

When you run the command in "verbose" mode, the script outputs a "trace" of the commands it's using (a bit like using the -x option on the bash command):

pdfsandwich -verbose test.pdf
pdfsandwich version 0.1.7
Checking for convert:
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP 
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Checking for unpaper:
unpaper -V
6.1
Checking for tesseract:
tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
Checking for gs:
gs -v
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
Checking for pdfinfo:
pdfinfo -v
Checking for pdfunite:
pdfunite -v
Input file: "zine.pdf"
Output file: "zine_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.

Parallel processing with 2 threads started.
Processing page order may differ from original page order.

Processing page 2.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" 
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" 
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62  -l eng pdf 
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich70ba74  -l eng pdf 
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf

OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf

OCR done. Writing "zine_ocr.pdf"
pdfunite /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwich_outputc93999.pdf

zine_ocr.pdf generated.

Done.

Other sources of scanned materials to choose from

Platforms


Followed by: Digital zines II: HTML and friends