User:Riviera/Podcast rss: Difference between revisions
__TOC__ | |||
<span id="podcasts-rss-feeds-grep-and-sed"></span> | |||
= Podcasts are RSS feeds = | |||
<syntaxhighlight lang="fish">$ wget http://feeds.libsyn.com/330110/rss
</syntaxhighlight>
On Monday 2nd, we briefly discussed podcasts as the antithesis of radio: radio is live whereas podcasts are prerecorded, and the modes of engaging with the two differ. Podcasts are built upon Really Simple Syndication (RSS). In other words, in terms of code, there's little difference between the XML for a blog feed and the XML for a podcast feed. The following command combines a regular expression with the <code>grep</code> command to retrieve the lines containing some of the opening XML tags in the RSS feed for the ''Call Me Mother'' podcast.
<syntaxhighlight lang="fish">$ grep -E "<[[:alpha:]]+" rss | |||
</syntaxhighlight> | |||
This command prints results such as
<syntaxhighlight lang="xml"><pubDate>Fri, 02 Apr 2021 15:25:34 GMT</pubDate>
</syntaxhighlight>
<syntaxhighlight lang="xml"><title>Stephen Whittle</title>
</syntaxhighlight>
and
<syntaxhighlight lang="xml"><link>https://www.novel.audio</link>
</syntaxhighlight>
These tags also appear in RSS feeds for written blogs. However, the command also prints results such as
<syntaxhighlight lang="xml"><itunes:explicit>yes</itunes:explicit>
</syntaxhighlight>
and
<syntaxhighlight lang="xml"><acast:showId>62b087ec4f1d1f0014025b79</acast:showId>
</syntaxhighlight>
= Editing = | |||
What I want to retrieve is a list of the tags only. At the moment, I'm not interested in what's in between the tags. Ideally I'd like to use built in commands to generate a text which contains an outline of tags along the lines of the following listing. | |||
<syntaxhighlight lang="xml">1 <tag>
2 <subtag>
3 </subtag>
4 </tag>
</syntaxhighlight>
<span id="first-attempt"></span> | |||
== First Attempt == | |||
Can this be achieved by making two files, <code>open.txt</code> and <code>close.txt</code>? The first file should contain all the opening tags and the second all the closing tags.
<syntaxhighlight lang="bash">$ grep -Eon "<[[:alpha:]]+>" rss > open.txt
$ grep -Eon "</[[:alpha:]]+>" rss > close.txt
</syntaxhighlight>
It should then be possible to <code>cat</code> both files and <code>sort</code> the result by line number, producing the desired outcome.
<syntaxhighlight lang="shell">$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less
</syntaxhighlight>
<span id="second-attempt"></span> | |||
== Second Attempt ==
Whilst reading through the output of the first attempt (<code>sorted.txt</code>), I discovered that
# the <code></guid></code> tags had no corresponding opening tag
# the closing tags were sorted before the opening tags
<span id="problem-1-some-information-was-missing"></span> | |||
=== Problem 1: Some information was missing ===
I had already deliberately omitted <code>acast</code> and <code>itunes</code> tags. In doing so I worked on the assumption that there were no other relevant, colon-separated tags in the <code>rss</code> file. The opening <code>guid</code> tags were missing for a different reason: they carry an attribute (<code><guid isPermaLink="false"></code>), which the pattern <code><[[:alpha:]]+></code> cannot match. Fortunately, retrieving the additional data was a quick fix:
<syntaxhighlight lang="shell">$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss > open.txt
</syntaxhighlight>
Figuring out a way to sort the file such that the closing tags followed the opening tags was another matter. The general outline was there, but if I could sort out the details this could become a template for a podcast RSS generator. And that could potentially be useful in relation to Worm's sonic archive.
<span id="problem-2-adding-whitespace-with-sed"></span>
=== Problem 2: Adding whitespace with <code>sed</code> ===
The following command inserts a newline before the first instance of '</' on each line of the <code>rss</code> file.
<syntaxhighlight lang="shell">$ sed 's/<\//\n<\//' rss > rss-out
</syntaxhighlight>
Now it's time to write the <code>open.txt</code> and <code>close.txt</code> files. | |||
<syntaxhighlight lang="shell">$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss-out > open.txt | |||
$ grep -Eon "</[[:alpha:]]+>" rss-out > close.txt
</syntaxhighlight> | |||
Placing a newline before each closing tag increases the line number on which each closing tag appears. Now it's time to concatenate <code>open.txt</code> and <code>close.txt</code> and view the output. | |||
<syntaxhighlight lang="shell">$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less | |||
</syntaxhighlight> | |||
<span id="third-attempt"></span> | |||
== Third Attempt ==
On closer inspection of <code>sorted.txt</code> I noticed that more information was missing. Various closing anchor tags had no corresponding opening tag. I therefore had to write a regular expression capable of matching the attributes in these tags. This might make <code>sorted.txt</code> a little less readable, although there may be a way to tidy up this information. For completeness, it would also make sense to include the <code>acast</code> and <code>itunes</code> tags.
<syntaxhighlight lang="shell">$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>|<[[:alpha:]]+ [[:alpha:]]+([=\":/'.;_ ]|[[:alnum:]])*>|<itunes:[[:alpha:]]+>|<acast:[[:alpha:]]+>" rss-out > open.txt | |||
$ grep -Eon "</[[:alpha:]]+>|</itunes:[[:alpha:]]+>|</acast:[[:alpha:]]+>" rss-out > close.txt | |||
</syntaxhighlight> | |||
Then, as above:
<syntaxhighlight lang="shell">$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less | |||
</syntaxhighlight> | |||
<span id="fourth-attempt"></span> | |||
== Fourth Attempt ==
I refined the regular expression. It now matches every opening and closing tag. It is also necessary to pass the global flag (<code>g</code>) to the <code>sed</code> substitution command, so that every closing tag on a line is moved rather than only the first. The second <code>sed</code> command inserts a newline before every opening tag. This improves the structure of the output.
<syntaxhighlight lang="shell">sed 's/<\//\n<\//g' rss > rss-out
sed "s/<\([[:alpha:]][^>]*>\)/\n<\1/g" rss-out > rss-out-out
grep -Eon "<[[:alpha:]]([^>]*>)" rss-out-out > open.txt
grep -Eon "</([^>]*>)" rss-out-out > close.txt
cat open.txt close.txt | sort -n > sorted.txt
grep -Eo "<([^>]*>)" sorted.txt > skeleton.xml
chromium skeleton.xml
</syntaxhighlight>
The browser complained about unmatched horizontal rule tags. I deleted all of these using the query replace function in Emacs. The browser subsequently complained about <code><br></code> and <code><br /></code> tags, so I deleted all of these too. The result is what I wanted. It views well in the browser; I am able to collapse tags for an abbreviated overview of the document.
<span id="fleshing-out-the-skeleton"></span> | |||
= Fleshing out the skeleton =
<code>item</code> tags are the primary constituent parts of the <code>channel</code> tag in <code>skeleton.xml</code>. However, the channel tag also contains several other tags. I have decided to design the podcast generator around this feature. This raises questions about the possible options the command might take. How does the user provide input? There are three aspects to the overall design.
# Creating a system for generating items
# Creating a system for generating channel information
# Creating a system for adding items to a document containing channel information
'''Wishlist'''
* Create a system for updating channel information.
* <code>skeleton --add foo.mp3 bar.xml</code>
<span id="a-closer-look-at-items"></span> | |||
== A closer look at items ==
Each <code>item</code> tag is made up of 15 tags. Many of these belong to the <code>itunes</code> schema.
<syntaxhighlight lang="xml"><itunes:title>
<itunes:duration>
<itunes:explicit>
<itunes:episodeType>
<itunes:season>
<itunes:episode>
<itunes:image href="https://assets.pippa.io/shows/62b087ec4f1d1f0014025b79/show-cover.jpg"/>
<itunes:summary>
</syntaxhighlight>
Others belong to the <code>acast</code> schema.
<syntaxhighlight lang="xml"><acast:episodeId>
<acast:showId>
<acast:episodeUrl>
<acast:settings>
</syntaxhighlight>
And then there's the rest.
<syntaxhighlight lang="xml"><title>
<pubDate>
<enclosure url="https://sphinx.acast.com/p/open/s/62b087ec4f1d1f0014025b79/e/62f2226a61f23900137394a3/media.mp3" length="56168488" type="audio/mpeg"/>
<guid isPermaLink="false">
<description>
<link>
</syntaxhighlight>
I intend only to make a basic podcast generator, so I'm not going to focus on the itunes and acast schemas. | |||
<span id="the-script"></span> | |||
== The script == | |||
<span id="general-outline"></span> | |||
=== General Outline === | |||
The following shell script defines the function <code>skeleton</code>. It takes a directory as an argument, asks the user for some information and writes an xml file. | |||
<syntaxhighlight lang="fish">#!/usr/bin/fish | |||
function skeleton -d "Generate an RSS channel for a podcast" -a directory | |||
set -l showid (uuidgen); | |||
set -l options (fish_opt -s h -l help); | |||
argparse --name='skeleton' 'h/help' -- $argv; | |||
mkdir -p /tmp/skeleton/{$showid}/e/; | |||
set -l channelfile /tmp/skeleton/{$showid}/rss.xml; | |||
# write the channel data | |||
echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile; | |||
echo '<channel>' >> $channelfile; | |||
echo '<ttl>60</ttl>' >> $channelfile; | |||
echo '<generator>skeleton</generator>' >> $channelfile; | |||
read -l channelname -P "Channel Title: "; | |||
echo '<title>'$channelname'</title>' >> $channelfile; | |||
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile; | |||
    echo '<atom:link href="https://hub.xpub.nl/chopchop/worm/'$showid'" rel="self" type="application/rss+xml"/>' >> $channelfile;
echo '<language>en</language>' >> $channelfile; | |||
read -l channeldesc -P "Channel Description: "; | |||
echo '<description><![CDATA['$channeldesc']]></description>' >> $channelfile; | |||
echo '<image><url>https://hub.xpub.nl/chopchop/worm/'$showid'/image.jpg</url>' >> $channelfile; | |||
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile; | |||
echo '<title>'$channelname'</title>' >> $channelfile; | |||
echo '</image>' >> $channelfile; | |||
# write the item data | |||
for file in (ls $argv); | |||
set -l guid (sha256sum {$argv}/{$file} | grep -Eo "[[:alnum:]]{64}"); | |||
mkdir -p /tmp/skeleton/{$showid}/e/{$guid}; | |||
ln -s {$argv}/{$file} /tmp/skeleton/{$showid}/e/{$guid}/{$file}; | |||
set i (math $i + 1); | |||
set -l itemfile (printf '/tmp/skeleton/%s/e/%s/item-%i' $showid $guid $i); | |||
echo "<item>" > $itemfile; | |||
read -l title -P "Item $i Title: "; | |||
read -l desc -P "Item $i Description: "; | |||
echo "<title>"$title"</title>" >> $itemfile; | |||
echo "<pubDate>"(date)"</pubDate>" >> $itemfile; | |||
echo '<enclosure url="https://hub.xpub.nl/chopchop/worm/'$showid'/e/'$guid'/'$file'" length="'(soxi -D {$argv}/{$file})'" type="'(file -b --mime-type {$argv}/{$file})'"/>' >> $itemfile; | |||
sed -i 's/\\/\\/[^.]*\(\\/[^\\/]*.mp3\)/\1/' $itemfile; | |||
echo '<guid isPermaLink="false">'$guid'</guid>' >> $itemfile; | |||
set -l link (grep -Eo "https://[^.].mp3" $itemfile); | |||
echo '<link>'$link'</link>' >> $itemfile; | |||
echo '<description><![CDATA['$desc']]></description>' >> $itemfile; | |||
echo '</item>' >> $itemfile; | |||
end | |||
# concatenate items in reverse order and append to the channel file | |||
for item in (ls -t /tmp/skeleton/*/e/*/item-*); | |||
cat $item >> $channelfile; | |||
end | |||
# close the rss and channel tags | |||
echo '</channel>' >> $channelfile; | |||
echo '</rss>' >> $channelfile; | |||
end | |||
</syntaxhighlight> | |||
<span id="analysis-of-the-script"></span> | |||
=== Analysis of the script === | |||
<ol> | |||
<li><p>Lines 1 - 6</p> | |||
<syntaxhighlight lang="fish">#!/usr/bin/fish | |||
function skeleton -d "Generate an RSS channel for a podcast" -a directory | |||
set -l showid (uuidgen); | |||
set -l options (fish_opt -s h -l help); | |||
argparse --name='skeleton' 'h/help' -- $argv; | |||
mkdir -p /tmp/skeleton/{$showid}/e/; | |||
</syntaxhighlight> | |||
<p>The function <code>skeleton</code> is given a description and takes a <code>directory</code> as an argument. Next, two local variables are set: <code>showid</code> and <code>options</code>. The former holds the output of the command <code>uuidgen</code>. The latter is intended to let the user pass <code>-h</code> or <code>--help</code> to <code>skeleton</code> in order to display help information. I need to figure out how to enable more options than 'help' alone. <code>argparse</code> is a command which defines how the <code>skeleton</code> command takes both options and arguments. The script next calls the command <code>mkdir</code> to create a temporary file structure. The <code>showid</code> variable expands to the value which was set in line 3.</p>
<ol> | |||
<li><p>Improvements</p> | |||
<syntaxhighlight lang="fish">#!/usr/bin/fish | |||
function skeleton -d "Generate an RSS channel for a podcast" -a channel | |||
set -l options (fish_opt -s h -l help); | |||
set options $options (fish_opt --short=g --long=generate --required-val); | |||
set options $options (fish_opt --short=a --long=add --required-val --multiple-vals); | |||
argparse $options -- $argv; | |||
or return | |||
if set -ql _flag_help | |||
echo "skeleton [ -g DIR | -a FILE | -h ] CHANNEL | |||
-h --help Display this text | |||
  -g --generate    DIR  Generate an RSS feed for DIR
-a --add FILE Add an item to a channel | |||
Skeleton is a command line tool for generating and writing RSS feeds for podcasts." | |||
return 0 | |||
end | |||
    if set -ql _flag_generate; and set -ql _flag_add
echo 'ERROR: skeleton cannot add and generate simultaneously' | |||
return 1 | |||
end | |||
if set -ql _flag_generate | |||
# echo $_flag_generate | |||
end | |||
if set -ql _flag_add | |||
# echo $_flag_add | |||
# | |||
# | |||
    end
end | |||
</syntaxhighlight> | |||
<p>I have adjusted the code such that different flags create different ways of interacting with the command. I have added three flags, <code>-h</code>, <code>-g</code> and <code>-a</code>, in full <code>--help</code>, <code>--generate</code> and <code>--add</code>. These are stored in the <code>options</code> variable. The <code>add</code> flag can be passed to <code>skeleton</code> multiple times. Furthermore, <code>skeleton</code> now takes a channel as an argument. This is more flexible than taking a directory as an argument. I have made use of <code>if</code> statements to make the function do things under certain conditions. The first if statement relates to the <code>help</code> flag. If that flag is passed to <code>skeleton</code>, then text detailing how to use the command is displayed. The second if statement prevents the user from adding items to rss feeds and generating rss feeds simultaneously. <code>--add</code> and <code>--generate</code> become mutually exclusive options. The remaining if statements control what the programme does when the <code>generate</code> and <code>add</code> flags are passed to <code>skeleton</code>. Some lines have been omitted because it makes sense to place them elsewhere following these changes.</p></li></ol> | |||
</li> | |||
<li><p>Lines 7 - 15</p> | |||
<syntaxhighlight lang="fish">set -l channelfile /tmp/skeleton/{$showid}/rss.xml; | |||
# write the channel data | |||
echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile; | |||
echo '<channel>' >> $channelfile; | |||
echo '<ttl>60</ttl>' >> $channelfile; | |||
echo '<generator>skeleton</generator>' >> $channelfile; | |||
</syntaxhighlight> | |||
<p>Previously, the programme took a directory as an argument. Now it takes a channel as an argument. Taking a directory as an argument resembles the <code>generate</code> option. Consequently, much of the code from before can be placed within a newly written if statement. This extends to, with some adjustments, some of the lines which were omitted above.</p> | |||
<ol> | |||
<li><p>Improvements</p> | |||
<syntaxhighlight lang="fish">set -l showid (echo -n $_flag_generate | sha256sum | grep -Eo "[[:alnum:]]{64}");
mkdir -p /tmp/skeleton/{$showid};
</syntaxhighlight>
<p>Due to the resemblance described above, the remainder of the script will be placed inside the third if statement. Some changes can be made to the way in which the value of <code>showid</code> is determined. Rather than assigning a random id with <code>uuidgen</code>, the <code>sha256sum</code> of the value of <code>_flag_generate</code>, a directory path, is used. Because <code>sha256sum</code> cannot read a directory as if it were a file, the path itself is piped to it. The <code>grep</code> command ensures that only the hash is assigned to <code>showid</code>: the full output of <code>sha256sum</code> contains an extra file name field that needs to be filtered out. The code goes on to create a temporary directory named after the value of <code>showid</code>.</p></li>
<li><p>Keeping track of things for later</p> | |||
<p>Somehow, when using the <code>add</code> flag, <code>skeleton</code> will need to retrieve the value of <code>showid</code> for a given channel. This will ensure coherent and consistent output. I'm going to write a file which pairs the <code>showid</code> with the value of <code>_flag_generate</code>. The code for the <code>add</code> flag will consult this file when it sets the value of <code>showid</code>.</p>
<syntaxhighlight lang="fish">echo $_flag_generate $showid >> /tmp/skeleton/index | |||
</syntaxhighlight></li> | |||
<li><p>Further improvements</p> | |||
<syntaxhighlight lang="fish">set -l channelfile /tmp/skeleton/{$showid}/rss.xml; | |||
# write the channel data | |||
echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile; | |||
echo '<channel>' >> $channelfile; | |||
echo '<ttl>60</ttl>' >> $channelfile; | |||
echo '<generator>skeleton</generator>' >> $channelfile; | |||
</syntaxhighlight> | |||
<p>A <code>channelfile</code> is then set. Redirecting the output of <code>echo</code> to the value of <code>channelfile</code> creates the file and stores the output there.</p> | |||
<syntaxhighlight lang="fish">read -l channelname -P "Channel Title: "; | |||
echo '<title>'$channelname'</title>' >> $channelfile; | |||
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile; | |||
</syntaxhighlight> | |||
<p>Next, the user is prompted to provide a title for the channel. The input is stored in the variable <code>channelname</code>. <code>echo</code> is used to wrap the value of <code>channelname</code> in XML <code>title</code> tags. This line is appended to the <code>channelfile</code>.</p></li></ol>
</li> | |||
<li><p>Line 16</p> | |||
<p>The following line of code implicitly suggests a location for the overall output of the skeleton command. It currently reads</p> | |||
<syntaxhighlight lang="fish">echo '<atom:link href="https://hub.xpub.nl/chopchop/worm/'$showid'" rel="self" type="application/rss+xml"/>' >> $channelfile;
</syntaxhighlight> | |||
<p>This suggests that it will eventually be necessary to create the directory <code>/media/worm/radio/$showid</code> on chopchop. Doing so would intervene in the current structure of radio worm's archive and I am hesitant about doing so for two reasons:</p> | |||
<ol> | |||
<li>Space</li> | |||
<li>Risk Management</li></ol> | |||
<p>'''Space''' What is the output of the <code>skeleton</code> command when it takes a directory as an argument?</p> | |||
<pre>/tmp/skeleton/ | |||
/tmp/skeleton/$showid/ | |||
/tmp/skeleton/$showid/rss.xml | |||
/tmp/skeleton/$showid/e/ | |||
/tmp/skeleton/$showid/e/$guid/ | |||
/tmp/skeleton/$showid/e/$guid/item-n | |||
/tmp/skeleton/$showid/e/$guid/symbolic-link-to.mp3 | |||
</pre>
<p>The implication of line 16 is that these temporary files are eventually moved to <code>/media/worm/radio/</code>, which corresponds to the public URL <code>https://hub.xpub.nl/chopchop/worm/</code>. Adding files and directories to this location will take up space on the hard drive; perhaps more space than there is available.</p>
<p>'''Risk Management''' The shell is powerful and there's no undo function. Consequently, caution and care must be exercised when executing shell commands on files. Although chopchop has a backup of Radio Worm's sonic archive, it would be time consuming and undesirable to have to recopy data from the original hard drive in the event of data loss. Whilst this is unlikely as long as the <code>rm</code> command is not used, things can still go wrong. For this reason, it might be better to situate the final, non-temporary location of the files at a distance from Worm's archive itself, perhaps in the home folder of a user called 'podcasts', or on a separate hard drive.</p>
<ol> | |||
<li><p>Improvements</p> | |||
<p>In light of the above, it's evident that line sixteen is bound up with the final location of the generated files. I'll eventually place the <code>rss.xml</code> file in <code>~/public_html/podcasts/$showid/</code>. Therefore line 16 can be changed to</p> | |||
<syntaxhighlight lang="fish">echo '<atom:link href="https://hub.xpub.nl/chopchop/river/public_html/podcasts/'$showid'" rel="self" type="application/rss+xml"/>' >> $channelfile;
</syntaxhighlight></li></ol> | |||
</li> | |||
<li><p>Line 17</p> | |||
<p>I have set the channel language to English. This could be changed or made into a user defined variable. However, I will not make these changes here.</p> | |||
<syntaxhighlight lang="fish">echo '<language>en</language>' >> $channelfile; | |||
</syntaxhighlight></li> | |||
<li><p>Lines 18 - 23</p>
<p>The following lines do not need adjusting.</p> | |||
<syntaxhighlight lang="fish">read -l channeldesc -P "Channel Description: "; | |||
echo '<description><![CDATA['$channeldesc']]></description>' >> $channelfile; | |||
echo '<image><url>https://hub.xpub.nl/chopchop/worm/'$showid'/image.jpg</url>' >> $channelfile; | |||
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile; | |||
echo '<title>'$channelname'</title>' >> $channelfile; | |||
echo '</image>' >> $channelfile; | |||
</syntaxhighlight></li> | |||
<li><p>Lines 24 - 43</p> | |||
<p>The following for loop acts upon the result of calling <code>ls</code> on the value of <code>argv</code>. Currently, the value of <code>argv</code> is supposed to be a directory full of .mp3 files. The for loop uses the name of each file to write item data in an <code>xml</code> format for each mp3 file in a given directory.</p> | |||
<syntaxhighlight lang="fish"># write the item data | |||
for file in (ls $argv); | |||
</syntaxhighlight> | |||
<p>Each mp3 file is assigned a <code>guid</code> which is equivalent to the <code>sha256sum</code> of the mp3 file. This was Thijs' suggestion. Piping the result of calling <code>sha256sum</code> on the full path of the mp3 file through a grep command is required to retrieve the <code>sha256sum</code> alone. Notably, <code>$argv</code> is used here to prefix the full path to the value of the mp3 <code>file</code>. I intend to change the value of <code>argv</code> to a channel and consequently the way in which the value of <code>guid</code> is set will have to change.</p> | |||
<syntaxhighlight lang="fish">set -l guid (sha256sum {$argv}/{$file} | grep -Eo "[[:alnum:]]{64}"); | |||
</syntaxhighlight> | |||
<p>In any case, the value of <code>guid</code> is then used to name a directory for the item and to fill in the item's <code>guid</code> tag.</p>
<syntaxhighlight lang="fish">mkdir -p /tmp/skeleton/{$showid}/e/{$guid}; | |||
ln -s {$argv}/{$file} /tmp/skeleton/{$showid}/e/{$guid}/{$file}; | |||
set i (math $i + 1); | |||
set -l itemfile (printf '/tmp/skeleton/%s/e/%s/item-%i' $showid $guid $i); | |||
echo "<item>" > $itemfile; | |||
read -l title -P "Item $i Title: "; | |||
read -l desc -P "Item $i Description: "; | |||
echo "<title>"$title"</title>" >> $itemfile; | |||
echo "<pubDate>"(date)"</pubDate>" >> $itemfile; | |||
echo '<enclosure url="https://hub.xpub.nl/chopchop/worm/'$showid'/e/'$guid'/'$file'" length="'(soxi -D {$argv}/{$file})'" type="'(file -b --mime-type {$argv}/{$file})'"/>' >> $itemfile; | |||
sed -i 's/\\/\\/[^.]*\(\\/[^\\/]*.mp3\)/\1/' $itemfile; | |||
echo '<guid isPermaLink="false">'$guid'</guid>' >> $itemfile; | |||
set -l link (grep -Eo "https://[^.].mp3" $itemfile); | |||
echo '<link>'$link'</link>' >> $itemfile; | |||
echo '<description><![CDATA['$desc']]></description>' >> $itemfile; | |||
echo '</item>' >> $itemfile; | |||
end | |||
</syntaxhighlight> | |||
<syntaxhighlight lang="fish"> # concatenate items in reverse order and append to the channel file | |||
for item in (ls -t /tmp/skeleton/*/e/*/item-*); | |||
cat $item >> $channelfile; | |||
end | |||
# close the rss and channel tags | |||
echo '</channel>' >> $channelfile; | |||
echo '</rss>' >> $channelfile; | |||
end | |||
</syntaxhighlight></li></ol> |
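Two details of the item-writing loop above may be worth checking against the RSS 2.0 specification, which expects <code>pubDate</code> in RFC 822 format and the enclosure <code>length</code> in bytes. The script currently uses plain <code>date</code> and <code>soxi -D</code> (a duration in seconds). The following is only a sketch of one possible adjustment, written in POSIX shell rather than fish. It assumes a <code>date</code> that supports <code>-R</code> (GNU date does); the file name, the <code>example.org</code> URL and the helper <code>xml_escape</code> are hypothetical and not part of the original script.

```shell
#!/bin/sh
# xml_escape (hypothetical helper): escape characters that are special in XML text.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Stand-in for an mp3 file; the content is 16 bytes of placeholder text.
workdir=$(mktemp -d)
printf 'not really audio' > "$workdir/episode.mp3"

# Enclosure length in bytes (wc may pad its output, so strip spaces).
size=$(wc -c < "$workdir/episode.mp3" | tr -d ' ')

# RFC 822 style date, e.g. "Sat, 07 Oct 2023 21:48:00 +0000" (assumes date -R).
pubdate=$(LC_ALL=C date -R)

# Escape user-supplied text before placing it between tags.
title=$(printf '%s' 'Q&A <live>' | xml_escape)

item=$(printf '<item>\n<title>%s</title>\n<pubDate>%s</pubDate>\n<enclosure url="%s" length="%s" type="audio/mpeg"/>\n</item>' \
    "$title" "$pubdate" "https://example.org/episode.mp3" "$size")
printf '%s\n' "$item"
rm -r "$workdir"
```

Escaping the title matters only outside the <code>CDATA</code> sections that the script already uses for descriptions.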
Revision as of 21:48, 7 October 2023
Podcasts are RSS feeds
$ wget http://feeds.libsyn.com/330110/rss
On Monday 2nd, we briefly discussed podcasts as the antithesis of radio: radio is live whereas podcasts are prerecorded, and the modes of engaging with the two differ. Podcasts are built upon Really Simple Syndication (RSS). In other words, in terms of code, there's little difference between the XML for a blog feed and the XML for a podcast feed. The following command combines a regular expression with the grep command to retrieve the lines containing some of the opening XML tags in the RSS feed for the Call Me Mother podcast.
$ grep -E "<[[:alpha:]]+" rss
This command prints results such as
<pubDate>Fri, 02 Apr 2021 15:25:34 GMT</pubDate>
<title>Stephen Whittle</title>
and
<link>https://www.novel.audio</link>
These tags also appear in RSS feeds for written blogs. However, the command also prints results such as
<itunes:explicit>yes</itunes:explicit>
and
<acast:showId>62b087ec4f1d1f0014025b79</acast:showId>
Editing
What I want to retrieve is a list of the tags only. At the moment, I'm not interested in what's in between the tags. Ideally I'd like to use built in commands to generate a text which contains an outline of tags along the lines of the following listing.
1 <tag>
2 <subtag>
3 </subtag>
4 </tag>
First Attempt
Can this be achieved by making two files, open.txt and close.txt? The first file should contain all the opening tags and the second all the closing tags.
$ grep -Eon "<[[:alpha:]]+>" rss > open.txt
$ grep -Eon "</[[:alpha:]]+>" rss > close.txt
It should then be possible to cat both files and sort the result by line number, producing the desired outcome.
$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less
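To see what this first attempt produces without downloading a feed, the same three commands can be run on a tiny file. This is only an illustrative sketch: the four-line sample stands in for the rss file and the temporary directory is an assumption made for the example.

```shell
#!/bin/sh
# Work in a throwaway directory; the four-line file is made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' '<channel>' '<item>' '</item>' '</channel>' > "$workdir/rss"

# -o prints only the matched tag, -n prefixes it with its line number.
grep -Eon '<[[:alpha:]]+>' "$workdir/rss" > "$workdir/open.txt"
grep -Eon '</[[:alpha:]]+>' "$workdir/rss" > "$workdir/close.txt"

# Concatenate both lists and order them by the leading line number.
outline=$(cat "$workdir/open.txt" "$workdir/close.txt" | sort -n)
printf '%s\n' "$outline"
rm -r "$workdir"
```

Because every tag in the sample sits on its own line, the outline comes out in document order. The problems described next only appear when several tags share a line.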
Second Attempt
Whilst reading through the output of the first attempt (sorted.txt), I discovered that
- the </guid> tags had no corresponding opening tag
- the closing tags were sorted before the opening tags
Problem 1: Some information was missing
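I had already deliberately omitted acast and itunes tags. In doing so I worked on the assumption that there were no other relevant, colon-separated tags in the rss file. The opening guid tags were missing for a different reason: they carry an attribute (<guid isPermaLink="false">), which the pattern <[[:alpha:]]+> cannot match. Fortunately, retrieving the additional data was a quick fix: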
$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss > open.txt
Figuring out a way to sort the file such that the closing tags followed the opening tags was another matter. The general outline was there, but if I could sort out the details this could become a template for a podcast RSS generator. And that could potentially be useful in relation to Worm's sonic archive.
Problem 2: Adding whitespace with sed
The following command inserts a newline before the first instance of '</' on each line of the rss file.
$ sed 's/<\//\n<\//' rss > rss-out
Now it's time to write the open.txt and close.txt files.
$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss-out > open.txt
$ grep -Eon "</[[:alpha:]]+>" rss-out > close.txt
Placing a newline before each closing tag increases the line number on which each closing tag appears. Now it's time to concatenate open.txt and close.txt and view the output.
$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less
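As an aside, here is a sketch of an alternative that was not used above. When an opening and a closing tag share a line number, `sort -n` sees equal keys and falls back to comparing the whole line, where '/' sorts before letters in the C locale. GNU sort's `-s` (stable) flag disables that fallback, so ties keep their input order, with open.txt entries first. The sample data is made up and GNU sort is assumed.

```shell
#!/bin/sh
# Made-up sample in which <title> and </title> share line 2.
workdir=$(mktemp -d)
printf '%s\n' '<item>' '<title>Stephen Whittle</title>' '</item>' > "$workdir/rss"

grep -Eon '<[[:alpha:]]+>' "$workdir/rss" > "$workdir/open.txt"
grep -Eon '</[[:alpha:]]+>' "$workdir/rss" > "$workdir/close.txt"

# Plain numeric sort: ties are broken by comparing whole lines bytewise.
plain=$(cat "$workdir/open.txt" "$workdir/close.txt" | LC_ALL=C sort -n)

# Stable numeric sort: ties keep input order (opening tags were listed first).
stable=$(cat "$workdir/open.txt" "$workdir/close.txt" | LC_ALL=C sort -s -n)

printf 'plain:\n%s\nstable:\n%s\n' "$plain" "$stable"
rm -r "$workdir"
```

The stable variant only helps when each line holds an element that opens and closes on that line. A line such as `</item><item>` would still come out in the wrong order, which is one reason the newline approach above is more general.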
Third Attempt
On closer inspection of sorted.txt I noticed that more information was missing. Various closing anchor tags had no corresponding opening tag. I therefore had to write a regular expression capable of matching the attributes in these tags. This might make sorted.txt a little less readable, although there may be a way to tidy up this information. For completeness, it would also make sense to include the acast and itunes tags.
$ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>|<[[:alpha:]]+ [[:alpha:]]+([=\":/'.;_ ]|[[:alnum:]])*>|<itunes:[[:alpha:]]+>|<acast:[[:alpha:]]+>" rss-out > open.txt
$ grep -Eon "</[[:alpha:]]+>|</itunes:[[:alpha:]]+>|</acast:[[:alpha:]]+>" rss-out > close.txt
Then, as above
$ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less
Fourth Attempt
I refined the regular expression. It now matches every opening and closing tag. It is also necessary to pass the global flag (g) to the sed substitution command, so that every closing tag on a line is moved rather than only the first. The second sed command inserts a newline before every opening tag. This improves the structure of the output.
sed 's/<\//\n<\//g' rss > rss-out
sed "s/<\([[:alpha:]][^>]*>\)/\n<\1/g" rss-out > rss-out-out
grep -Eon "<[[:alpha:]]([^>]*>)" rss-out-out > open.txt
grep -Eon "</([^>]*>)" rss-out-out > close.txt
cat open.txt close.txt | sort -n > sorted.txt
grep -Eo "<([^>]*>)" sorted.txt > skeleton.xml
chromium skeleton.xml
The browser complained about unmatched horizontal rule tags. I deleted all of these using the query replace function in Emacs. The browser subsequently complained about <br> and <br /> tags, so I deleted all of these too. The result is what I wanted. It views well in the browser; I am able to collapse tags for an abbreviated overview of the document.
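For comparison, a shorter route to a similar skeleton is sketched below. It relies on the fact that `grep -o` prints each match on its own line in the order the matches occur, so no intermediate files or sorting are needed. It is not the method used above, and the two-line sample is made up.

```shell
#!/bin/sh
# Made-up sample with several tags per line and one tag carrying an attribute.
workdir=$(mktemp -d)
printf '%s\n' \
    '<channel><title>Call Me Mother</title>' \
    '<item><guid isPermaLink="false">abc</guid></item></channel>' > "$workdir/rss"

# Match "<" or "</", a letter, then everything up to the next ">".
skeleton=$(grep -Eo '</?[[:alpha:]][^>]*>' "$workdir/rss")
printf '%s\n' "$skeleton"
rm -r "$workdir"
```

Like the pipeline above, this treats any HTML inside a description (for instance `<br>`) as a tag, so the same clean-up would still be needed before the result is well-formed XML.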
Fleshing out the skeleton
item tags are the primary constituent parts of the channel tag in skeleton.xml. However, the channel tag also contains several other tags. I have decided to design the podcast generator around this feature. This raises questions about the possible options the command might take. How does the user provide input? There are three aspects to the overall design.
- Creating a system for generating items
- Creating a system for generating channel information
- Creating a system for adding items to a document containing channel information.
Wishlist
- Create a system for updating channel information.
- skeleton --add foo.mp3 bar.xml
A closer look at items
Each item tag is made up of 15 tags. Many of these belong to the itunes schema.
<itunes:title>
<itunes:duration>
<itunes:explicit>
<itunes:episodeType>
<itunes:season>
<itunes:episode>
<itunes:image href="https://assets.pippa.io/shows/62b087ec4f1d1f0014025b79/show-cover.jpg"/>
<itunes:summary>
Others belong to the acast schema.
<acast:episodeId>
<acast:showId>
<acast:episodeUrl>
<acast:settings>
And then there's the rest.
<title>
<pubDate>
<enclosure url="https://sphinx.acast.com/p/open/s/62b087ec4f1d1f0014025b79/e/62f2226a61f23900137394a3/media.mp3" length="56168488" type="audio/mpeg"/>
<guid isPermaLink="false">
<description>
<link>
I intend only to make a basic podcast generator, so I'm not going to focus on the itunes and acast schemas.
The script
General Outline
The following shell script defines the function skeleton. It takes a directory as an argument, asks the user for some information and writes an XML file.
#!/usr/bin/fish
function skeleton -d "Generate an RSS channel for a podcast" -a directory
    set -l showid (uuidgen);
    set -l options (fish_opt -s h -l help);
    argparse --name='skeleton' 'h/help' -- $argv;
    mkdir -p /tmp/skeleton/{$showid}/e/;
    set -l channelfile /tmp/skeleton/{$showid}/rss.xml;
    # write the channel data
    echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile;
    echo '<channel>' >> $channelfile;
    echo '<ttl>60</ttl>' >> $channelfile;
    echo '<generator>skeleton</generator>' >> $channelfile;
    read -l channelname -P "Channel Title: ";
    echo '<title>'$channelname'</title>' >> $channelfile;
    echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
    echo '<atom:link href="hub.xpub.nl/chopchop/worm/'$showid'" ref="self" type="application/rss+xml"/>' >> $channelfile;
    echo '<language>en</language>' >> $channelfile;
    read -l channeldesc -P "Channel Description: ";
    echo '<description><![CDATA['$channeldesc']]></description>' >> $channelfile;
    echo '<image><url>https://hub.xpub.nl/chopchop/worm/'$showid'/image.jpg</url>' >> $channelfile;
    echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
    echo '<title>'$channelname'</title>' >> $channelfile;
    echo '</image>' >> $channelfile;
    # write the item data
    for file in (ls $argv);
        set -l guid (sha256sum {$argv}/{$file} | grep -Eo "[[:alnum:]]{64}");
        mkdir -p /tmp/skeleton/{$showid}/e/{$guid};
        ln -s {$argv}/{$file} /tmp/skeleton/{$showid}/e/{$guid}/{$file};
        set i (math $i + 1);
        set -l itemfile (printf '/tmp/skeleton/%s/e/%s/item-%i' $showid $guid $i);
        echo "<item>" > $itemfile;
        read -l title -P "Item $i Title: ";
        read -l desc -P "Item $i Description: ";
        echo "<title>"$title"</title>" >> $itemfile;
        echo "<pubDate>"(date)"</pubDate>" >> $itemfile;
        echo '<enclosure url="https://hub.xpub.nl/chopchop/worm/'$showid'/e/'$guid'/'$file'" length="'(soxi -D {$argv}/{$file})'" type="'(file -b --mime-type {$argv}/{$file})'"/>' >> $itemfile;
        sed -i 's/\\/\\/[^.]*\(\\/[^\\/]*.mp3\)/\1/' $itemfile;
        echo '<guid isPermaLink="false">'$guid'</guid>' >> $itemfile;
        set -l link (grep -Eo "https://[^.].mp3" $itemfile);
        echo '<link>'$link'</link>' >> $itemfile;
        echo '<description><![CDATA['$desc']]></description>' >> $itemfile;
        echo '</item>' >> $itemfile;
    end
    # concatenate items in reverse order and append to the channel file
    for item in (ls -t /tmp/skeleton/*/e/*/item-*);
        cat $item >> $channelfile;
    end
    # close the rss and channel tags
    echo '</channel>' >> $channelfile;
    echo '</rss>' >> $channelfile;
end
Analysis of the script
Lines 1 - 6
#!/usr/bin/fish
function skeleton -d "Generate an RSS channel for a podcast" -a directory
    set -l showid (uuidgen);
    set -l options (fish_opt -s h -l help);
    argparse --name='skeleton' 'h/help' -- $argv;
    mkdir -p /tmp/skeleton/{$showid}/e/;
The function skeleton is given a description and takes a directory as an argument. Next, two local variables are set: showid and options. The former holds the output of the command uuidgen. The latter allows the user to pass -h or --help to skeleton, which will display help information. I need to figure out how to enable more options than 'help' alone. argparse is a command which defines how the skeleton command takes both options and arguments. The script next calls the command mkdir to create a temporary file structure. The showid variable expands to the value which was set in line 3.
Improvements
#!/usr/bin/fish
function skeleton -d "Generate an RSS channel for a podcast" -a channel
    set -l options (fish_opt -s h -l help);
    set options $options (fish_opt --short=g --long=generate --required-val);
    set options $options (fish_opt --short=a --long=add --required-val --multiple-vals);
    argparse $options -- $argv;
    or return
    if set -ql _flag_help
        echo "skeleton [ -g DIR | -a FILE | -h ] CHANNEL
    -h --help            Display this text
    -g --generate DIR    Generate an RSS feed for DIR
    -a --add FILE        Add an item to a channel
Skeleton is a command line tool for generating and writing RSS feeds for podcasts."
        return 0
    end
    # both flags present: refuse to continue
    if set -ql _flag_generate; and set -ql _flag_add
        echo 'ERROR: skeleton cannot add and generate simultaneously'
        return 1
    end
    if set -ql _flag_generate
        # echo $_flag_generate
    end
    if set -ql _flag_add
        # echo $_flag_add
    end
end
I have adjusted the code such that different flags create different ways of interacting with the command. I have added three flags, -h, -g and -a, in full --help, --generate and --add. These are stored in the options variable. The add flag can be passed to skeleton multiple times. Furthermore, skeleton now takes a channel as an argument. This is more flexible than taking a directory as an argument. I have made use of if statements to make the function do things under certain conditions. The first if statement relates to the help flag. If that flag is passed to skeleton, then text detailing how to use the command is displayed. The second if statement prevents the user from adding items to RSS feeds and generating RSS feeds simultaneously. --add and --generate become mutually exclusive options. The remaining if statements control what the programme does when the generate and add flags are passed to skeleton. Some lines have been omitted because it makes sense to place them elsewhere following these changes.
Lines 7 - 15
set -l channelfile /tmp/skeleton/{$showid}/rss.xml;
# write the channel data
echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile;
echo '<channel>' >> $channelfile;
echo '<ttl>60</ttl>' >> $channelfile;
echo '<generator>skeleton</generator>' >> $channelfile;
Previously, the programme took a directory as an argument. Now it takes a channel as an argument. Taking a directory as an argument resembles the generate option. Consequently, much of the code from before can be placed within a newly written if statement. This extends, with some adjustments, to some of the lines which were omitted above.
Improvements
set -l showid (echo $_flag_generate | sha256sum | grep -Eo "[[:alnum:]]{64}");
mkdir -p /tmp/skeleton/{$showid};
Due to the resemblance described above, the remainder of the script will be placed inside the third if statement. Some changes can be made to the way in which the value of showid is determined. Rather than assigning a random id with uuidgen, the sha256sum of the value of _flag_generate, a directory, is used. The grep command ensures that only the sha256sum is assigned to the value of showid. The full result of calling the command sha256sum on the value of _flag_generate contains extra information that needs to be filtered out. The code goes on to create a temporary directory using the value of the showid variable.
Keeping track of things for later
Somehow, when using the add flag, skeleton will need to retrieve the value of showid for a given channel. This will ensure coherent and consistent output. I'm going to write a file which pairs the showid with the value of _flag_generate. The code for the add flag will consult this file when it sets the value of showid.
echo $_flag_generate $showid >> /tmp/skeleton/index
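Consulting the index could look something like the sketch below. It is written in POSIX sh rather than fish and rests on several assumptions: lookup_showid is a hypothetical helper name, each index line has the form "directory showid" as written by the echo above, and directory names contain no spaces. If a directory was recorded more than once, the most recent entry wins.

```shell
# Hypothetical helper: print the showid recorded for a channel directory.
# $1 is the index file, $2 is the directory to look up.
lookup_showid() {
    awk -v dir="$2" '
        $1 == dir { id = $2 }             # remember the latest matching entry
        END { if (id != "") print id }    # print nothing when there is no match
    ' "$1"
}

# Usage with a made-up index file:
index=$(mktemp)
printf '%s\n' '/music/show-a aaa111' '/music/show-b bbb222' > "$index"
lookup_showid "$index" /music/show-b
```

Comparing the first field exactly, instead of using grep on the whole line, avoids matching /music/show-a when /music/show-a-live is looked up.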
Further improvements
set -l channelfile /tmp/skeleton/{$showid}/rss.xml;
# write the channel data
echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile;
echo '<channel>' >> $channelfile;
echo '<ttl>60</ttl>' >> $channelfile;
echo '<generator>skeleton</generator>' >> $channelfile;
A channelfile is then set. Redirecting the output of echo to the value of channelfile creates the file and stores the output there.
read -l channelname -P "Channel Title: ";
echo '<title>'$channelname'</title>' >> $channelfile;
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
Next, the user is prompted to provide a title for the channel. The input is stored in the variable channelname. echo is used to encase the value of channelname within XML title tags. This line of code is appended to the channelfile.
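One thing the echo line does not guard against is a title containing &, < or >, any of which would leave the XML ill-formed. The following is a minimal sketch of an escaping step, assuming POSIX sh and sed; xml_escape is a hypothetical helper and not part of skeleton.

```shell
# Hypothetical helper: escape the characters that are special in XML text.
# The ampersand must be handled first, otherwise the "&" introduced by
# "&lt;" and "&gt;" would be escaped a second time.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Usage with a made-up channel title:
printf '<title>%s</title>\n' "$(xml_escape 'Worm & Friends <live>')"
```

Wrapping the value in a CDATA section, as the script already does for descriptions, would be an alternative.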
Line 16
The following line of code implicitly suggests a location for the overall output of the skeleton command. It currently reads
echo '<atom:link href="hub.xpub.nl/chopchop/worm/'$showid'" ref="self" type="application/rss+xml"/>' >> $channelfile;
This suggests that it will eventually be necessary to create the directory /media/worm/radio/$showid on chopchop. Doing so would intervene in the current structure of Radio Worm's archive and I am hesitant about doing so for two reasons:
- Space
- Risk Management
Space
What is the output of the skeleton command when it takes a directory as an argument?
/tmp/skeleton/
/tmp/skeleton/$showid/
/tmp/skeleton/$showid/rss.xml
/tmp/skeleton/$showid/e/
/tmp/skeleton/$showid/e/$guid/
/tmp/skeleton/$showid/e/$guid/item-n
/tmp/skeleton/$showid/e/$guid/symbolic-link-to.mp3
The implication of line 16 is that eventually these temporary files are moved to /media/worm/radio/ which corresponds to the public URL https://hub.xpub.nl/chopchop/worm/. Adding files and directories to this location will take up space on the hard drive; perhaps more space than there is available.
Risk Management
The shell is powerful and there's no undo function. Consequently, caution and care must be exercised when executing shell commands on files. Whereas chopchop has a backup of Radio Worm's sonic archive, it would be time consuming and undesirable to have to recopy data from the original hard drive in the event of data loss. Whilst this is unlikely as long as the rm command is not used, things can still go wrong. For this reason, it might be better to situate the final, non-temporary location of the files at a distance from Worm's archive itself. Perhaps in the home folder of a user called 'podcasts', or on a separate hard drive.
Improvements
In light of the above, it's evident that line sixteen is bound up with the final location of the generated files. I'll eventually place the rss.xml file in ~/public_html/podcasts/$showid/. Therefore line 16 can be changed to
echo '<atom:link href="hub.xpub.nl/chopchop/river/public_html/podcasts/'$showid'" ref="self" type="application/rss+xml"/>' >> $channelfile;
Line 17
I have set the channel language to English. This could be changed or made into a user defined variable. However, I will not make these changes here.
echo '<language>en</language>' >> $channelfile;
Lines 18 - 23
The following lines do not need adjusting.
read -l channeldesc -P "Channel Description: ";
echo '<description><![CDATA['$channeldesc']]></description>' >> $channelfile;
echo '<image><url>https://hub.xpub.nl/chopchop/worm/'$showid'/image.jpg</url>' >> $channelfile;
echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
echo '<title>'$channelname'</title>' >> $channelfile;
echo '</image>' >> $channelfile;
Lines 24 - 43
The following for loop acts upon the result of calling ls on the value of argv. Currently, the value of argv is supposed to be a directory full of .mp3 files. The for loop uses the name of each file to write item data in an XML format for each mp3 file in a given directory.
# write the item data
for file in (ls $argv);
Each mp3 file is assigned a guid which is equivalent to the sha256sum of the mp3 file. This was Thijs' suggestion. Piping the result of calling sha256sum on the full path of the mp3 file through a grep command is required to retrieve the sha256sum alone. Notably, $argv is used here to prefix the full path to the value of the mp3 file. I intend to change the value of argv to a channel and consequently the way in which the value of guid is set will have to change.
    set -l guid (sha256sum {$argv}/{$file} | grep -Eo "[[:alnum:]]{64}");
In any case, the value of guid is utilised in the lines that follow.
    mkdir -p /tmp/skeleton/{$showid}/e/{$guid};
    ln -s {$argv}/{$file} /tmp/skeleton/{$showid}/e/{$guid}/{$file};
    set i (math $i + 1);
    set -l itemfile (printf '/tmp/skeleton/%s/e/%s/item-%i' $showid $guid $i);
    echo "<item>" > $itemfile;
    read -l title -P "Item $i Title: ";
    read -l desc -P "Item $i Description: ";
    echo "<title>"$title"</title>" >> $itemfile;
    echo "<pubDate>"(date)"</pubDate>" >> $itemfile;
    echo '<enclosure url="https://hub.xpub.nl/chopchop/worm/'$showid'/e/'$guid'/'$file'" length="'(soxi -D {$argv}/{$file})'" type="'(file -b --mime-type {$argv}/{$file})'"/>' >> $itemfile;
    sed -i 's/\\/\\/[^.]*\(\\/[^\\/]*.mp3\)/\1/' $itemfile;
    echo '<guid isPermaLink="false">'$guid'</guid>' >> $itemfile;
    set -l link (grep -Eo "https://[^.].mp3" $itemfile);
    echo '<link>'$link'</link>' >> $itemfile;
    echo '<description><![CDATA['$desc']]></description>' >> $itemfile;
    echo '</item>' >> $itemfile;
end
# concatenate items in reverse order and append to the channel file
for item in (ls -t /tmp/skeleton/*/e/*/item-*);
    cat $item >> $channelfile;
end
# close the rss and channel tags
echo '</channel>' >> $channelfile;
echo '</rss>' >> $channelfile;
end
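The per-file values discussed above can be tried out in isolation. The following is a minimal sketch in POSIX sh, not the fish function itself; item_values is a hypothetical helper name. It also reports the file size in bytes, because the RSS 2.0 specification describes the enclosure length attribute as a size in bytes, whereas soxi -D reports a duration in seconds.

```shell
# Hypothetical helper: print "guid length" for one media file.
# $1 is the path to the file.
item_values() {
    path=$1
    # sha256sum prints "HASH  FILENAME"; keep the hash only.
    guid=$(sha256sum "$path" | cut -d ' ' -f 1)
    # wc -c counts bytes; tr strips the padding some implementations add.
    length=$(wc -c < "$path" | tr -d ' ')
    printf '%s %s\n' "$guid" "$length"
}

# Usage with a made-up six-byte file standing in for an mp3:
sample=$(mktemp)
printf 'hello\n' > "$sample"
item_values "$sample"
```

Taking the first field with cut does the same job as the grep -Eo filter in the script; either way only the 64-character hash remains.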