Revision as of 17:30, 5 October 2023

Podcasts, RSS feeds, `grep`

$ wget http://feeds.libsyn.com/330110/rss

On Monday 2nd, we discussed how podcasts differ from radio: radio is live whereas podcasts are prerecorded, and the modes of engaging with each differ. Podcasts are built upon Really Simple Syndication (RSS). In other words, in terms of code, there's little difference between the XML for a blog feed and the XML for a podcast feed. The following command combines a regular expression with the grep command to print every line of the RSS feed for the Call Me Mother podcast that contains an opening XML tag.

$ grep -E "<[[:alpha:]]+" rss

This command prints results such as

<pubDate>Fri, 02 Apr 2021 15:25:34 GMT</pubDate>

<title>Stephen Whittle</title>  

and

<link>https://www.novel.audio</link>  

These tags also appear in RSS feeds for written blogs. However, the command also prints results such as

<itunes:explicit>yes</itunes:explicit>  

and

<acast:showId>62b087ec4f1d1f0014025b79</acast:showId>
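To see at a glance which tag names a feed uses, a similar expression can be combined with sort and uniq. This is only a minimal sketch, assuming GNU grep; sample.xml is a made-up stand-in for the downloaded rss file, and the character class is widened with a colon so that namespaced tags are kept whole.

```shell
# sample.xml is a hypothetical, hand-written stand-in for the downloaded feed
cat > sample.xml <<'EOF'
<item>
<title>Stephen Whittle</title>
<itunes:explicit>yes</itunes:explicit>
<link>https://www.novel.audio</link>
</item>
<item>
<title>Another episode</title>
</item>
EOF

# -o prints only the matching part of each line; the colon inside the
# bracket expression lets names such as itunes:explicit match in full
grep -Eo "<[[:alpha:]:]+" sample.xml | sort | uniq -c > tag-counts.txt
cat tag-counts.txt
```

Each line of tag-counts.txt holds a count followed by a tag name, so tags that belong to the itunes or acast schemas stand out immediately.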

Editing

What I want to produce is a list of the tags only. At the moment, I'm not interested in what's in between the tags. Ideally I'd like to use built-in commands to generate a text file containing an outline of the tags, along the lines of the following listing.

    <tag>
    <subtag>
    </subtag>
    </tag>

First Attempt

Can this be achieved by making two files, open.txt and close.txt? The first file should contain all the opening tags and the latter all the closing tags.

$ grep -Eon "<[[:alpha:]]+>" rss > open.txt
$ grep -Eon "</[[:alpha:]]+>" rss > close.txt 

It should then be possible to cat both files and sort the result by line number producing the desired outcome.

$ cat open.txt close.txt | sort -n > sorted.txt

Second Attempt

Whilst reading through the output of the first attempt (sorted.txt), I discovered two problems:

The </guid> tags had no corresponding opening tag.
The closing tags were sorted before the opening tags.
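The sorting problem can be reproduced on a tiny example. The sketch below assumes GNU grep and sort; sample.xml is a made-up two-line stand-in for the rss file. It also shows one possible alternative, a stable sort on the line-number field, which is not the approach taken in these notes and only helps when each line holds a single element.

```shell
# sample.xml is a hypothetical two-line stand-in for the rss file
printf '%s\n' '<title>Stephen Whittle</title>' \
    '<link>https://www.novel.audio</link>' > sample.xml

grep -Eon "<[[:alpha:]]+>" sample.xml > open.txt
grep -Eon "</[[:alpha:]]+>" sample.xml > close.txt

# Plain numeric sort: when two entries share a line number, sort falls back
# to comparing the whole line, and in the C locale "/" sorts before letters,
# so the closing tag ends up ahead of its opening tag
cat open.txt close.txt | LC_ALL=C sort -n > sorted.txt

# Stable sort (-s) on the first colon-separated field only: entries with the
# same line number keep their input order, and cat lists opening tags first
cat open.txt close.txt | sort -s -t: -k1,1n > sorted-stable.txt

cat sorted.txt
cat sorted-stable.txt
```

In sorted.txt the entry 1:</title> precedes 1:<title>, whereas sorted-stable.txt keeps the opening tag first.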

Problem 1: Some information was missing

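I had already deliberately omitted acast and itunes tags. In doing so I worked on the assumption that there were no other relevant, colon-separated tags in the rss file. The opening guid tags were missing for a different reason: they carry an attribute (isPermaLink="false"), and the expression only matched tags whose name is followed directly by '>'. Fortunately, retrieving the additional data was a quick fix: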

   $ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss > open.txt

Figuring out a way to sort the file such that the closing tags followed the opening tags was another matter. The general outline was there, but if I could sort out the details this could become a template for a podcast RSS generator. And that could potentially be useful in relation to Worm's sonic archive.

Problem 2: Adding whitespace with `sed`

The following command places a newline before the first instance of '</' on each line of the rss file.

   $ sed 's|</|\n</|' rss > rss-out

Now it's time to write the open.txt and close.txt files.

   $ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>" rss-out > open.txt
   $ grep -Eon "</[[:alpha:]]+>" rss-out > close.txt

Placing a newline before each closing tag increases the line number on which each closing tag appears. Now it's time to concatenate open.txt and close.txt and view the output.

   $ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less
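To check the effect of the extra newline, the same steps can be run on a small file. A minimal sketch, assuming GNU sed (for \n in the replacement) and GNU grep; sample.xml and sample-out are made-up names standing in for rss and rss-out.

```shell
# sample.xml is a hypothetical two-line stand-in for the rss file
printf '%s\n' '<title>Stephen Whittle</title>' \
    '<link>https://www.novel.audio</link>' > sample.xml

# move the first closing tag of every line onto a line of its own
sed 's|</|\n</|' sample.xml > sample-out

grep -Eon "<[[:alpha:]]+>" sample-out > open.txt
grep -Eon "</[[:alpha:]]+>" sample-out > close.txt

# every tag now has its own line number, so a numeric sort interleaves them
cat open.txt close.txt | sort -n > sorted.txt
cat sorted.txt
# prints:
# 1:<title>
# 2:</title>
# 3:<link>
# 4:</link>
```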

Third Attempt

On closer inspection of sorted.txt I noticed that more information was missing. Various closing anchor tags had no corresponding opening tag. I therefore had to write a regular expression capable of matching the additional data in these tags. This might make the output of sorted.txt a little less readable, although there may be a way to tidy up this information. For completeness, it would also make sense to include the acast and itunes tags.

   $ grep -Eon "<[[:alpha:]]+>|<[[:alpha:]]+ [^z-A]*>|<[[:alpha:]]+ [[:alpha:]]+([=\":/'.;_ ]|[[:alnum:]])*>|<itunes:[[:alpha:]]+>|<acast:[[:alpha:]]+>" rss-out > open.txt
   $ grep -Eon "</[[:alpha:]]+>|</itunes:[[:alpha:]]+>|</acast:[[:alpha:]]+>" rss-out > close.txt

Then, as above:

   $ cat open.txt close.txt | sort -n > sorted.txt && cat sorted.txt | less

Fourth Attempt

I refined the regular expression for the grep command. It now matches every opening and closing tag. It's also necessary to pass the global (g) flag to the sed substitution command so that every closing tag on a line, not just the first, is moved onto its own line. The second sed command places a newline before every opening tag as well. This improves the structure of the output.

    $ sed 's|</|\n</|g' rss > rss-out
    $ sed "s/<\([[:alpha:]][^>]*>\)/\n<\1/g" rss-out > rss-out-out
    $ grep -Eon "<[[:alpha:]]([^>]*>)" rss-out-out > open.txt
    $ grep -Eon "</([^>]*>)" rss-out-out > close.txt
    $ cat open.txt close.txt | sort -n > sorted.txt
    $ grep -Eo "<([^>]*>)" sorted.txt > skeleton.xml
    $ chromium skeleton.xml

The browser complained about unmatched horizontal rule (<hr>) tags, so I deleted all of these using the query replace function in Emacs. It subsequently complained about <br> and <br /> tags, so I deleted all of these too. The result is what I wanted. It views well in the browser; I am able to collapse tags for an abbreviated overview of the document.
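In hindsight, a single grep might produce a similar skeleton without the intermediate files, because -o prints matches in the order in which they appear. The following is only a sketch under that assumption, not the method used above; sample.xml is a made-up stand-in for the feed, and the second grep drops the void HTML tags that the browser complained about.

```shell
# sample.xml is a hypothetical stand-in; its description holds HTML markup
cat > sample.xml <<'EOF'
<item><title>Stephen Whittle</title><guid isPermaLink="false">abc</guid>
<description><p>One<br />two</p><hr></description></item>
EOF

# first grep: one tag per output line, opening and closing, in document order
# second grep: remove <br>, <br /> and <hr>, which have no closing tag
grep -Eo "</?[[:alpha:]][^>]*>" sample.xml \
    | grep -Ev "^<(br|hr)[ />]" > skeleton.xml
cat skeleton.xml
```

Because nothing is sorted, opening and closing tags that share a line stay in their original order.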

Fleshing out the skeleton

The item tags are the primary constituents of the channel tag in skeleton.xml. However, the channel tag also contains several other tags. I have decided to design the podcast generator around this feature. This raises questions about the options the command might take and about how the user provides input. There are three aspects to the overall design.

Creating a system for generating items
Creating a system for generating channel information
Creating a system for adding items to a document containing channel information.

Wishlist

Create a system for updating channel information.
skeleton --add foo.mp3 bar.xml
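One way the adding step might work is sketched below. The helper add_item is hypothetical and not part of any existing command. It takes a ready-made item file rather than an mp3, and it assumes that <item> and </channel> each sit on a line of their own, with the newest item placed first.

```shell
# add_item ITEM_FILE CHANNEL_FILE
# Hypothetical helper: copies ITEM_FILE into CHANNEL_FILE ahead of the
# existing items, so that the newest item comes first.
add_item() {
    awk -v item="$1" '
        # at the first <item> (or at </channel> when the feed has no items yet)
        # print the contents of the new item file before the current line
        !done && (/<item>/ || /<\/channel>/) {
            while ((getline line < item) > 0) print line
            done = 1
        }
        { print }
    ' "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

# made-up channel and item files for demonstration
printf '%s\n' '<rss version="2.0">' '<channel>' '<title>Show</title>' \
    '<item>' '<title>Old episode</title>' '</item>' \
    '</channel>' '</rss>' > bar.xml
printf '%s\n' '<item>' '<title>New episode</title>' '</item>' > new-item.xml

add_item new-item.xml bar.xml
cat bar.xml
```

Writing to a temporary file and renaming it afterwards avoids truncating the channel file while awk is still reading it.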

A closer look at items

Each item tag is made up of 18 tags. Many of these belong to the itunes schema.

    <itunes:title>
    
    <itunes:duration>
    
    <itunes:explicit>
    
    <itunes:episodeType>
    
    <itunes:season>
    
    <itunes:episode>
    
    <itunes:image href="https://assets.pippa.io/shows/62b087ec4f1d1f0014025b79/show-cover.jpg"/>
    
    <itunes:summary>

Others belong to the acast schema.

    <acast:episodeId>
    
    <acast:showId>
    
    <acast:episodeUrl>
    
    <acast:settings>

And then there's the rest.

    <title>
    
    <pubDate>
    
    <enclosure url="https://sphinx.acast.com/p/open/s/62b087ec4f1d1f0014025b79/e/62f2226a61f23900137394a3/media.mp3" length="56168488" type="audio/mpeg"/>
    
    <guid isPermaLink="false">
    
    <description>
    
    <link>

I intend only to make a basic podcast generator, so I'm not going to focus on the itunes and acast schemas.
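The grouping above could also be derived from the skeleton itself. The sketch below assumes GNU sed and grep; skeleton-sample.xml is a shortened, made-up stand-in for skeleton.xml, and the example.org URL in it is a placeholder.

```shell
# skeleton-sample.xml is a hypothetical, shortened stand-in for skeleton.xml
cat > skeleton-sample.xml <<'EOF'
<channel>
<title>
</title>
<item>
<title>
</title>
<itunes:title>
</itunes:title>
<acast:episodeId>
</acast:episodeId>
<enclosure url="https://example.org/media.mp3" length="1" type="audio/mpeg"/>
</item>
<item>
<title>
</title>
</item>
</channel>
EOF

# print the first <item> block only (q stops sed after its closing tag),
# keep the opening tag names and drop the <item> wrapper itself
sed -n '/<item>/,/<\/item>/{p;/<\/item>/q;}' skeleton-sample.xml \
    | grep -Eo "^<[[:alpha:]:]+" | grep -v "^<item$" > item-tags.txt

grep -c "^<itunes:" item-tags.txt   # tags from the itunes schema
grep -c "^<acast:" item-tags.txt    # tags from the acast schema
grep -vc ":" item-tags.txt          # everything else
```

Run against the real skeleton.xml, the three counts would show how much of an item depends on the itunes and acast schemas.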

The item generator script

The following shell script defines the function skeleton. It takes a directory as an argument, asks the user for some information and writes an XML file. The directory should contain mp3 files. Each mp3 file corresponds to one item in the channel. The script relies on uuidgen, sha256sum and file being installed.

    #!/usr/bin/fish
    function skeleton -d "Generate an RSS channel for a podcast" -a directory
        set -l showid (uuidgen);
        set -l options (fish_opt -s h -l help);
        argparse --name='skeleton' 'h/help' -- $argv;
        # resolve the directory to an absolute path so the symlinks below stay valid
        set -l dir (realpath $argv[1]);
        mkdir -p /tmp/skeleton/{$showid}/e/;
        set -l channelfile /tmp/skeleton/{$showid}/rss.xml;
        # write the channel data
        echo '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">' > $channelfile;
        echo '<channel>' >> $channelfile;
        echo '<ttl>60</ttl>' >> $channelfile;
        echo '<generator>skeleton</generator>' >> $channelfile;
        read -l channelname -P "Channel Title: ";
        echo '<title>'$channelname'</title>' >> $channelfile;
        echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
        echo '<atom:link href="https://hub.xpub.nl/chopchop/worm/'$showid'" rel="self" type="application/rss+xml"/>' >> $channelfile;
        echo '<language>en</language>' >> $channelfile;
        read -l channeldesc -P "Channel Description: ";
        echo '<description><![CDATA['$channeldesc']]></description>' >> $channelfile;
        echo '<image><url>https://hub.xpub.nl/chopchop/worm/'$showid'/image.jpg</url>' >> $channelfile;
        echo '<link>https://hub.xpub.nl/chopchop/worm/</link>' >> $channelfile;
        echo '<title>'$channelname'</title>' >> $channelfile;
        echo '</image>' >> $channelfile;
        # write the item data
        set -l i 0;
        for file in (ls $dir);
            # the sha256 checksum of the audio file doubles as the item's guid
            set -l guid (sha256sum {$dir}/{$file} | grep -Eo "[[:alnum:]]{64}");
            mkdir -p /tmp/skeleton/{$showid}/e/{$guid};
            ln -s {$dir}/{$file} /tmp/skeleton/{$showid}/e/{$guid}/{$file};
            set i (math $i + 1);
            set -l itemfile (printf '/tmp/skeleton/%s/e/%s/item-%i' $showid $guid $i);
            echo "<item>" > $itemfile;
            read -l title -P "Item $i Title: ";
            read -l desc -P "Item $i Description: ";
            echo "<title>"$title"</title>" >> $itemfile;
            # RSS expects an RFC 822 date, which date -R prints
            echo "<pubDate>"(date -R)"</pubDate>" >> $itemfile;
            # the length attribute of an enclosure is the file size in bytes
            echo '<enclosure url="https://hub.xpub.nl/chopchop/worm/'$showid'/e/'$guid'/'$file'" length="'(wc -c < {$dir}/{$file})'" type="'(file -b --mime-type {$dir}/{$file})'"/>' >> $itemfile;
            sed -i 's/\\/\\/[^.]*\(\\/[^\\/]*.mp3\)/\1/' $itemfile;
            echo '<guid isPermaLink="false">'$guid'</guid>' >> $itemfile;
            # reuse the enclosure URL as the item link
            set -l link (grep -Eo 'https://[^"]*\.mp3' $itemfile);
            echo '<link>'$link'</link>' >> $itemfile;
            echo '<description><![CDATA['$desc']]></description>' >> $itemfile;
            echo '</item>' >> $itemfile;
        end
        # concatenate this show's items, newest first, and append to the channel file
        for item in (ls -t /tmp/skeleton/{$showid}/e/*/item-*);
            cat $item >> $channelfile;
        end
        # close the rss and channel tags
        echo '</channel>' >> $channelfile;
        echo '</rss>' >> $channelfile;
    end

The final step is to move the files in /tmp/skeleton/ to a more permanent location.
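How that move might look is sketched below. The helper publish and both of its paths are hypothetical. The one detail worth noting is that the audio files under /tmp/skeleton/ are symlinks, so a copy that follows links (cp -L) is assumed here in order to end up with real files at the destination.

```shell
# publish SHOW_DIR DEST_DIR
# Hypothetical helper; both paths are placeholders chosen by the caller.
# -R copies the directory tree, -L follows symlinks so real files are copied.
publish() {
    mkdir -p "$2" && cp -RL "$1" "$2"/
}

# demonstration with throwaway directories instead of /tmp/skeleton
work=$(mktemp -d)
mkdir -p "$work/audio" "$work/show/e/abc"
echo "not really audio" > "$work/audio/foo.mp3"
ln -s "$work/audio/foo.mp3" "$work/show/e/abc/foo.mp3"

publish "$work/show" "$work/public"
ls -l "$work/public/show/e/abc"
```

After the copy, foo.mp3 in the destination is a regular file, so the feed keeps working even if the original directory is later moved.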

User:Riviera/Podcast rss: Difference between revisions
