I know there are plenty of great ways to read RSS, both within the browser and in dedicated RSS apps on any platform, from GNOME and KDE to OS X, and even scripts embedded in your favorite editor, Emacs. Still, there is always room for one more cool way to read the news from your favorite site.
So let's see what a handy shell script can do to parse that news and show us the latest from the internet.
We need to remind ourselves RSS is XML
That means we can lean on the usual XML tooling. It also helps that the schema is well known, so we can look for very specific tags and iterate through them.
The key tags in the RSS tree that we want to pay attention to are the following:
item
title
link
description
pubDate
The item tag wraps each news entry, and inside it we find the title, pubDate, description and link.
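To make that structure concrete, here is a minimal sketch that writes a one-item feed to a file so there is something to experiment with. The file location and the values are placeholders (the values mirror the example output shown further down), not a real feed.

```shell
# Write a tiny RSS 2.0 feed to a scratch file (hypothetical path) for testing.
feed="${TMPDIR:-/tmp}/sample.xml"

cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<title>RSS title</title>
<link>http://www.example.com/</link>
<description>RSS description</description>
<item>
<title>News item title</title>
<link>http://www.example.com/link/to/news/item/</link>
<description>Full news item text.</description>
<pubDate>Fri, 01 Jul 2016 17:41:07 +0000</pubDate>
</item>
</channel>
</rss>
EOF

# Count the lines that open an item; this feed has exactly one.
grep -c '<item>' "$feed"    # prints 1
```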
One of the main commands we will use in Bash is read. We need to prepare our code to understand the concept of tags: a tag is a word wrapped in the '<' and '>' symbols.
Time for some code
The read command is controlled by some important shell variables such as $IFS. The manual describes IFS as the *Internal Field Separator*, the set of characters read uses to split its input into fields. Setting local IFS='>' splits each chunk at the end of a tag name, and read -d '<' makes the start of the next tag the delimiter where reading stops.
We can wrap this in a function that reads one tag at a time.
identify_tags() {
local IFS='>'
read -d '<' TAG CONTENT
}
With this function, every call leaves the tag name in $TAG and the text that follows it in $CONTENT.
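To see the splitting in action without a whole feed, here is a small sketch that calls the function twice on a single hand-written line. The variables first and second are only there for the demonstration. Note that read -d and the <<< here-string are Bash features, so this needs bash rather than plain sh.

```shell
#!/usr/bin/env bash
# Same function as above, repeated so this snippet runs on its own
# (with -r added so backslashes in the input are kept as-is).
identify_tags() {
    local IFS='>'
    read -r -d '<' TAG CONTENT
}

# The group runs in the current shell, so the variables survive it.
{
    identify_tags; first="[$TAG][$CONTENT]"
    identify_tags; second="[$TAG][$CONTENT]"
} <<< '<title>RSS title</title>'

# Nothing precedes the first '<', so the first call yields empty values.
echo "$first"     # prints [][]
echo "$second"    # prints [title][RSS title]
```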
Now we loop over the file with cat and while, and use echo to see how many tags we catch.
cat "$1" | while identify_tags ; do
echo "<$TAG>{$CONTENT}"
done
Inside the loop we can now use $TAG and $CONTENT. Running our function through the loop gives output along these lines:
<>{}
<?xml version="1.0" encoding="UTF-8" ?>{}
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">{}
<channel>{}
<title>{RSS title}
</title>{}
<link>{http://www.example.com/}
</link>{}
<description>{RSS description}
</description>{}
<language>{en}
</language>{}
<item>{}
<title>{News item title}
</title>{}
<link>{http://www.example.com/link/to/news/item/}
</link>{}
<guid isPermaLink="true">{identifier-5f4b02697d2006f72648ebd0d9c6ce96}
</guid>{}
<description>{Full news item text.}
</description>{}
<pubDate>{Fri, 01 Jul 2016 17:41:07 +0000}
</pubDate>{}
</item>{}
</channel>{}
Every line now shows a tag followed by its value wrapped in braces, something like <description>{RSS description}.
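One caveat worth sketching: because IFS only contains '>', read keeps any newlines and indentation in $CONTENT, and $TAG keeps its attributes (as in the guid line above). The two helpers below, tag_name and trim, are hypothetical names and not part of the original script; they show one way to normalize both values using only parameter expansion.

```shell
# tag_name: drop attributes, keeping only the element name.
tag_name() {
    printf '%s' "${1%%[[:space:]]*}"
}

# trim: strip leading and trailing whitespace, including newlines.
trim() {
    local s=$1
    s="${s#"${s%%[![:space:]]*}"}"   # remove leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # remove trailing whitespace
    printf '%s' "$s"
}

tag_name 'guid isPermaLink="true"'; echo             # prints guid
trim "$(printf '\n  News item title \t')"; echo      # prints News item title
```

Inside the loop these could be applied as TAG=$(tag_name "$TAG") and CONTENT=$(trim "$CONTENT") before any matching takes place.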
Time to parse
So we have a general tag parser which isolates the content. But we still need to catch the tags we care about, so we need a filter, and not just one but several, as there are multiple tags. This is where case comes in, letting us handle each scenario separately.
We use the case command on our magic $TAG variable and match it against each of the tags we want. If you haven't used case in Bash, here is a quick description.
Bash case statements are generally used to simplify complex conditionals when you have multiple different choices. Using the case statement instead of nested if statements will help you make your bash scripts more readable and easier to maintain.
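As a quick illustration of the syntax, here is a tiny standalone sketch. The function classify_tag is a hypothetical helper used only for this example; the real script matches on $TAG directly.

```shell
# classify_tag: say what kind of tag we are looking at.
classify_tag() {
    case $1 in
        item)                 echo "start of item" ;;
        /item)                echo "end of item" ;;
        title|link|pubDate)   echo "field: $1" ;;   # several patterns, one branch
        *)                    echo "ignored" ;;     # default branch
    esac
}

classify_tag item       # prints start of item
classify_tag pubDate    # prints field: pubDate
classify_tag channel    # prints ignored
```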
Note: Bash has no switch keyword; case is its counterpart to the switch statement found in other languages.
With that aside out of the way, let's build our case. The key is the initial expression, $TAG, which is matched against each of our desired tags in turn.
case $TAG in
'item')
title=''
link=''
pubDate=''
description=''
;;
With this filter, whenever $TAG is item we reset title, link, pubDate and description, the variables that hold the fields of the item currently being read.
From there, we add a branch for each tag inside the item. For title it looks like this:
'title')
title="$CONTENT"
;;
Putting the branches together, we get the following construct:
cat "$1" | while identify_tags ; do
case $TAG in
'item')
title=''
link=''
pubDate=''
description=''
;;
'title')
title="$CONTENT"
;;
'link')
link="$CONTENT"
;;
'pubDate')
pubDate="$CONTENT"
;;
'description')
description="$CONTENT"
;;
'/item')
cat<<EOF
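Pro Tip: notice the << operator and EOF. This is a here-document: everything up to a closing EOF line is passed to cat as input, with variables expanded, so we can print a whole template once the filtering for an item is done. Pretty neat, hey!?!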
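If here-documents are new to you, this small sketch shows the difference between an unquoted and a quoted delimiter. The function names render_expanded and render_literal are made up for the example.

```shell
title='News item title'
link='http://www.example.com/link/to/news/item/'

# Unquoted delimiter: $link and $title are expanded inside the template.
render_expanded() {
    cat <<EOF
<a href="$link">$title</a>
EOF
}

# Quoted delimiter: the text is passed through literally, no expansion.
render_literal() {
    cat <<'EOF'
<a href="$link">$title</a>
EOF
}

render_expanded   # prints <a href="http://www.example.com/link/to/news/item/">News item title</a>
render_literal    # prints <a href="$link">$title</a>
```

The closing EOF has to sit at the very start of its line, which is why it is not indented.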
So now we can fuse this with our function and end up with this code:
#!/bin/bash
# $1 is the path to the RSS file. read -d and local need bash, not plain sh.
identify_tags () {
    local IFS='>'
    read -d '<' TAG CONTENT
}
cat "$1" | while identify_tags ; do
    case $TAG in
        'item')
            # a new item starts: reset every field
            title=''
            link=''
            pubDate=''
            datetime=''
            description=''
            ;;
        'title')
            title="$CONTENT"
            ;;
        'link')
            link="$CONTENT"
            ;;
        'pubDate')
            # convert pubDate format for <time datetime=""> (these are GNU date options)
            datetime=$( date --date "$CONTENT" --iso-8601=minutes )
            pubDate=$( date --date "$CONTENT" '+%D %H:%M%P' )
            ;;
        'description')
            # convert '&lt;' and '&gt;' to '<' and '>'
            description=$( echo "$CONTENT" | sed -e 's/&lt;/</g' -e 's/&gt;/>/g' )
            ;;
        '/item')
            # the item is complete: print it through the HTML template
            cat<<EOF
<article>
<h3><a href="$link">$title</a></h3>
<p>$description
<span class="post-date">posted on <time
datetime="$datetime">$pubDate</time></span></p>
</article>
EOF
            ;;
    esac
done
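The description branch deserves a closer look. Feeds commonly escape markup inside description as entities, and the sed call turns them back into real characters. Here is a small sketch of that step on its own; decode_entities is a hypothetical name, and handling &quot; and &amp; goes slightly beyond what the script above does. &amp; is decoded last so that an escaped entity is only unescaped once.

```shell
# decode_entities: read text on stdin, write it with basic entities decoded.
decode_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'    # \& is a literal ampersand in a sed replacement
}

printf '%s\n' '&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;' | decode_entities
# prints <b>Tom & Jerry</b>
```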
Customizations
This produces a nicely formatted bit of HTML. However, you could just as well produce a CSV file by replacing the markup in the template with commas or semicolons.
"$title", "$link", "$description"
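One thing to watch with CSV is that titles and descriptions may themselves contain quotes or commas. A hedged sketch of one common convention (double any embedded quote, then wrap the field in quotes) follows; csv_field and csv_row are hypothetical helpers, not part of the original script.

```shell
# csv_field: quote one value, doubling any embedded double quotes.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# csv_row: join all arguments into one comma-separated line.
csv_row() {
    local row='' field
    for field in "$@"; do
        row="${row}${row:+,}$(csv_field "$field")"
    done
    printf '%s\n' "$row"
}

csv_row 'He said "hi"' 'http://www.example.com/' 'a, b'
# prints "He said ""hi""","http://www.example.com/","a, b"
```

In the '/item' branch this would replace the here-document with csv_row "$title" "$link" "$description".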
Or forget about the title and link altogether: for something like a podcast, we might only be interested in the download links (here the tag is embed), and use curl or wget to download everything.
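#!/bin/bash
# $1 is the path to the RSS file. read -d and local need bash, not plain sh.
identify_tags () {
    local IFS='>'
    read -d '<' TAG CONTENT
}
cat "$1" | while identify_tags ; do
    case $TAG in
        'item')
            embed=''
            ;;
        'embed')
            embed="$CONTENT"
            ;;
        '/item')
            # download the file, resuming a partial download if one exists
            [ -n "$embed" ] && wget -c "$embed"
            ;;
    esac
done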
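The script above assumes the feed carries its download link as the content of an embed element. Many podcast feeds instead use the RSS 2.0 enclosure element, where the URL is an attribute rather than tag content, so it ends up inside $TAG. Here is a rough sketch of pulling that attribute out; enclosure_url is a hypothetical helper, and it only prints the URL so you can check it before handing it to wget -c.

```shell
# enclosure_url: print the url="..." attribute of an enclosure tag.
# Returns non-zero for any other tag.
enclosure_url() {
    local tag=$1 q='"'
    case $tag in
        enclosure*url=*)
            tag=${tag#*url=$q}            # drop everything up to the opening quote
            printf '%s\n' "${tag%%$q*}"   # keep the text before the closing quote
            ;;
        *)
            return 1
            ;;
    esac
}

enclosure_url 'enclosure url="http://www.example.com/ep1.mp3" length="123" type="audio/mpeg" /'
# prints http://www.example.com/ep1.mp3
```

In the loop this could become a default branch along the lines of: *) url=$(enclosure_url "$TAG") && embed="$url" ;;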
So there you go, a pretty nifty little script to get your latest podcast.
Let me know in the comments if you liked this or if you notice an issue; I will be happy to update it. I hope you learned a thing or two about how powerful good ol' Bash can be.