PHP Script for RSS auto-discovery and OPML file generation

Sun 07 Mar 2010

Hey All,

I recently got a reasonably sized list of blog URLs and wanted to import them all into a feed reader (via OPML). There seemed to be a lack of conversion scripts for batch URL -> find RSS link -> feed reader import file (I may be wrong, please let me know if I am :), so I made one in PHP. I guess this is like an automatic blogroller. I have only used this as a command line script and I'm not recommending you use it in 'the wild', as one might say: I have made little concession to security as I had a trusted list of URLs.

There are basically three steps to this:

  1. Take an input file of newline separated URLs, in my case blogs.
  2. Find (auto-discover) associated RSS feed of each blog URL
  3. Output an OPML file that you can use to import into a feed reader

What it does:

  • Takes a well formed list of newline separated URLs of blogs and turns it into an OPML file
  • If the URL source doesn't contain a <link> to an RSS feed in the head, it doesn't add it to the OPML
  • Detects the <title> and adds that to the OPML text field, or uses the feed URL if <title> isn't present

What it doesn't:

  • Check the RSS feed is valid XML
  • Any other checking really :)
  • Come with any sort of warranty/guarantee

Some of the key functions are from Keith Devens' work. Thanks.

Without any further ado, here is the script:

<?php
/*
 * @author @skinofstars Kevin Carmody
 * GPLv3 - http://www.gnu.org/copyleft/gpl.html
 *
 * this is really a command line app with no flags
 * for turning a bunch of urls into an OPML file
 *
 * 1.takes input file of newline separated urls, normally blogs
 * 2.finds (autodiscovery) associated rss of each url
 * 3.outputs an OPML file for you to use in a feed reader
 */

// file config
$inputFile = "/path/to/URLlist.txt";
$outputFile = "/path/to/blogroll.opml";

// OPML config
$opmlTitle = "Some Select Blogs";
$opmlOwnerName = "Kevin Carmody";
$opmlOwnerEmail = "[email protected]";

/** no need to edit after this :) **/
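$inHandle = @fopen($inputFile, "r");//read-only
$outHandle = @fopen($outputFile, "w");//overwrite, so a re-run doesn't append a second OPML document

if ($inHandle && $outHandle) {
    $headerOut = opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail);
    fwrite($outHandle,$headerOut);

    while (($buffer = fgets($inHandle, 4096)) !== false) {
        $buffer = trim($buffer);//strip the trailing newline before using the line as a URL
        if ($buffer === '') {
            continue;//skip blank lines
        }
        $source = getFile($buffer);
        $rssURL = getRSSLocation($source, $buffer);
        if($rssURL){
            //escape for use in XML attributes; htmlentities() would emit HTML-only entities such as &eacute;
            $rssURL = htmlspecialchars($rssURL, ENT_QUOTES, 'ISO-8859-1');
            $rssTitle = htmlspecialchars(trim((string) getTitleAlt($source)), ENT_QUOTES, 'ISO-8859-1');
            //fall back to the feed URL when the page has no <title>
            $entryOut = opmlEntry($rssURL, $rssTitle !== '' ? $rssTitle : $rssURL);
            fwrite($outHandle,$entryOut);
            //echo ".";//uncomment to print a dot to screen on each success, nice for seeing progress
        } else {
            echo "Fail on: ".$buffer."\n";
        }
    }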
    $footerOut = opmlFooter();
    fwrite($outHandle,$footerOut);

    fclose($inHandle);
    fclose($outHandle);
} else {
    if(!$inHandle){
        echo 'not got a handle on input file: '.$inputFile."\n";
        die;
    }
    if(!$outHandle){
        echo 'not got a handle on output file: '.$outputFile."\n";
        die;
    }
}

echo "\nAll done :)\n";

/**
 * basic opml header
 * @param string $opmlTitle
 * @param string $opmlOwnerName
 * @param string $opmlOwnerEmail
 * @return string
 */
function opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail){
    $oheader = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
    ."<opml version=\"1.1\">\n"
    ."  <head>\n"
    ."      <title>".$opmlTitle."</title>\n"
    ."      <dateCreated>".date("r")."</dateCreated>\n"
    ."      <ownerName>".$opmlOwnerName."</ownerName>\n"
    ."      <ownerEmail>".$opmlOwnerEmail."</ownerEmail>\n"
    ."      </head>\n"
    ."  <body>\n";
    return $oheader;
}

/**
 * just returns a test footer
 * @return string
 */
function opmlFooter(){
    $ofooter = "  </body>\n"
    ."</opml>";
    return $ofooter;
}

/**
 * creates an XML entry for the OPML file
 * @param string $feedURL
 * @param string $feedTitle
 * @return string
 */
function opmlEntry($feedURL,$feedTitle){
    $outline = "    <outline text=\"".$feedTitle."\" type=\"rss\" xmlUrl=\"".$feedURL."\"/>\n";
    return $outline;
}

/**
 * returns the page title extracted from source
 * @param string $html
 * @return string
 */
function getTitleAlt($html) {
    if (preg_match('/<title>(.*?)<\/title>/is',$html,$found)) {
        $title = $found[1];
        return $title;
    } else {
        return;
    }
}

/**
 * http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
 * public domain
 */
function getFile($location){
    $ch = curl_init($location);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Connection: close'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

/**
 * http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
 * public domain
 */
function getRSSLocation($html, $location){
    if(!$html or !$location){
        return false;
    }else{
        #search through the HTML, save all <link> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            $final_link = array(); #reset, so attributes don't leak over from the previous <link>
            $attributes = preg_split('/\s+/s', $links[$n]);
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        #now figure out which one points to the RSS file
        for($n=0; $n<$link_count; $n++){
            $link = $final_links[$n];
            $rel = isset($link['rel']) ? strtolower($link['rel']) : '';
            $type = isset($link['type']) ? strtolower($link['type']) : '';
            $href = isset($link['href']) ? $link['href'] : '';
            #'text/xml' is a kludge to make the first version of this still work
            if($rel == 'alternate' and $href and ($type == 'application/rss+xml' or $type == 'text/xml')){
                if(preg_match('#^https?://#i', $href)){ #if it's absolute
                    $full_url = $href;
                }else{ #otherwise, 'absolutize' it
                    $url_parts = parse_url($location);
                    #reuse the scheme of the page we fetched, defaulting to http
                    $scheme = isset($url_parts['scheme']) ? $url_parts['scheme'] : 'http';
                    $full_url = "$scheme://$url_parts[host]";
                    if(isset($url_parts['port'])){
                        $full_url .= ":$url_parts[port]";
                    }
                    if($href[0] != '/'){ #it's a relative link on the domain
                        $path = isset($url_parts['path']) ? $url_parts['path'] : '/';
                        #a path ending in '/' is already a directory, otherwise strip the file name
                        $full_url .= (substr($path, -1) == '/') ? $path : dirname($path);
                        if(substr($full_url, -1) != '/'){
                            #if the last character isn't a '/', add it
                            $full_url .= '/';
                        }
                    }
                    $full_url .= $href;
                }
                return $full_url;
            }
        }
        return false;
    }
}

Though this was really a one time hit for me it may well be useful to others. Please let me know if you can think of ways to improve it and I will update accordingly.

Thanks, Kevin

Moving hosted SVN, the trials and the tribulations

Sun 07 Feb 2010

Over the last few weeks Mike Robinson and I have discussed and decided on an SVN restructuring to improve our build and deployment processes. I would encourage you to read a bit more about that (and various other geekness) at his blog.

So I've spent this week moving our company hosted SVN from Beanstalk to Springloops. I feel I've been swinging between hell and zen, but the learning has been awesome. As a summary of what I've found I thought I'd give a quick walk through how I did it.

Most of this stuff was the usual dump/load cycle, but there are a couple of things which needed some extra attention.

Firstly, both Beanstalk and Springloops have the ability to export and import SVN dumps via easy-to-use web interfaces. This really could be as easy as download, upload. Try that first.

We had a couple of problems though. Previously we had a mishmash of company repos and project repos; these had to be merged and sorted. We also had different usernames on each system(!) which meant that during an import previous commits were not matched to current system users.  The author attribute needed to be updated for all previous revisions.

This was all done on OS X, but should be applicable to any Unix-like with the appropriate libraries, etc.  So we've got our dump from Beanstalk, now we just need to create a local repository to do our work on (always work on a backup!!).

$ svnadmin create --pre-1.4-compatible newrepo
We use the pre-1.4-compatible flag to avoid problems with file system format changes between SVN versions. These changes can cause errors (svn: Expected FS format '2'; found format '3') when propset-ting revision properties, in my case author/committer names.

Next job, import your dump file.

$ svnadmin load newrepo < dumpfile
If you're looking to do the merging, as I was, then you want to make yourself a directory in your repository (usual 'svn mkdir' commands) and then load it in the following fashion:
$ svnadmin load newrepo --parent-dir myfolder < seconddumpfile
Ok, we've done our merging, now we're going to update our author histories. The SVN manual gives you information on doing this one revision at a time with a propset. It also talks about other recursive actions such as deleting files, which aren't our concern. For changing authors, I found a tidy script called svn-author-tweak.py from CollabNet.

If you want to give your repository a check before you upload it, just checkout to a local test.

$ svn co file:///path/to/newrepo /path/to/test/repo
Once that's done, dump the repository to a file.
$ svnadmin dump newrepo > my.dumpfile
Upload

???

Profit

Has it really been that long?

Mon 25 Jan 2010

Hello all, it's been a while. The Skinofstars site has been languishing in disuse for some time now. Like many others, I'm sure, I've found the transition to microblogging all too easy. Sometimes though, one wants to write something a little longer, and long gaps are not helpful when you finally think of something. So I guess I'm just posting to get rid of some writer's block really.

For the six months since my last post (my, that is a long time!) I've been working at Studio Lift in Reading. There are five of us; two designers, two coders and a multi-talented boss. We fill our days making like this and like this (bad linking! :) using Movable Type. This is the same blogging platform that is used by the BBC, The Guardian, ReadWriteWeb and various others. It comes in both Commercial and Open Source offerings and is perhaps one of the most venerable of blogging systems.

Does that mean I'm going to talk tech now?... sure (jump?). Movable Type (mt) has just released its 5th version. This places more emphasis on managing multiple blogs within a site structure. Very useful if you've ever tried to manage multiple blog instances (how many blogs do you think The Guardian has?). There is also a new emphasis on social communication (see Motion).

The system is written in Perl but because the publishing is static files you can drop pretty much any scripting language in without any problems. My current language of choice for server-side is PHP. You hook your language in with mt using their own markup derived syntax, and to be honest for a simple blog you never have to touch another language. Let's look at an example which will iterate over a collection of the last five entries:

<mt:Entries lastn="5">
  <h2><mt:EntryTitle /></h2>
  <p><mt:EntryBody /></p>
</mt:Entries>

There is documentation, with my favourite page being the tag reference, but otherwise there certainly isn't the same breadth of documentation as you would find with something like Wordpress. Perhaps the strong ties with the commercial side of the software (it was increasingly license prudish at the Open Source blogging party) have been a hindrance to a warm and fuzzy community embrace. Still, some big media hitters use it so they've certainly got something right.

Well, as I said, I work in Reading and my crappy car's wiper motor has broken so I've got to get up early and catch a bus. It's been nice to talk to you again. Thanks for putting up with my tech chatter, I expect that you'll get variation soon enough as we head towards the General Election :)

Night Night.

http://www.williamfiennes.com/

A Website Apart

Thu 25 Jun 2009

Hey all, just a quick one today. I just had a job interview and I was asked the question "which design websites do I frequent"? I ummed and erred a little before mentioning Digg and Slashdot. Not very design focused I know (except maybe Digg's design section). I also said that I trawl the blogs for Ideas, which is true. I neglected to mention one of my favourite sites though, one which each and every one of you should have in your Feed Reader: A List Apart. I love that site and I felt a little ashamed for forgetting it, so as penance I am reminding you all to check it out.

Final Degree Results

Tue 16 Jun 2009

Well if you're not going to blog about your degree results, what are you going to blog about? Firstly, I want to say that I'm happy to have done it in the three year time period. I know that is what is normally expected, but I know so many people that are having to take extra semesters or even years that I'm happy to have just got on with it so I'm able to move on to other things. Anyway, let's cut to the chase:

I got a First Class Honours BSc in Multimedia Systems and Communication, Media and Culture.

Shocked I was. Smiling, but shocked. I knew that it was mathematically possible, but I was really just aiming to get a good 2:1. Perhaps that could be seen as aiming low, but I didn't come to university to flog myself every night for a grade. Yes I wanted to do well, but what I really wanted was to spend the time thinking and learning more generally. I wanted to learn many things and being at university, in a learning environment, I could spend time discovering so many other things. For example, I developed a somewhat nasty habit of wanting to learn Linux stuff. Not just the technical system management but also how open source as an idea can be used in so many aspects of life... Anyway, the point is that I wasn't targeting a First, my target was to do everything well, just good and solid with treats thrown in here and there and still time to live a little.

I saw the grades themselves first before I knew the final result and I knew it was looking good. Nothing was below B+ and I had a nice collection of As. I was pleased to see my dissertation had got an A, I knew the coding was pretty good but I felt my sociological investigation of open source development had been a little weak. I'd also got 100% on my final web design module, which is unheard of (I suspect the tutor may have had some explaining to do there!) so I felt good for the 2:1. I checked my percentage and saw 68.7% (the boundaries are 40%=3rd, 50%=2:2, 60%=2:1 and 70%=1st). How tantalisingly close, within 2% of a First. I did as anyone would do and promptly posted what I believed to be my result to Twitter. Feeling pleased I decided to have a look around the results pages a little when I came across the line stating your degree result, "First Class Honours". I could do little more than point and look at my girl Emily who'd also just got her results, a 2:1 in Linguistics.

Well as it turns out, if you score four or more B+ or above in your final semester (my final year was 5 As and 3 B+s) then they lower the bar for a First to 69%. I'm guessing this is to allow for improvement over the two years. And of course we can't have decimal points in the percentage, they need to be rounded... up in my case. I got a First within a margin of 0.7% (they round everything up)! Twitter needed an update! With exclamation marks!!!

For one thing, the narrow margin certainly means I'm not complacent in the result. I know I could have done better and probably should have done. But it certainly makes the future look a little brighter. I don't expect prospective employers to be pulling my arm off, but when looking at future Masters I know that I will now have a greater choice. The biggest bonus though is how proud my family are, my Mum said she had a little cry. Not too bad a result for someone who left school at 16. Guess I need to hire a gown now.

Latest Web Design

Mon 04 May 2009

Hey All,

Just thought I would tell you all about a new site that I've created for homework. It's for an Oxford based band called Branch Immersion, a three piece acoustic outfit, some friends of mine. The site is hosted on the uni servers at the moment but I expect we'll host it here at SkinOfStars towers soon enough once they've bought their domain name and I've ported the static pages to Wordpress.

This is an original design and I must be honest, one I am very proud of. Please check it out at the temporary address (I'll update with the final address later):

http://wwwusers.brookes.ac.uk/06021836/u75131

http://skinofstars.com/branch_immersion

One Day Blog Hack

Tue 07 Apr 2009

Hey All,

I've decided to do a blog hack in a day and here you see the result. I was struggling with Drupal as a blogging platform, and frankly as an anything platform, so I decided to move to the decidedly easier Wordpress. I'm not saying there is anything wrong with Drupal, it's a great platform. The problem is that it's built for so many tricks that you have to give it a real shove when you want something simple. For example, handling images. On a content system one would have thought that would be an obvious feature, but with Drupal you have to go get a plugin. Madness I tell you! Not that getting Wordpress means I'll be blogging frantically, but it makes life a lot easier.

So here is how I got from Drupal 6 to Wordpress 2.7 in a day:

  1. Backup the Drupal database & import the data into Wordpress

Moving around between platforms is quite common, so you'll often find a script to aid you in moving database info from one structure to another. Wordpress has many such scripts built in for many platforms, but for Drupal I got my assistance from Mike Smullin. I had to make some minor changes, for example I added this SQL statement to change my Drupal post_type 'story' to Wordpress's 'post'

UPDATE wp_posts SET post_type = REPLACE(post_type, 'story', 'post');

Pretty easy stuff really. If you're going to do it yourself, make sure you do it locally on backup copies. I hosed a few before I got it right.

  2. Theme Hack

Ahh yes, the inevitable theme quandary. I had thought about what I wanted Skinofstars.com to look like for a while, but I wanted to do it reasonably quickly as I hate it when these things hang around. My layout plan was simple enough. Only one or two blog posts on the front page with info on my other nettyness, like tweets. I also knew that I'd want access to other pages (as you find in the Further section.. not sure on that name). So I searched some Wordpress themes and came across Grid Focus. It seemed to have the right level of minimalism that I was looking for as well as a reasonably suitable layout. In order for it to work for me though I had to make a few hacks, including some jQuery magic to include my Further section (hope you like the transitions) and some layout hacks for the differences between a narrow and wide content column (you'll see if you view this in single/comments mode).

  3. Content Update

Probably one of the most time consuming parts. Much of my old content was Uncategorised for no reason and lacked any tags. Many posts from back in the Blogger days didn't even have a title. I went through almost all of them (I've taken a break from the 1996 stuff) and finally managed to put these years of outpourings into some kind of order.

  4. Update to server

Well, that's just a bit of FTP and MySQL. Job done.

Boxfire 1

Tue 31 Mar 2009

I'm preemptively titling this post as Boxfire 1 as I know there is more info to come. My dissertation produced a website, or should I say that I have produced a website for my dissertation. Either way, it's a collaborative news filter for Oxford that relies on user interaction to find the most important news story for the area. Please try it out and tell me what you think:

http://boxfire.co.uk

Ruby On Rails, RSS and Atom feed parsing with Feed Normalizer and subsequent storage

Mon 23 Mar 2009

I've battled for days on this, but I now finally know how to parse feeds and store them in a database in Ruby On Rails. This won't be of much interest to the casual reader, but if you are scouring the web for an answer (as I was) then you will probably find this very useful:

class Feed < ActiveRecord::Base
require_association 'post'
require 'feed-normalizer'
require 'open-uri'
require 'rss/2.0'

belongs_to :user
has_many :posts, :dependent => :destroy

#put some other stuff here for feed validation etc


def refresh_all
    refresh(Feed.find(:all))
end

def refresh(feeds)
    feeds.each do |feed|
        rss = FeedNormalizer::FeedNormalizer.parse open(feed.uri)
        rss.entries.each  do |item|
            post = Post.new(:feed_id => feed.id)
            post.link = item.url or raise "post has no link tag"
            # use || for defaults: 'or' binds more loosely than '=', so the fallback would never be assigned
            post.title = item.title || "no title"
            post.content = item.content || "no text"
            post.created_at = item.date_published if item.date_published
            post.save
        end
    end
end

end
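
A note on fallback values like the ones in refresh: in Ruby, `or` binds more loosely than `=`, so `x = a or b` parses as `(x = a) or b` and the default is never assigned, whereas `||` binds more tightly than `=`. Here is a minimal sketch in plain Ruby (no Rails needed). The Item struct is a hypothetical stand-in for a parsed feed entry, not a Feed Normalizer class:

```ruby
# Hypothetical stand-in for a parsed feed entry (not a Feed Normalizer class).
Item = Struct.new(:title)

item = Item.new(nil) # an entry with no title

# `or` binds more loosely than `=`: this parses as (with_or = item.title) or "no title"
with_or = item.title or "no title"

# `||` binds more tightly than `=`: the default is part of the assigned value
with_pipes = item.title || "no title"

p with_or    # prints nil
p with_pipes # prints "no title"
```

The `or raise` form on the link line still behaves as intended, because there the right-hand side is only meant to run when the assignment yields nil.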

How We Read The Web

Sat 21 Feb 2009

I've been looking at some interesting research regarding the manner in which users read web pages. I'd come across click mapping previously (links below), software that records where users click, but the Nielsen Norman Group's eye tracking study follows where users actually look. Though their study tends to focus on commerce aspects (how much do users look at your adverts?) it is also fascinating stuff for those of us wanting to create clean and clear designs.

First thing that's worth noting, users rarely spend time looking where you want them to. They tend to follow common patterns, the most notable being the F pattern (a couple of quick horizontal scans of the page as we head down it). This means it is for us, the designer, to be aware of this and place our most important content in these areas. One might argue that a regular visitor would know where the most important information on a site is held, yet anyone with any sense knows that we want to make a site clear for everyone.

Now for banners and adverts/promotions. I'm not going to say how to get people to read them (in fact, I'd recommend getting AdBlock Plus to just cleanse the web of them!), but if you want to ensure people read all information on your page then make sure that it doesn't look like an advert. Users have an automatic tendency to ignore anything that looks like a promotion.

Next up I'll point to Nielsen's study on how pages are read. The key point is that people don't read, they scan. If you want to make life easy then you could put important anchor words in bold to aid the reader down the route you'd like them to take. When you're marking this up in HTML consider whether you should use the 'b' or the 'strong' tags. Are you merely creating a visual guide (b/i) or do you want to emphasise a word (strong/em)?

The final point I'll pick up from Nielsen is his discussion on screen sizes. Nothing surprising here, most people use 1024x768, a common laptop resolution. One thing I'd like to add to is the misconception that laying out a web page is like laying out a newspaper or a magazine. Screen sizes and resolutions are not fixed, so there is no 'above the fold' like we find in newspapers. Even different choices of preferred system fonts or different browsers have an impact on where the cut-off will be on different machines. Interestingly, Nielsen does point towards making site layouts fluid for different screen sizes. Though I don't consider this such a hard n' fast rule, I'd like to point you CSS monkeys to 456 Berea St's article on elastic layouts.

That's it for this week folks. Happy building.

Some further linkage:

  • A wordpress clickmap: http://www.rogerstringer.com/projects/wpclickmap

  • A more general use clickmap using PHP and JQuery: http://css-tricks.com/tracking-clicks-building-a-clickmap-with-php-and-jquery/

  • Strong or Bold? http://www.think-ink.net/html/bold.htm