Skin of Stars

Kevin Carmody on machines, media and miscellanea.

PHP Script for RSS auto-discovery and OPML file generation

Hey All,

I recently got hold of a reasonably sized list of blog URLs and wanted to import them all into a feed reader via OPML. There seemed to be a lack of conversion scripts for the batch URL -> find RSS link -> feed reader import file job (I may be wrong, please let me know if I am :) ), so I made one in PHP. I guess it’s a sort of automatic blogroller. I have only used it as a command line script and I’m not recommending you use it in ‘the wild’, as one might say: I made little concession to security because I had a trusted list of URLs.

There are basically three steps to this:

  1. Take an input file of newline separated URLs, in my case blogs
  2. Find (auto-discover) the associated RSS feed of each blog URL
  3. Output an OPML file that you can use to import into a feed reader

What it does:

  • Takes a well formed list of newline separated URLs of blogs and turns it into an OPML
  • If the URL source doesn’t contain a <link> to an RSS feed in the head it doesn’t add it to the OPML
  • Detects the <title> and adds that to the OPML text field, or uses the URL if <title> isn’t present
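
Since the script expects a well formed list, it can help to tidy the input file before running it. Here is a minimal shell sketch, assuming a POSIX shell with tr, sed, grep and sort available. The helper name clean_url_list and the file names are made up for illustration and are not part of the script itself.

```shell
#!/bin/sh
# clean_url_list FILE
# Print the http(s) URLs found in FILE, one per line, with surrounding
# whitespace and Windows line endings removed and duplicates dropped.
# (Hypothetical helper, not part of the original PHP script.)
clean_url_list() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -E '^https?://' \
        | sort -u
}

# Example (file names are placeholders):
#   clean_url_list raw-list.txt > URLlist.txt
```

Anything that doesn’t start with http:// or https:// is silently dropped, so it is worth comparing line counts before and after if the list matters.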

What it doesn’t:

  • Check that the RSS feed is valid XML
  • Any other checking really :)
  • Come with any sort of warranty/guarantee

Some of the key functions are from Keith Devens’ work. Thanks.

Without any further ado, here is the script:

<?php
/*
 * @author @skinofstars Kevin Carmody
 * GPLv3 - http://www.gnu.org/copyleft/gpl.html
 *
 * this is really a command line app with no flags
 * for turning a bunch of urls into an OPML file
 *
 * 1.takes input file of newline separated urls, normally blogs
 * 2.finds (autodiscovery) associated rss of each url
 * 3.outputs an OPML file for you to use in a feed reader
 */

// file config
$inputFile = "/path/to/URLlist.txt";
$outputFile = "/path/to/blogroll.opml";

// OPML config
$opmlTitle = "Some Select Blogs";
$opmlOwnerName = "Kevin Carmody";
$opmlOwnerEmail = "kevin@skinofstars.com";

/** no need to edit after this :)  **/
$inHandle = @fopen($inputFile, "r");//read-only
$outHandle = @fopen($outputFile, "w");//write, truncating any earlier run so the OPML isn't doubled up

if ($inHandle && $outHandle) {
	$headerOut = opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail);
	fwrite($outHandle,$headerOut);

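	while (($buffer = fgets($inHandle, 4096)) !== false) {
		$url = trim($buffer);//fgets keeps the trailing newline, so strip it
		if ($url === '') {
			continue;//skip blank lines
		}
		$source = getFile($url);
		$rssURL = getRSSLocation($source, $url);
		if($rssURL){
			//escape for XML attributes; htmlentities() would emit HTML-only entities such as &eacute;
			$rssURL = htmlspecialchars($rssURL, ENT_QUOTES);
			$rssTitle = htmlspecialchars(trim((string) getTitleAlt($source)), ENT_QUOTES);
			//fall back to the feed URL when the page has no usable <title>
			$entryOut = opmlEntry($rssURL, $rssTitle !== '' ? $rssTitle : $rssURL);
			fwrite($outHandle,$entryOut);
			//echo ".";//uncomment to print a dot to screen on each success, nice for seeing progress
		} else {
			echo "Fail on: ".$url."\n";
		}
	}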
	$footerOut = opmlFooter();
	fwrite($outHandle,$footerOut);

	fclose($inHandle);
	fclose($outHandle);
} else {
	if(!$inHandle){
		echo 'not got a handle on input file: '.$inputFile."\n";
		die;
	}
	if(!$outHandle){
		echo 'not got a handle on output file: '.$outputFile."\n";
		die;
	}
}

echo "\nAll done :) \n";

/**
 * basic opml header
 * @param string $opmlTitle
 * @param string $opmlOwnerName
 * @param string $opmlOwnerEmail
 * @return string
 */
function opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail){
	$oheader = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
	."<opml version=\"1.1\">\n"
	."	<head>\n"
	."		<title>".$opmlTitle."</title>\n"
	."		<dateCreated>".date("r")."</dateCreated>\n"
	."		<ownerName>".$opmlOwnerName."</ownerName>\n"
	."		<ownerEmail>".$opmlOwnerEmail."</ownerEmail>\n"
	."	</head>\n"
	."	<body>\n";
	return $oheader;
}

/**
 * basic opml footer
 * @return string
 */
function opmlFooter(){
	$ofooter = "  </body>\n"
	."</opml>";
	return $ofooter;
}

/**
 * creates an XML entry for the OPML file
 * @param string $feedURL
 * @param string $feedTitle
 * @return string
 */
function opmlEntry($feedURL,$feedTitle){
	$outline = "    <outline text=\"".$feedTitle."\" type=\"rss\" xmlUrl=\"".$feedURL."\"/>\n";
	return $outline;
}

/**
 * returns the page title extracted from source
 * @param string $html
 * @return string
 */
function getTitleAlt($html) {
	if (preg_match('/<title>(.*?)<\/title>/is',$html,$found)) {
		$title = $found[1];
		return $title;
	} else {
		return;
	}
}

/**
 * http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
 * public domain
 */
function getFile($location){
	$ch = curl_init($location);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
	curl_setopt($ch, CURLOPT_HTTPHEADER, array('Connection: close'));
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_TIMEOUT, 15);
	$response = curl_exec($ch);
	curl_close($ch);
	return $response;
}

/**
 * http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
 * public domain
 */
function getRSSLocation($html, $location){
	if(!$html or !$location){
		return false;
	}else{
		#search through the HTML, save all <link> tags
		# and store each link's attributes in an associative array
		preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
		$links = $matches[1];
		$final_links = array();
		$link_count = count($links);
		for($n=0; $n<$link_count; $n++){
			$final_link = array();#start fresh so attributes don't leak in from the previous <link>
			$attributes = preg_split('/\s+/s', $links[$n]);
			foreach($attributes as $attribute){
				$att = preg_split('/\s*=\s*/s', $attribute, 2);
				if(isset($att[1])){
					$att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
					$final_link[strtolower($att[0])] = $att[1];
				}
			}
			$final_links[$n] = $final_link;
		}
		#now figure out which one points to the RSS file
		for($n=0; $n<$link_count; $n++){
			$link = $final_links[$n];
			#not every <link> carries rel, type and href, so guard each lookup
			$rel = isset($link['rel']) ? strtolower($link['rel']) : '';
			$type = isset($link['type']) ? strtolower($link['type']) : '';
			$href = '';
			if($rel == 'alternate'){
				if($type == 'application/rss+xml' and isset($link['href'])){
					$href = $link['href'];
				}
				if(!$href and $type == 'text/xml' and isset($link['href'])){
					#kludge to make the first version of this still work
					$href = $link['href'];
				}
				if($href){
					if(preg_match('#^https?://#i', $href)){ #if it's absolute
						$full_url = $href;
					}else{ #otherwise, 'absolutize' it
						$url_parts = parse_url($location);
						#reuse the page's scheme so https:// pages work too
						$scheme = isset($url_parts['scheme']) ? $url_parts['scheme'] : 'http';
						$full_url = "$scheme://$url_parts[host]";
						if(isset($url_parts['port'])){
							$full_url .= ":$url_parts[port]";
						}
						if($href[0] != '/'){ #it's a relative link on the domain
							#a bare http://host URL has no path, so default to the root
							$path = isset($url_parts['path']) ? $url_parts['path'] : '/';
							$full_url .= dirname($path);
							if(substr($full_url, -1) != '/'){
								#if the last character isn't a '/', add it
								$full_url .= '/';
							}
						}
						$full_url .= $href;
					}
					return $full_url;
				}
			}
		}
		return false;
	}
}

Though this was really a one-time hit for me, it may well be useful to others. Please let me know if you can think of ways to improve it and I will update accordingly.

Thanks,
Kevin

Moving hosted SVN, the trials and the tribulations

Over the last few weeks Mike Robinson and I have discussed and settled on an SVN restructure to improve our build and deployment processes. I would encourage you to read a bit more about that (and various other geekness) at his blog.

So I’ve spent this week moving our company hosted SVN from Beanstalk to Springloops. I feel I’ve been swinging between hell and zen, but the learning has been awesome. As a summary of what I’ve found, I thought I’d give a quick walkthrough of how I did it.

Most of this stuff was the usual dump/load cycle, but there are a couple of things which needed some extra attention.

Firstly, both Beanstalk and Springloops have the ability to export and import SVN dumps via easy-to-use web interfaces. This really could be as easy as download, upload. Try that first.

We had a couple of problems though. Previously we had a mishmash of company repos and project repos; these had to be merged and sorted. We also had different usernames on each system(!) which meant that during an import previous commits were not matched to current system users.  The author attribute needed to be updated for all previous revisions.

This was all done on OS X, but should be applicable to any Unix-like system with the appropriate libraries, etc. So we’ve got our dump from Beanstalk, now we just need to create a local repository to do our work on (always work on a backup!!).

$ svnadmin create --pre-1.4-compatible newrepo

We use the pre-1.4-compatible flag to get around filesystem format changes between SVN versions. These changes can cause errors (svn: Expected FS format '2'; found format '3') when propset-ting revision properties, in my case author/committer names.

Next job, import your dump file.

$ svnadmin load newrepo < dumpfile

If you’re looking to do the merging, as I was, then make yourself a directory in your repository (the usual ‘svn mkdir’ commands) and load the dump into it in the following fashion:

$ svnadmin load newrepo --parent-dir myfolder < seconddumpfile
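
Put together, the mkdir and load steps above might look like the following sketch. The helper name merge_dump and the commit message are my own inventions, and it assumes you pass an absolute path to the repository so that the file:// URL is valid.

```shell
#!/bin/sh
# merge_dump REPO_PATH FOLDER DUMPFILE
# Create FOLDER at the root of the local repository, then load DUMPFILE
# beneath it. REPO_PATH must be absolute so the file:// URL is valid.
# (Hypothetical helper; the argument values below are placeholders.)
merge_dump() {
    repo=$1 folder=$2 dump=$3
    # svn mkdir against a URL commits straight away, hence the -m message
    svn mkdir "file://$repo/$folder" -m "Add $folder as a parent for imported history" || return 1
    svnadmin load "$repo" --parent-dir "$folder" < "$dump"
}

# Example:
#   merge_dump /path/to/newrepo myfolder seconddumpfile
```

Repeat with a different folder name for each extra dump you want to fold in.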

OK, we’ve done our merging, so now we’re going to update our author histories. The SVN manual explains how to do this one revision at a time with a propset. It also covers other recursive actions such as deleting files, which isn’t our concern here. For changing authors in bulk, I found a tidy script called svn-author-tweak.py from CollabNet.
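
For reference, the one-at-a-time propset approach from the manual can be wrapped in a loop. This is only a rough sketch and not the svn-author-tweak.py script: rewrite_author is a made-up helper name, the repository path must be absolute, and it assumes the repository’s pre-revprop-change hook has been enabled, since Subversion refuses revision property changes without it.

```shell
#!/bin/sh
# rewrite_author REPO_PATH OLD_NAME NEW_NAME
# Walk every revision of a local repository and replace OLD_NAME with
# NEW_NAME in the svn:author revision property.
# (Hypothetical helper; assumes the pre-revprop-change hook allows the edit.)
rewrite_author() {
    repo=$1 old=$2 new=$3
    youngest=$(svnlook youngest "$repo") || return 1
    rev=1   # revision 0 has no author, so start at 1
    while [ "$rev" -le "$youngest" ]; do
        if [ "$(svnlook author -r "$rev" "$repo")" = "$old" ]; then
            svn propset --revprop -r "$rev" svn:author "$new" "file://$repo" || return 1
        fi
        rev=$((rev + 1))
    done
}

# Example (names are placeholders):
#   rewrite_author /path/to/newrepo old-username new-username
```

Run it once per username that differs between the two systems.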

If you want to give your repository a check before you upload it, just check it out to a local test directory.

$ svn co file:///path/to/newrepo /path/to/test/repo
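
While you’re checking, it is worth confirming that the author names came out the way you intended. A small sketch, assuming the usual "rN | author | date" layout of svn log -q output; list_authors is a hypothetical helper name.

```shell
#!/bin/sh
# list_authors REPO_URL
# Print each distinct committer name found in the repository log.
# (Hypothetical helper; relies on the "rN | author | date" lines of svn log -q.)
list_authors() {
    svn log -q "$1" \
        | awk -F' [|] ' '/^r[0-9]+ [|] /{ print $2 }' \
        | sort -u
}

# Example:
#   list_authors file:///path/to/newrepo
```

If an old username still shows up in the list, some revisions were missed.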

Once that’s done, dump the repository to a file.

$ svnadmin dump newrepo > my.dumpfile

Upload

???

Profit

About

My name is Kevin Carmody and I live in Oxford, United Kingdom. I am a web developer with a penchant for community sites and a pedantry for open standards.

This here is a collection of my thoughts and musings, a spot for pooling a little of what's rattling around. Thanks for taking the time to visit and I hope you enjoy your stay.