PHP Script for RSS auto-discovery and OPML file generation
Hey All,
I recently got a reasonably sized list of blog URLs and wanted to import them all into a feed reader via OPML. There seemed to be a lack of conversion scripts for the batch job of URL -> find RSS link -> feed reader import file (I may be wrong, please let me know if I am :), so I made one in PHP. I guess this is a sort of automatic blogroller. I have only used it as a command line script, and I’m not recommending you use it ‘in the wild’, as one might say: I made little concession to security because I had a trusted list of URLs.
There are basically three steps to this:
- Take an input file of newline separated URLs, in my case blogs.
- Find (auto-discover) the associated RSS feed of each blog URL.
- Output an OPML file that you can import into a feed reader.
What it does:
- Takes a well formed list of newline separated blog URLs and turns it into an OPML file
- If the page source doesn’t contain a <link> to an RSS feed in the head, the URL isn’t added to the OPML
- Detects the <title> and adds that to the OPML text attribute, or uses the feed URL if <title> isn’t present
What it doesn’t:
- Check the RSS feed is valid XML
- Any other checking really :)
- Come with any sort of warranty/guarantee
Some of the key functions are from Keith Devens’ work. Thanks. Without any further ado, here is the script:
<?php
/*
* @author @skinofstars Kevin Carmody
* GPLv3 - http://www.gnu.org/copyleft/gpl.html
*
* this is really a command line app with no flags
* for turning a bunch of urls into an OPML file
*
* 1.takes input file of newline separated urls, normally blogs
* 2.finds (autodiscovery) associated rss of each url
* 3.outputs an OPML file for you to use in a feed reader
*/
// file config
$inputFile = "/path/to/URLlist.txt";
$outputFile = "/path/to/blogroll.opml";
// OPML config
$opmlTitle = "Some Select Blogs";
$opmlOwnerName = "Kevin Carmody";
$opmlOwnerEmail = "kevin@skinofstars.com";
/** no need to edit after this :) **/
$inHandle = @fopen($inputFile, "r");//read-only
$outHandle = @fopen($outputFile, "w");//write, truncating old output so a re-run doesn't append a second OPML document
if ($inHandle && $outHandle) {
$headerOut = opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail);
fwrite($outHandle,$headerOut);
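while (($buffer = fgets($inHandle, 4096)) !== false) {
$buffer = trim($buffer);//strip the trailing newline, it is not part of the URL
if ($buffer === '') {
continue;//skip blank lines
}
$source = getFile($buffer);
$rssURL = getRSSLocation($source, $buffer);
//htmlentities() can emit HTML-only entities such as &nbsp; which are not valid XML
$rssTitle = htmlspecialchars(trim((string) getTitleAlt($source)), ENT_QUOTES | ENT_SUBSTITUTE);
if($rssURL){
$rssURL = htmlspecialchars($rssURL, ENT_QUOTES | ENT_SUBSTITUTE);//feed URLs often contain &
//fall back to the feed URL when the page has no usable <title>
$entryOut = opmlEntry($rssURL, $rssTitle !== '' ? $rssTitle : $rssURL);
fwrite($outHandle,$entryOut);
//echo ".";//uncomment to print a dot to screen on each success, nice for seeing progress
} else {
echo "Fail on: ".$buffer."\n";
}
}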
$footerOut = opmlFooter();
fwrite($outHandle,$footerOut);
fclose($inHandle);
fclose($outHandle);
} else {
if(!$inHandle){
echo 'not got a handle on input file: '.$inputFile."\n";
die;
}
if(!$outHandle){
echo 'not got a handle on output file: '.$outputFile."\n";
die;
}
}
echo "\nAll done :)\n";
/**
* basic opml header
* @param string $opmlTitle
* @param string $opmlOwnerName
* @param string $opmlOwnerEmail
* @return string
*/
function opmlHeader($opmlTitle,$opmlOwnerName,$opmlOwnerEmail){
$oheader = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
."<opml version=\"1.1\">\n"
." <head>\n"
." <title>".$opmlTitle."</title>\n"
." <dateCreated>".date("r")."</dateCreated>\n"
." <ownerName>".$opmlOwnerName."</ownerName>\n"
." <ownerEmail>".$opmlOwnerEmail."</ownerEmail>\n"
." </head>\n"
." <body>\n";
return $oheader;
}
/**
* just returns the closing OPML footer
* @return string
*/
function opmlFooter(){
$ofooter = " </body>\n"
."</opml>";
return $ofooter;
}
/**
* creates an XML entry for the OPML file
* @param string $feedURL
* @param string $feedTitle
* @return string
*/
function opmlEntry($feedURL,$feedTitle){
$outline = " <outline text=\"".$feedTitle."\" type=\"rss\" xmlUrl=\"".$feedURL."\"/>\n";
return $outline;
}
/**
* returns the page title extracted from source
* @param string $html
* @return string
*/
function getTitleAlt($html) {
if (preg_match('/<title>(.*?)<\/title>/is',$html,$found)) {
$title = $found[1];
return $title;
} else {
return;
}
}
/**
* http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
* public domain
*/
function getFile($location){
$ch = curl_init($location);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Connection: close'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$response = curl_exec($ch);
curl_close($ch);
return $response;
}
/**
* http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
* public domain
*/
function getRSSLocation($html, $location){
if(!$html or !$location){
return false;
}else{
#search through the HTML, save all <link> tags
# and store each link's attributes in an associative array
preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
$links = $matches[1];
$final_links = array();
$link_count = count($links);
for($n=0; $n<$link_count; $n++){
$final_link = array();#reset per tag so attributes don't leak from the previous <link>
$attributes = preg_split('/\s+/s', $links[$n]);
foreach($attributes as $attribute){
$att = preg_split('/\s*=\s*/s', $attribute, 2);
if(isset($att[1])){
$att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
$final_link[strtolower($att[0])] = $att[1];
}
}
$final_links[$n] = $final_link;
}
#now figure out which one points to the RSS file
for($n=0; $n<$link_count; $n++){
$href = '';#reset for each tag
#not every <link> has rel/type/href, so guard against missing keys
$rel = isset($final_links[$n]['rel']) ? strtolower($final_links[$n]['rel']) : '';
$type = isset($final_links[$n]['type']) ? strtolower($final_links[$n]['type']) : '';
if($rel == 'alternate' and isset($final_links[$n]['href'])){
if($type == 'application/rss+xml' or $type == 'text/xml'){
#text/xml is a kludge to make the first version of this still work
$href = $final_links[$n]['href'];
}
if($href){
if(preg_match('#^https?://#i', $href)){ #if it's absolute
$full_url = $href;
}else{ #otherwise, 'absolutize' it
$url_parts = parse_url($location);
#reuse the page's own scheme so https:// sites work too
$scheme = isset($url_parts['scheme']) ? $url_parts['scheme'] : 'http';
$full_url = "$scheme://$url_parts[host]";
if(isset($url_parts['port'])){
$full_url .= ":$url_parts[port]";
}
if($href[0] != '/'){ #it's a relative link on the domain ($href{0} no longer works in PHP 8)
#a URL with no path counts as the site root
$path = isset($url_parts['path']) ? $url_parts['path'] : '/';
$full_url .= dirname($path);
if(substr($full_url, -1) != '/'){
#if the last character isn't a '/', add it
$full_url .= '/';
}
}
$full_url .= $href;
}
return $full_url;
}
}
}
return false;
}
}
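The regex in getRSSLocation() splits attributes on whitespace, so a <link> with a value like title="My Feed" can trip it up. A rough alternative, assuming PHP's DOM extension is available, is to let DOMDocument do the parsing. This is only a sketch: findFeedHref() is a made-up name, not part of the script above, and it only looks for application/rss+xml.

```php
<?php
/**
 * Hypothetical helper (not part of the script above): return the href of the
 * first <link rel="alternate" type="application/rss+xml"> in a page.
 * The href is returned as written in the page, so it may be relative.
 */
function findFeedHref(string $html): ?string {
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space separated tokens, e.g. "alternate home"
        $rel  = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));
        if (in_array('alternate', $rel, true) && $type === 'application/rss+xml' && $href !== '') {
            return $href;
        }
    }
    return null; // no RSS <link> found
}
```

The returned href would still need the same 'absolutize' step that getRSSLocation() does before it goes into the OPML.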
Though this was really a one-off for me, it may well be useful to others. Please let me know if you can think of ways to improve it and I will update accordingly.
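One possible improvement, since the script doesn't check the feed at all: fetch the discovered feed and see whether it parses as XML with an <rss> root. The sketch below assumes the SimpleXML extension is loaded. looksLikeRss() is a hypothetical name, and it only checks that the XML is well formed, not that the feed is valid RSS.

```php
<?php
/**
 * Hypothetical helper (not in the script above): true when $xml is
 * well-formed XML whose root element is <rss>.
 */
function looksLikeRss(string $xml): bool {
    if (trim($xml) === '') {
        return false; // empty response, e.g. a failed download
    }
    // Collect parse errors quietly instead of emitting warnings
    $previous = libxml_use_internal_errors(true);
    $root = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($root === false) {
        return false; // not well-formed XML
    }
    return strtolower($root->getName()) === 'rss';
}
```

In the main loop this could be called as looksLikeRss((string) getFile($rssURL)) before writing the entry, at the cost of one extra request per blog.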
Thanks, Kevin