Script to check urls availability

Today I just prepared a script to check the urls availability.

Scenario:

You have a web page (info page) listed all of your important tools and services (urls). And every time post machine maintenance, you have to check those urls one by one to check their availability (in other word, whether the links are broken or not).

Solution:

Perl provides:

1. LWP::Simple — It can help you fetch the contents of a page and it provides header() method to help you check the connection. the header() method fetches the remote document’s matainformation without downloading the whole document so it is very fast.

use LWP::Simple;

$content = get ($url);

Then it will fetch the content of $url and store it in $content.

2. HTML::LinkExtor — It help extract the urls from content.

The logic:

1. Get the content of the page.

2. Parse the urls from the page.

3. Check connection with those urls.

4. Generate reports. (In my script, it will generate the report and name it with a timestamp. Also, it will just retain only 25 reports. Besides that, I utilize $parser->parse_file to load content from local filesystem supposed even your info page is unavailable as well as provides a text file to help import additional urls which are not listed on your info page!).

Code:

#!/local/bin/perl -w

use CGI qw(:standard);
use CGI::Carp qw ( fatalsToBrowser );
use LWP::Simple qw(!head);
use HTML::LinkExtor;
use Time::localtime;

my $query = new CGI;

$reportlink = "";

my $debug = 1;
sub debugPrint
{
 if ( $debug eq 0 ) {
  print "@_ \n";
 }
}

backup();
my $base_url = qq(http://abc.com/all_tools_information.html);
my $base_backup = "all_tools_information.html";
my $report_loc = "/home/abc/report";
my $additional_url = "urls.txt";
my $datestr = GetDate();
my $filename = "Links_Status_Report_" . "$datestr" . ".txt"; 
my $report = "$report_loc" . "/" . "$filename";
open OUT, ">$report";
debugPrint $report;

$parser = HTML::LinkExtor->new(undef,$base_url);
$html = get($base_url);
if (defined($html)){
	debugPrint "It is parsing remote";
	$parser->parse($html);
} else {
	debugPrint "It is parsing local";
	$parser->parse_file("$base_backup");
}

@links = $parser->links();
foreach $linksarray (@links) {
	my @element = @$linksarray;
	my $elt_type = shift @element;
	while (@element) {
		my ($attr_name, $attr_value) = splice(@element, 0, 2);
		chomp($attr_value);
		debugPrint $attr_value;
		if ($attr_value =~ m/(ftp|http|https).*(html|pl|jsp)$/){
			print OUT "$attr_value ", LWP::Simple::head($attr_value) ? "OK" : "!!!***BAD***!!!", "\n";
		}
		#$seen{$attr_value}++;
	}
}

open IN, $additional_url or die "can't open $additional_url";
while (<IN>){
	chomp($_);
	if($_ =~ m/\S+/){
		print OUT "$_ ", LWP::Simple::head($_) ? "OK" : "!!!***BAD***!!!", "\n";
	}
}
close IN;
close OUT;
system("chmod 777 $report");

print $query->header ( );
print <<END_HTML;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Tool Tracker</title>
<style type="text/css">
img {border: none;}
</style>
</head>
<body>
<p>*** Tool Tracker - Checking Availability ***</p>
<p><a
href="http://abc.com/report/$filename">$filename</a></p>
</body>
</html>
END_HTML

sub GetDate {
    my $tm                   = localtime;
    my ($year, $month, $day) = ($tm->year+1900, ($tm->mon)+1, $tm->mday);
    my ($hour, $min, $sec)   = ($tm->hour, $tm->min, $tm->sec);

    my $date = sprintf("%4s%2s%2s", ${year},${month}, ${day});
    my $time = sprintf("[%2s:%2s:%2s]", ${hour}, ${min}, ${sec});
    my $date_time = sprintf("%4s%2s%2s%2s%2s%2s", ${year},${month}, ${day}, ${hour}, ${min}, ${sec});

    #
    ## Pad spaces with zeros
    #
    $date =~ s/ /0/g;
    $time =~ s/ /0/g;
    $date_time =~ s/ /0/g;

    #
    ## Return correct date and/or time string
    #
    return $date_time;
}

sub backup{
	my @files = <*>;
	my @timestamps = ( );
	my $count = 0;
	my $lowest = 0;
	 foreach $file (@files) {
	   chomp($file);
	   if ($file =~ /Links_Status_Report/){
		   $count++;
		   my $temp = 0; 
		   $temp = substr($file, 20, 14);
		   if ($count == 1) {
			$lowest = $temp;
		   }
		   push @timestamps, $temp;
		   debugPrint("lowest: $lowest \n");
	   }
	} 
	if ($count >= 24){
	   system("rm Links_Status_Report_$lowest.txt");
	}
}

# pls. note,
You might encounter issue described in http://blog.netscraps.com/bugs/prototype-mismatch-sub-mainhead-vs-none.html. In my script, I have fixed it.