Dedup series.info file from a Wikipedia episode list

Oatcake

The following Perl script can be used to generate a series.info file from a Wikipedia-style series list.

This is the procedure I follow:-
  1. Select and copy the section of text from the Wikipedia series list, starting with the "Series <X>" title immediately before the episode list and continuing down to the last episode. (Episode information is presented in tables.)
  2. Paste this text into a temporary text file, say "/tmp/a.txt", using a text editor.
  3. Run the Perl script with the command:- perl convert.pl /tmp/a.txt /tmp/series.info
The script is very basic, so there's probably plenty of scope for improvement. Consider it to be under a "BSD" licence, so please feel free to use it, edit it and post your own improvements.

This is what the "a.txt" file looks like, pasted from my browser...
Code:
Season 1 (1989–90)
Main article: The Simpsons (season 1)
No.
overall    No. in
season    Title    Directed by    Written by    Original air date    Prod.
code    U.S. viewers
(millions)
1    1    "Simpsons Roasting on an Open Fire"    David Silverman    Mimi Pond    December 17, 1989    7G08    26.7[1]
2    2    "Bart the Genius"    David Silverman    Jon Vitti    January 14, 1990    7G02    24.5[1]
...

When you run the script, you'll get screen output like this...
Code:
Unparsed line Main article: The Simpsons (season 1)
Unparsed line No.
Unparsed line overall    No. in
Unparsed line season    Title    Directed by    Written by    Original air date    Prod.
Unparsed line code    U.S. viewers
Unparsed line (millions)

Just quickly scan this output to make sure that it doesn't contain any episode numbers/names. All lines with actual episode names should be parsed by the script. After running, you'll get an output file like this...
Code:
Simpsons Roasting on an Open Fire ==> S01-01
Bart the Genius ==> S01-02
...

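Since each output line is just "<title> ==> S<series>-<episode>", the file is easy to read back if you want to check it or post-process it. A minimal sketch (purely an illustration of the format, not part of any tool):
Perl:
#!/usr/bin/perl
use strict;
use warnings;

# Read series.info lines of the form:  Title ==> Sss-ee
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^(.+) ==> S(\d+)-(\d+)$/) {
        my ($title, $series, $episode) = ($1, $2, $3);
        printf "series %d, episode %d: %s\n", $series, $episode, $title;
    }
}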

Here's the actual script. It simply looks for lines that update the series number, like "Series #", and all lines that contain a number followed by a title string within double quotes...
Perl:
#!/usr/bin/perl -w

use strict;

if ($#ARGV != 1) {
    die("ARGS  <name to scan> <series.info file>\n");
}
my $file=$ARGV[0];
my $opfile=$ARGV[1];

if ( ! -e $file ) {
    die "Could not find $file\n";
}

my $in;
unless (open($in, '<', $file)) {
    die("Cannot open $file: $!\n");
}

unless (open(OUT, '>', $opfile)) {
    die("Cannot open $opfile: $!\n");
}

my $s=1;     # current series number
my $last=0;  # episode number seen on the previous parsed line
while (<$in>) {
    my $t = $_;
    chomp $t;

    # a "Series N" / "Season N" heading starts a new series
    if ($t =~ /^Series (\d+)/i ||
        $t =~ /^Season (\d+)/i) {
        my $s2=$1+0;
        if ($s2 > $s+1 || $s2 < $s) {
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
            print "!!!! Series jump $s to $s2 !!!\n";
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\n";
        }
        $s = $s2;
        $last = 0;
    }
    # an episode row: a number directly before a double-quoted title
    elsif ($t =~ /([\d]+)\s+\"([^\"]+)/) {
        my $ep_num=$1;
        my $name=$2;

        if ($ep_num < $last) {
            if ($ep_num != 1) {
                die("Why has episode num gone down? $ep_num  $name\n");
            }
            ++$s;
        }
        elsif ($ep_num != $last+1) {
            die("Why has episode num jumped? $ep_num  $name\n");
        }
        $last = $ep_num;

        # zero-pad the series and episode numbers to two digits
        printf OUT "%s ==> S%02d-%02d\n", $name, $s, $ep_num;
    }
    else {
        print "Unparsed line $t\n";
    }
}
close($in);
close(OUT);
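One subtlety: episode rows carry two numbers (overall and in-season), and the episode regex deliberately captures the one immediately before the quoted title, i.e. the in-season number, because the captured digits must be followed by whitespace and then a double quote. A quick standalone check (the row is a sketch of a later-season line, where the two numbers differ):
Perl:
#!/usr/bin/perl
use strict;
use warnings;

my $line = qq{14    1    "Bart Gets an F"    ...};
if ($line =~ /([\d]+)\s+\"([^\"]+)/) {
    print "episode=$1  title=$2\n";   # prints: episode=1  title=Bart Gets an F
}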
 
You can run this on the Humax if you install a Perl.
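For example (a sketch, assuming the CF's usual opkg package manager):
Code:
# opkg install perl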

For the perl package, replace the first line with:
Code:
#!/bin/env perl

# as we can't pass -w with env
use warnings;
...
In future CF versions, /usr/bin/env should exist, which is more conventional (e.g. Debian).

For the microperl package, replace the first three lines with:
Code:
#!/bin/env microperl
...
This Perl doesn't know about use strict; etc.

Falling back to Awk (the gawk package, or the Awk in the busybox package), you can process an entire episode-list webpage from Wikipedia with this page2series script, adapted from the OP's Perl (but obviously you can't parse HTML with regular expressions):
Code:
#!/bin/sh
# Would prefer '#!/usr/bin/env awk -f' but (a) that tries to run a file 
# named 'awk -f' (b) env isn't there yet in CF. So punt to run this file 
# with awk from sh. awk sees this as a pattern, so add && 0 to prevent 
# the default print action. Also prefer --exec|-E to -f but not standard.
exec awk -f "$0" "$@" && 0

function die( msg, code) {
	print msg >>"/dev/stderr";
	# "code" is an awk-style optional local; default to a failure exit status
	exit (code ? code : 1);
}

BEGIN {
	# found tracks the parser state: 0 = seeking a Series/Season heading,
	# 1 = seeking an episode table, 2 = in table, 3 = in row,
	# 4 = in the episode cells, 5 = row done
	s=1; last=0; found=0; eps[1] = 0;
	line = "";
	}

	{
	line=line $0;
	}

# IGNORECASE is not standard
found == 0 && match(line,/<[Ss][Pp][Aa][Nn] [^>]+>([Ss][Ee][Rr][Ii][Ee][Ss]|[Ss][Ee][Aa][Ss][Oo][Nn])[[:blank:]]+[0-9]+/) > 0 {
		line = substr( line, RSTART);
		sub( /^<[^>]+>[[:graph:]]+[^0-9]/, "", line );	# strip the tag and the Series/Season word, leaving the number
		match( line, /[0-9]+/ );
		s2 = substr( line, RSTART, RLENGTH );
        if (s2 > s+1 || s2 < s) {
            printf "%s\n%s\n%s\n", 
            	"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
            	sprintf("!!!! Series jump %s to %s !!!", s, s2), 
            	"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n" >>"/dev/stderr";
		}
        s = s2;
        last = 0;
        found = 1;
    }

found == 1 && match(line,/<[Tt][Aa][Bb][Ll][Ee][[:blank:]]+[Cc][Ll][Aa][Ss][Ss]="([^"]+ )?wikiepisodetable( [^"]+)?"[^>]*>/) > 0 {
		found = 2; # in table
		line = substr( line, RSTART+RLENGTH);
	}

found == 2 && match(line,/<[Tt][Rr]( [^>]*)?>/) > 0 {
		found = 3; # in row
		line = substr( line, RSTART+RLENGTH);
	}

found == 3 && match( line, /<[Tt][Hh][[:blank:]]([^>]*[[:blank:]])?id="ep[0-9]+"([[:blank:]][^>]*)?>/) > 0 {
		# episode cells
		line = substr(line, RSTART+RLENGTH);
		found = 4;
	}
	
# match the global episode number if the per-series number is absent
found == 4 && (match( line, /<[Tt][Dd]([[:blank:]][^>]*)?>[0-9]+([[:blank:]]|<[^/>]+\/?>|[0-9])*<\/[Tt][Dd]>/) > 0 ||
				match( line, /^[0-9]+([[:blank:]]|<[^/>]+\/?>|[0-9])*<\/[Tt][Hh]>/) > 0) {
		# although the episode number is embedded here, some rows cover more than one episode, so grab the whole cell text
		eptext = substr(line, RSTART, RLENGTH); 
		line = substr(line, RSTART+RLENGTH); 
		gsub(/<[^>]+>+/, " ", eptext);
		n = split( eptext, eps );	# n = how many episode numbers this row covers
		ep_num = 0 + eps[1];
		if (ep_num < last) {
			if (ep_num != 1) {
			    die(sprintf("Why has episode num gone down? %s %s %s", ep_num, s, last));
			}
			++s;
		} else if (ep_num != last+1) {
			    die(sprintf("Why has episode num jumped? %s %s %s", ep_num, s, last));
		}
		last = ep_num;
	}

# assume the name follows the episode number
found == 4 && match( line, /<[Tt][Dd][[:blank:]]([^>]*[[:blank:]])?class="summary"([[:blank:]][^>]*)?>".+<\/[Tt][Dd]>/) > 0 {
		name = substr(line, RSTART, RLENGTH);
		sub(/^<[^>]+>/, "", name );
		line = substr(line, RSTART+RLENGTH-length(name));
		match( name, /<\/[Tt][Dd]>/);
		name = substr( name, 1, RSTART - 1);
		line = substr(line, RSTART);
		# remove <sup>...</sup>
		while ( 0 < gsub(/<sup [^>]+>.*<\/sup>/, "", name ));
		# remove other nested tags except <a>...</a>
		while ( 0 < gsub(/<[^Aa][^>]+>[^<]*<\/[^>]+>/, "", name ));
		# remove all remaining tags
		while ( 0 < gsub(/<[^>]+>/, "", name ));
		# unless show named after its segments, strip ""
		if (name !~ /",[[:blank:]]*"/) 
			gsub( /(^"|"[[:blank:]]*$)/, "", name );
		# iterate in split order; "for (i in eps)" is unordered in awk
		for (i = 1; i <= n; i++) {
			last = eps[i];
			printf "%s ==> S%02d-%02d\n", name, s, last;
		}
		
		found = 5;
	}

found >= 4 && match(line,/<\/[Tt][Rr]>/) > 0 {
		found = 2;
		line = substr( line, RSTART+RLENGTH);
	}

found >= 2 && match(line,/<\/[Tt][Aa][Bb][Ll][Ee]>/) > 0 {
		found = 0;
		line = substr( line, RSTART+RLENGTH);
	}

END { close("/dev/stderr"); }
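For example, to put it on the path and make it executable (a sketch; the source filename is an assumption, and /mod/bin is on the path under the CF):
Code:
# cp page2series.awk /mod/bin/page2series
# chmod +x /mod/bin/page2series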
Once this is saved on the path and made executable (as above), you can then do wget -q -O - '<wiki episode url>' | page2series:
Code:
# wget -q -O - 'https://en.wikipedia.org/wiki/List_of_French_and_Saunders_episodes' | /mod/src/page2series.awk
Beauty and the Beast ==> S01-01
Tricks ==> S01-02
Julie Walters ==> S01-03
Ratings ==> S01-04
Blue Peter ==> S01-05
Killing Time ==> S01-06
Decades ==> S02-01
Cable TV ==> S02-02
Removals ==> S02-03
...
Batman ==> S05-04
Pulp Fiction ==> S05-05
The Quick and the Dead ==> S05-06
Dr. Quimn, Mad Woman ==> S05-07
Back at the BBC ==> S06-01
Three-Letter Words ==> S06-02
The Queen ==> S06-03
Offers ==> S06-04
Minutes ==> S06-05
After Show Party ==> S06-06
#
If you do that with the Simpsons page, it looks as if it has hung while it works through the 700k page (presumably because the accumulated line buffer is only trimmed when a pattern matches, so matching slows as the buffer grows), but it does come back.

Or you can use TheTVDB ...
 