Oatcake
Member
The following Perl script can be used to generate a series.info file from a Wikipedia-style series list.
This is the procedure I follow:-
- select and copy the section of text from the Wikipedia series list starting with the "Series <X>" title that is immediately before the episode list and down to the last episode. (Episode information is presented in tables).
- paste this text into a temporary text file, say "/tmp/a.txt" using a text editor.
- run the perl script with the command:- perl convert.pl /tmp/a.txt /tmp/series.info
This is what the "a.txt" file looks like, pasted from my browser...
Code:
Season 1 (1989–90)
Main article: The Simpsons (season 1)
No.
overall No. in
season Title Directed by Written by Original air date Prod.
code U.S. viewers
(millions)
1 1 "Simpsons Roasting on an Open Fire" David Silverman Mimi Pond December 17, 1989 7G08 26.7[1]
2 2 "Bart the Genius" David Silverman Jon Vitti January 14, 1990 7G02 24.5[1]
...
When you run the script, you'll get screen output like this...
Code:
Unparsed line Main article: The Simpsons (season 1)
Unparsed line No.
Unparsed line overall No. in
Unparsed line season Title Directed by Written by Original air date Prod.
Unparsed line code U.S. viewers
Unparsed line (millions)
Quickly scan this output to make sure that it doesn't contain any episode numbers or names. Only headings and other non-episode lines should appear here, because every line with an actual episode name should have been parsed by the script. After running, you'll get an output file like this...
Code:
Simpsons Roasting on an Open Fire ==> S01-01
Bart the Genius ==> S01-02
...
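If you want to check the output file afterwards, each line can be split back into its parts. This is only a rough sketch, assuming every entry has exactly the "Title ==> Sxx-yy" shape shown above; the parse_entry helper is my own hypothetical name and is not part of the script below.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of convert.pl):
# split one "Title ==> Sxx-yy" line into (title, series, episode).
# Returns an empty list when the line does not have that shape.
sub parse_entry {
    my ($line) = @_;
    if ($line =~ /^(.+) ==> S(\d+)-(\d+)$/) {
        return ($1, $2 + 0, $3 + 0);
    }
    return;
}

# Sample lines taken from the output shown above.
my @lines = (
    'Simpsons Roasting on an Open Fire ==> S01-01',
    'Bart the Genius ==> S01-02',
    '...',
);
for my $line (@lines) {
    my ($title, $series, $episode) = parse_entry($line);
    if (defined $title) {
        print "series $series, episode $episode: $title\n";
    }
    else {
        print "Not an entry: $line\n";
    }
}
```

Anything reported as "Not an entry" would be worth a look before using the file.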
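Here's the actual script. It simply looks for lines that update the series number, i.e. lines starting with "Series #" or "Season #", and for all lines that contain a number followed by a title string within double quotes. The number directly before the quoted title is taken as the episode number, so with the Wikipedia layout above that is the number in the season, not the overall number.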
Perl:
#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl convert.pl <file to scan> <series.info file>
if (@ARGV != 2) {
    die("ARGS <name to scan> <series.info file>\n");
}
my $file   = $ARGV[0];
my $opfile = $ARGV[1];
if ( ! -e $file ) {
    die "Could not find $file\n";
}

# Three-argument open, so odd characters in a file name are never read as a mode.
open(my $in,  '<', $file)   or die("Open $file: $!\n");
open(my $out, '>', $opfile) or die("Open $opfile: $!\n");

my $s    = 1;    # current series number
my $last = 0;    # last episode number seen in the current series
while (my $t = <$in>) {
    chomp $t;
    if ($t =~ /^(?:Series|Season) (\d+)/i) {
        # Heading line that updates the series number
        my $s2 = $1 + 0;
        if ($s2 > $s+1 || $s2 < $s) {
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
            print "!!!! Series jump $s to $s2 !!!\n";
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\n";
        }
        $s = $s2;
        $last = 0;
    }
    elsif ($t =~ /(\d+)\s+"([^"]+)/) {
        # Episode line: the number directly before the quoted title
        my $ep_num = $1;
        my $name   = $2;
        if ($ep_num < $last) {
            # Numbering restarted without a heading, so assume a new series
            if ($ep_num != 1) {
                die("Why has episode num gone down? $ep_num $name\n");
            }
            ++$s;
        }
        elsif ($ep_num != $last+1) {
            die("Why has episode num jumped? $ep_num $name\n");
        }
        $last = $ep_num;
        # Zero-pad series and episode to two digits, e.g. S01-02
        printf {$out} "%s ==> S%02d-%02d\n", $name, $s, $ep_num;
    }
    else {
        print "Unparsed line $t\n";
    }
}
close($in);
close($out) or die("Close $opfile: $!\n");
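
To see how the two patterns behave without touching any files, the matching can be tried on a few lines from the sample above. This is a minimal sketch rather than a replacement for the script: classify_line and format_entry are hypothetical helper names of my own, and the series/episode consistency checks are left out.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not in convert.pl): classify one line.
# Returns ('season', N), ('episode', N, title) or ('unparsed').
sub classify_line {
    my ($line) = @_;
    if ($line =~ /^(?:Series|Season) (\d+)/i) {
        return ('season', $1 + 0);
    }
    if ($line =~ /(\d+)\s+"([^"]+)/) {
        # $1 is the number directly before the opening quote
        return ('episode', $1 + 0, $2);
    }
    return ('unparsed');
}

# Hypothetical helper: build one series.info line with zero-padded numbers.
sub format_entry {
    my ($season, $episode, $title) = @_;
    return sprintf("%s ==> S%02d-%02d", $title, $season, $episode);
}

# Sample lines copied from the a.txt example.
my @sample = (
    'Season 1 (1989–90)',
    'Main article: The Simpsons (season 1)',
    '2 2 "Bart the Genius" David Silverman Jon Vitti January 14, 1990 7G02 24.5[1]',
);
my $season = 1;
for my $line (@sample) {
    my ($kind, $num, $title) = classify_line($line);
    if ($kind eq 'season') {
        $season = $num;
    }
    elsif ($kind eq 'episode') {
        print format_entry($season, $num, $title), "\n";
    }
    else {
        print "Unparsed line $line\n";
    }
}
```

The "Main article" line is reported as unparsed because the heading pattern is anchored to the start of the line, and the episode line picks up the second "2" since that is the number sitting right before the quoted title.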