[dedup] command line de-duplication


I've updated the dedup (command line de-duplication) package. This was one of the very early packages and, for most people, has long been superseded by the web interface de-duplication. It's bothered me for a while that the two used different logic, so this update unifies them. Both now use the same backend modules for the logic, so they will stay in step.

I use the command line tool to automatically batch process recordings as they are completed - I'm planning to roll that up into an auto-dedup package when I get some time.
Hi af123,

Thanks for the update. I thought I'd mention a couple of modifications I've made to my running version of this, in case they prove useful.

1. Another common prefix to remove: 'CBBC.' This often appears on episodes of Shaun the Sheep recorded in the morning that are also shown at the same time on the CBBC channel.

2. I added another line to process.jim to remove question marks from the file name as well as the other special characters, since rsync does not generally like question marks and usually fails to transfer the files.

    # Escape special characters to create the filename.
    regsub -all -- {[\/ &]} $syn "_" fn
    regsub -all -- {[?]} $fn "" fn

I may not have done this the absolute best way, but it works as far as I can tell. Adding the question mark to the character class in the first regsub results in it being replaced by an underscore, which looks odd if it is the last character in the name.

I also still use a modified copy of the old bash script version (/mod/bin/dedup) periodically in a crontab. I pass it a second parameter (in addition to '-yes') telling it which folder to process. If I can also figure out how to get it to remove question marks I'll be onto a winner, but it's a steep learning curve!
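
For the bash version, the same clean-up could be sketched roughly like this. It is only an illustration, not code from the old /mod/bin/dedup script: the function name sanitise_name is made up for the example, and it assumes a sed is available on the box. It mirrors the two regsub lines above: '/', space and '&' become underscores, then any '?' is deleted rather than replaced.

```shell
#!/bin/sh
# Hypothetical helper (not part of dedup): build a safe filename
# from a synopsis/title string passed as the first argument.
sanitise_name() {
    # 1st expression: replace '/', space and '&' with '_'.
    # 2nd expression: delete '?' outright so no trailing '_' is left.
    printf '%s\n' "$1" | sed -e 's|[/ &]|_|g' -e 's|[?]||g'
}

# Example call with a made-up title.
sanitise_name "Who Do You Think You Are?"
```

Deleting the question mark in a separate step keeps a title that ends in '?' from finishing with a stray underscore.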

Thanks for that, I'll add in your changes to the next version. I expect the prefix list will grow over time and need changing whenever there is a staff change at any channel!

This new script will already take directory names as arguments - it just defaults to the current directory if none are provided, so you should be able to use this in your cron entries if you want. The only issue at the moment is that it will rename things which are still recording (as will the shell script version) - I'll fix that in the next update though.
The webif update I just pushed out has the updates to dedup in it: the new CBBC prefix and removal of ? characters from filenames.
Are there any user-configurable options for using dedup via the webif? I've tried this option on a few folders over the years, but it always seems to behave the same way.

Dedup does indeed de-duplicate a folder full of recordings, but as well as renaming each file to the text in the description, it also changes the text in the Medialist Title, which means you lose the date and time a recording was made.

Is there any way to check for a duplicate recording based on the synopsis and, without renaming anything, merely move the duplicate recordings?
I think that everything Dedup does can now be replicated using Sweeper rules.

There is a predefined rule set available for dedup/tidy, so you could install that and then modify it to suit your needs: either delete the renaming rules or change them to keep the timestamp in the new file name.
I sometimes get remote scheduling matches where a programme is repeated at the weekend. Is there a way for sweeper to recognise a recording date as being a weekday or a weekend, or even a particular day of the week?

For example, some programmes get repeated at a different time on a different day, but the times may change.

As it is easy to work out what day of the week it is when creating a spreadsheet based on just a date, could that ease of use be migrated to sweeper?
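
To illustrate the arithmetic involved (this is not a sweeper rule, and nothing here is an existing sweeper feature), a day of the week can be derived from a date alone with a few integer operations. The sketch below uses Sakamoto's method in plain shell; the function names day_of_week and is_weekend are invented for the example, and it assumes Gregorian dates passed as year, month and day.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# day_of_week YYYY MM DD -> prints 0 (Sunday) through 6 (Saturday).
day_of_week() {
    y=$1
    m=${2#0}    # strip one leading zero so "08" is not read as octal
    d=${3#0}
    # Per-month offsets used by Sakamoto's method (January first).
    set -- 0 3 2 5 0 3 5 1 4 6 2 4
    # January and February count as part of the previous year.
    if [ "$m" -lt 3 ]; then y=$((y - 1)); fi
    shift $((m - 1))    # now $1 is the offset for month m
    echo $(( (y + y / 4 - y / 100 + y / 400 + $1 + d) % 7 ))
}

# is_weekend YYYY MM DD -> succeeds for Saturday or Sunday.
is_weekend() {
    case "$(day_of_week "$@")" in
        0|6) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example: 6 January 2024 was a Saturday.
if is_weekend 2024 01 06; then
    echo "weekend"
else
    echo "weekday"
fi
```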