Quick guide to Extract DVB-Subtitles from TS HD file and convert to SRT in minutes

Rockyrails

Hi guys,

I've been passively following this incredible forum for a long time and have finally signed up. Thanks for all the input and information - I have learned so much. I would like to give back a bit by showing how to quickly and painlessly extract subtitles from an HD TS file that has been decrypted and copied to a PC, and convert them to SRT or any other format you wish. DVB subs are extremely limited in terms of where they can be played back. I've been searching forever for a way to do this, but never found any concrete solutions... lots of suggestions that haven't worked, or incredibly elaborate solutions that require all kinds of software but also never seem to work. Today, I found the solution, which has been in front of me all this time. I have used an hour-long HD TS file decrypted and copied to my PC, which has been edited in VideoRedo so as to keep the DVB subtitles intact. Before you start:

(i) Download and install 'Subtitle Edit'.
(ii) Download 'Tesseract OCR' (tesseract-ocr-setup-3.02.02.exe) and install.
(iii) Copy & paste the 'tessdata' folder and 'tesseract.exe' file from C:\Program Files (x86)\Tesseract-OCR to
C:\Program Files (x86)\Subtitle Edit\Tesseract. Agree to move and replace the files of same name already
in folder.
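
For anyone who would rather script step (iii) than copy by hand, here is a minimal Python sketch. It assumes the two install folders named above; the function name is my own invention, and writing into Program Files will normally need an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_tesseract_files(tesseract_dir: Path, subtitle_edit_tesseract_dir: Path) -> None:
    """Copy tesseract.exe and the tessdata folder, replacing same-named files."""
    subtitle_edit_tesseract_dir.mkdir(parents=True, exist_ok=True)

    # copy2 overwrites an existing tesseract.exe in the target folder.
    shutil.copy2(tesseract_dir / "tesseract.exe",
                 subtitle_edit_tesseract_dir / "tesseract.exe")

    # dirs_exist_ok=True (Python 3.8+) merges into an existing tessdata
    # folder and replaces files of the same name.
    shutil.copytree(tesseract_dir / "tessdata",
                    subtitle_edit_tesseract_dir / "tessdata",
                    dirs_exist_ok=True)
```

With the default install locations from this guide, the call would be copy_tesseract_files(Path(r"C:\Program Files (x86)\Tesseract-OCR"), Path(r"C:\Program Files (x86)\Subtitle Edit\Tesseract")).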

Now we're good to go!

1. Here is my TS HD file with DVB Subtitles 'on' in VLC



2. Now open Subtitle Edit

upload_2015-10-10_17-31-43.jpeg

Drag your TS file onto the white area on the left side of the program, and wait between 30 seconds and a minute for file parsing to complete (my one-hour HD video took about 45 seconds to parse). Once completed, the program automatically opens the following window:

upload_2015-10-10_17-34-45.jpeg

Make sure to use the settings as pictured, with 'OCR via Tesseract' chosen in the top left, and the settings exactly as shown on the right side (you can play around with these when you feel more confident). Now click the 'Start OCR' button (do NOT click OK!) and let the program do its magic. My one-hour, dialogue-heavy TS file took 6 minutes to complete, and you can watch the OCR in action.

3. When the OCR has finished its work, you'll be left with something a bit like this:

upload_2015-10-10_17-42-4.jpeg

You will be amazed at the incredible accuracy of the OCR via Tesseract setting. It really is about 99.8% accurate! Some of the problems it encountered and attempted to fix can be found on the right side window, but as you scroll down and compare those lines with the fixed lines in the bottom left window, you'll see that most have been fixed automatically by the program. You could spend a couple of minutes clearing the odd error if you want. The top window shows you the original DVB subtitles if you need something to compare against. Now, click 'OK' in the bottom right of the screen.

4. The program returns to its original window, and you can see your shiny new subtitles in the left window,
just waiting to be edited and saved!

upload_2015-10-10_17-50-46.jpeg

I normally click 'Tools - Fix Common Errors - Next - Apply Selected Fixes - OK', and also under Tools, I click 'Merge Short Lines' and also 'Split Long Lines'. That cleans up the subs so I am ready to save them. Then, go to File, and Save in whichever format you prefer. Save the file in same place as TS file and give it the exact same name as your TS file. (Very important!!)

5. Once saved and closed, open Subtitle Edit again, and drag your new subtitle file in. It will automatically
load the video onto the right hand part of the screen:



Make sure you click the 'Adjust' tab on bottom left. Now you can play the video and alter the position of the subtitles if you need to. I only really use the 'Set start and Off-set the rest' button under 'Adjust'. If the original DVB Subtitles were correctly in sync with the TS file, then you probably won't need to change any positions. Now save again!

6. So in order to use your new subtitle file, you need to convert your TS file to MKV. I use MKV Merge for this.

upload_2015-10-10_18-10-31.jpeg

As shown, drag in your TS file and your new SRT file, and hit the 'Start Muxing' button. In a few seconds, you have an MKV file with subtitles which will play on a huge number of devices.
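
If you prefer the command line to the GUI, the muxing in step 6 can be sketched in Python as below. This assumes the mkvmerge executable from MKVToolNix is on your PATH; the function names are hypothetical, and the output is simply written next to the TS file with an .mkv extension.

```python
import subprocess
from pathlib import Path


def build_mux_command(ts_file: Path, srt_file: Path) -> list[str]:
    """Return an mkvmerge command that muxes the TS and SRT files into one MKV."""
    output = ts_file.with_suffix(".mkv")
    return ["mkvmerge", "-o", str(output), str(ts_file), str(srt_file)]


def mux(ts_file: Path, srt_file: Path) -> None:
    """Run mkvmerge and fail only on real errors."""
    result = subprocess.run(build_mux_command(ts_file, srt_file))
    # mkvmerge exits with 0 on success, 1 on warnings and 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError("mkvmerge reported an error")
```

For example, mux(Path("MyShow.ts"), Path("MyShow.srt")) would aim to produce MyShow.mkv in the same folder.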



All done! Please don't be put off by my instructions. With an hour of practice, you should be able to complete the entire process from start to finish in about 10 - 12 minutes (which includes the 6 minutes of OCR on the original TS file)! I have tested this on 4 of the 5 original terrestrial Freeview channels. Hope it helps.
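
For the curious, the timing adjustment in step 5 boils down to adding an offset to the timestamps in the SRT file. The sketch below is a simplification of what Subtitle Edit does: it shifts every cue by the same number of milliseconds (rather than only the cues from a selected line onwards), and the function name is my own.

```python
import re

# SRT timestamps look like 00:01:02,345
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(text: str, offset_ms: int) -> str:
    """Shift every cue in an SRT document by offset_ms (may be negative)."""

    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
        total = max(total, 0)  # never go before the start of the video
        h, rest = divmod(total, 3_600_000)
        m, rest = divmod(rest, 60_000)
        s, ms = divmod(rest, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for line in text.splitlines(keepends=True):
        # Only touch timing lines, not subtitle text that happens to contain a time.
        if "-->" in line:
            line = TIMESTAMP.sub(shift, line)
        lines.append(line)
    return "".join(lines)
```

A positive offset delays the subtitles and a negative one brings them forward; timestamps that would fall before zero are clamped to 00:00:00,000.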
 
Hi Black Hole, most subs can be extracted as files without OCR (e.g. from your typical mkv, mp4, m4v formats, etc.) but I think DVB subtitles in the form that Freeview provides them can only be done this way (is it something to do with being a bitmap?). Of course, I could be wrong... happens a lot :) ... but if anyone knows a proven simpler way to get those DVB subs, please share. I've searched high and low on forums across the net, and the only concrete information I've ever found is how to use a program called 'Project X' or something like that to extract subtitles from a broadcast Standard Def TS file, but it doesn't work with High Def files. With 'Subtitle Edit', the OCR engine is pretty fast and accurate beyond belief. In fact, there were virtually zero spelling mistakes.
 
If, as you say, the Freeview subtitling is sent essentially as an image rather than character codes, then yes, that explains the need for OCR. It's not something I have taken an interest in. I can see the advantages: there is no reliance on the receivers having specific character set tables installed to translate character codes to on-screen images, so it is simple to add new symbols or support new languages at any time. On the other hand, it costs more in transmission bandwidth.

The characteristics of the character images are absolutely regular, unlike the noise one would get in a scanned page of printed text, so it is not surprising that OCR is perfect (more surprising if it is less than perfect). What surprises me more is the accuracy of the speech recognition that generates subtitles for live programmes.
 
The surprise is why so little software can display DVB Subtitles. It's an international standard for crying out loud!
 
Hi Rockyrails,

I see it is over a year since you started this thread, but I have been trying to replicate your method for extracting DVB subtitles from my Humax HDR FOX T2 without success.

I have tested this method on TS files from other sources and it works really well, but trying decrypted TS files from my HDR FOX T2 I always get a message from Subtitle Edit stating 'No subtitles found' after parsing on every one of them.

What exact settings and output did you use in VideoRedo?

From your post it seems that you decrypted and transferred from your recorder, but was that an HDR FOX T2?
 
If I play a decrypted HDR Fox T2 ts file with VLC I can turn on the DVB subtitles and they display correctly. So they are in there. You could try your file in VLC to check if the subtitles are present.
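
Another way to check, besides VLC, is to list the subtitle streams with ffprobe. Nobody in this thread has mentioned ffprobe, so treat this as an assumption on my part: the sketch below needs FFmpeg installed and on your PATH, and the function names are hypothetical. A DVB subtitle stream should be reported with the codec name dvb_subtitle.

```python
import json
import subprocess


def ffprobe_subtitle_command(path: str) -> list[str]:
    """Build an ffprobe command that lists only subtitle streams as JSON."""
    return ["ffprobe", "-v", "error",
            "-select_streams", "s",
            "-show_entries", "stream=index,codec_name",
            "-of", "json", path]


def parse_subtitle_streams(ffprobe_json: str) -> list[tuple[int, str]]:
    """Return (stream index, codec name) pairs from ffprobe's JSON output."""
    data = json.loads(ffprobe_json)
    return [(s["index"], s.get("codec_name", "unknown"))
            for s in data.get("streams", [])]


def list_subtitle_streams(path: str) -> list[tuple[int, str]]:
    """Run ffprobe on a file and return its subtitle streams."""
    result = subprocess.run(ffprobe_subtitle_command(path),
                            capture_output=True, text=True, check=True)
    return parse_subtitle_streams(result.stdout)
```

An empty list from list_subtitle_streams("recording.ts") would suggest that ffprobe found no subtitle stream in the file.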
 
I've just followed these instructions for a .REC file from a Topfield, not an HD file though. They worked fine, very very quick! Impressed. I used to use Project X but found this to be very rewarding. Thanks.
 
Hello, I have belatedly found this thread and discussion on dvbsub to srt.
To begin with, the dvbsub file is a stand-alone file in a group of others from what was a live presentation.

My initial question not knowing where to begin is:

Do I follow your instructions in Subtitle Edit with just the dvbsub file loaded,
using File/load in the usual manner?

Or some other starting point?

I have completed the initial software installations and instructions for tesseract and Subtitle Edit.

I am subscribed for email notifications.

with appreciation
 
Progress today.

I had a newer TS file with dvb subs in it and am getting a result.

I was not waiting long enough at first for the parse routine to complete. I have prepared to print the guide and screen shots.

Will see how that goes shortly.

If those dvb files are just listed as subs separately, is there a method for doing them anywhere?
 
I tried this, but I also get "subtitles not found". They display fine in VLC player.
When I open the ts file in the normal way (not dragging it onto the workspace) it is loaded, but I only get to see a small part of the video, and the window you describe does not open.
I copied the tesseract files to the Subtitle Tesseract folder.
What can I do?

Johanna
 
I thought I'd take a few minutes to revisit this old thread. I've had a whole series of issues with Subtitle Edit and have only recently taken the trouble to get them sorted. The procedure described below may seem convoluted, but once everything's up and running it's actually very quick.

1) Install Subtitle Edit (currently 3.5.7) and get it running. I used the portable version, which comes complete with Tesseract for doing the OCR. If SE complains about missing codecs, there are various suggested solutions, not all of which seem to work. I installed Media Player Classic Home Cinema and pointed SE at it via Options-Settings-Video Player, after which everything worked fine. (Not so using the VLC option :( )

2) If you haven't already got it, install MKVToolnix GUI. SE doesn't like large .ts files with embedded DVB subs: MKVToolnix provides a simple workaround.

3) Download your decrypted .ts file from the Hummy to your PC, using whatever method you prefer.

4) Open the .ts file in MKVToolnix and deselect all streams except the DVB subtitles:
1540902300239.png
Hit the Start multiplexing button. Within a few seconds this will produce a file with a .mks extension. These are your subtitles, but they still need to be converted to text that can be saved as a .srt subtitle file.

5) Open the .mks file in Subtitle Edit. It will immediately open in the OCR screen. Just click Start OCR and you're off. I won't describe the OCR process in detail because it's pretty intuitive. You'll be prompted to decide what to do about doubtful words, including names and British English spellings.

6) When finished, you may want to remove some lines at the start and end that belong with the preceding or following programme. The subtitles should already be in sync with the video in the .ts file, but if you subsequently edit this file in something like VideoRedo or Avidemux, you will need to adjust the timings in the subtitle file to match: this can be done very easily in SE. Note that editing a .ts file in Avidemux will remove any DVB subtitles present, so you must extract the subs from the original unedited .ts file.

7) Save the subtitles in .srt format. To be detected automatically, they need to have the same filename (other than the extension) and be in the same folder as the video file.
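
Steps 4 and 7 can also be scripted. The sketch below is only an illustration, assuming mkvmerge (part of MKVToolNix) is on your PATH; the function names are made up. It builds the command that keeps the subtitle tracks and drops video and audio, and works out the .srt filename a player would look for next to the video.

```python
from pathlib import Path


def build_subtitle_only_command(ts_file: Path) -> list[str]:
    """Return an mkvmerge command that writes only the subtitle tracks to a .mks file."""
    output = ts_file.with_suffix(".mks")
    # --no-video and --no-audio apply to the input file that follows them.
    return ["mkvmerge", "-o", str(output),
            "--no-video", "--no-audio", str(ts_file)]


def sidecar_srt_path(video_file: Path) -> Path:
    """Same folder and same name as the video, with an .srt extension."""
    return video_file.with_suffix(".srt")
```

The command list can be passed to subprocess.run. Bear in mind that mkvmerge exits with 1 for warnings and 2 for errors, so a non-zero exit code does not always mean failure.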

I did say it would sound convoluted and I confess that it's a bit nerdy. However, do it a couple of times and it's really quite straightforward and produces accurate subtitles for use on other platforms. Remember that subtitles for most TV dramas can be downloaded from a number of sites (Subscene is very good) though Subtitle Edit will still prove useful to synchronise downloaded subs with your Hummy files.
 
You'll be prompted to decide what to do about doubtful words, including names and British English spellings.
I wonder why that is. Is it not the case that the .srt simply contains a transcription of the rasterised subtitles from the subs stream? In which case, the OCR only needs to recognise each letter and spelling should be irrelevant.
 
The OCR process also spell-checks the text. The default dictionary is US English so it asks for confirmation where words occur with British spellings.
 
In all OCR that I've used some errors occur, typically in the recognition of certain combinations of letters, with some fonts being more susceptible than others. OCR always includes a spell check so that these can be corrected. The spell check can be turned off, or additional dictionaries can be loaded to avoid problems with variant English spellings. Corrections can also be saved to a user dictionary, so British spellings are recognised once they've been identified once.
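
To illustrate the idea of a user dictionary (this is a toy sketch, not how Subtitle Edit implements its spell check, and the function name is hypothetical): a word is only flagged as doubtful if it is in neither the base dictionary nor the user dictionary.

```python
import re


def doubtful_words(text: str, base_dictionary: set[str],
                   user_dictionary: frozenset[str] = frozenset()) -> list[str]:
    """Return words found in neither dictionary, in order of first appearance."""
    known = {w.lower() for w in base_dictionary} | {w.lower() for w in user_dictionary}
    flagged: list[str] = []
    for word in re.findall(r"[A-Za-z']+", text):
        if word.lower() not in known and word not in flagged:
            flagged.append(word)
    return flagged
```

With a US English base dictionary, "colour" would be queried the first time; once it has been added to the user dictionary it is no longer flagged.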

It's worth saying that getting srt subtitles from a DVB source is only useful very rarely, given the availability of subs online. Subs for BBC programmes can be obtained using get_iPlayer.
 
In all OCR that I've used some errors occur, typically in the recognition of certain combinations of letters, with some fonts being more susceptible than others. OCR always includes a spell check so that these can be corrected.
That's fair enough... on scans of printed text. In this case, the input is clean and every example of the same character should be identical. I may be missing something, but it doesn't seem too hard to me.

Maybe, instead of optimising the software for this situation, they've simply used some stock OCR.
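
To make the point concrete, here is a toy sketch of what recognition by exact matching could look like if every instance of a character really were pixel-identical. It is purely illustrative: the glyph representation and names are invented, and it ignores real-world complications such as splitting a subtitle image into separate characters, anti-aliasing and touching letters.

```python
# A glyph is represented here as a tuple of pixel rows ('#' = ink, '.' = background).
Glyph = tuple[str, ...]


def recognise(glyphs: list[Glyph], known_glyphs: dict[Glyph, str]) -> str:
    """Look each glyph up by exact match; unknown shapes come back as '?'."""
    return "".join(known_glyphs.get(glyph, "?") for glyph in glyphs)
```

Each new shape would only need to be identified once and added to the table, after which every later occurrence is a plain dictionary lookup.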
 