Filename detection algorithm

Dec 30, 2008 at 4:39 AM

source: http://bbs.wasnotwas.com/viewtopic.php?f=8&t=139&start=90#p1122

Re: TVScout - metadata fetcher

Postby FreQi on Wed Nov 12, 2008 8:38 pm

I've made a couple changes to the "episode discovery" bits of TVScout in the SVN last night, and after reading shado's post about his problems with some south park episodes, I realize now that the changes I've made won't entirely fix that. So I am back to the drawing board, again, to search for a better solution. So I'm just going to think out loud here. Maybe someone can chime in with a gem of wisdom for me.

To start, let me spell out the way that TVScout identifies episodes. It looks in the Series/Show folder for video files and then uses a series of patterns in a "first match wins" style to try and discover a season and episode number. It goes like this...

    First, it looks for "SAAEBB" somewhere in the file name. It can be anywhere (beginning, end, somewhere in the middle) and the numbers (AA and BB) can be one or two digits (I want to allow BB to be up to three digits). If found, "AA" is captured as the Season and "BB" is captured as the Episode. Pretty straightforward. All other patterns are then skipped if the Episode is discovered.

    Second, it looks for "AAxBB" somewhere in the file name, same as above. AA and BB can be one or two digits (again, I want to allow BB to be up to three digits), and case is ignored so the x can be X too. Again, if the Episode number is found, all other patterns are skipped.

Now this is where things start getting sticky.

    Third, it checks to see if the filename STARTS WITH three or four digits and assumes a pattern of ABB or AABB. This probably doesn't help much since most people probably (hopefully?) name their files like one of the two patterns above, or they only lead the file names with the episode number (no season), or like shado mentioned earlier, the AABB pattern is not at the beginning of the file name. So I can change this to match 3 or 4 digits anywhere in the filename, but that could cause some serious problems. Particularly if/when people use this next pattern...

    Fourth, if the episode number still hasn't been found, it checks to see if the file begins with two digits (BB) and assumes that is the episode number. It's important to note that this is the default pattern that TVScout uses to rename files too ("BB - Episode.avi") so it is really important that this pattern is discoverable. A show like "24" can completely get hosed by this if it hasn't been matched to an earlier pattern and the files start with the series name.

    Finally, if an episode number is still not found, TVScout abandons the file, logs it as "Episode not Identified" and moves on to the next file. Otherwise, if it has found the number, it does some additional sanity checks before fetching the metadata (stuff like: is this file in the right season folder? Does this episode number exist for this season?).
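The cascade described above can be sketched roughly as follows. This is only an illustration in Python, not TVScout's actual code: the regular expressions, the case-insensitivity of the "S"/"E" letters, and the `identify` function name are all assumptions made for the example.

```python
import re

# Patterns are tried in order; the first one that matches wins.
# Each entry: (compiled regex, whether the pattern captures a season).
# These regexes are assumptions reconstructed from the prose description.
PATTERNS = [
    (re.compile(r"s(\d{1,2})e(\d{1,2})", re.IGNORECASE), True),  # SAAEBB, anywhere
    (re.compile(r"(\d{1,2})x(\d{1,2})", re.IGNORECASE), True),   # AAxBB, anywhere
    (re.compile(r"^(\d{1,2})(\d{2})(?!\d)"), True),              # ABB / AABB at the start
    (re.compile(r"^(\d{2})"), False),                            # BB at the start, episode only
]

def identify(filename):
    """Return (season, episode); season is None when only an episode was found.

    Returns None when no pattern matches ("Episode not Identified").
    """
    for regex, has_season in PATTERNS:
        m = regex.search(filename)
        if m is None:
            continue
        if has_season:
            return int(m.group(1)), int(m.group(2))
        return None, int(m.group(1))
    return None
```

With these assumed patterns, a file named "24 0104 Day 1.avi" falls through to the fourth pattern and comes back as episode 24 with no season, which is the kind of mix-up a show like "24" runs into.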

So for me, the real problem is with pattern3 because there is nothing to help identify a season and episode number from a plain block of digits. If I were to modify pattern3 to look for "ABB" anywhere in the file, this will adversely affect people who are actually using the fourth pattern to name their files. Why? Because TVScout would quickly start finding episode names that have three and four digits in them and assume that it's the Season/Episode character block from pattern3 and never make it to pattern4.

Think about episodes like South Park s1e04. If a user lets TVScout rename their files to the default "BB - Episode" convention, then this episode would be named "04 - Weight Gain 4000.avi". On subsequent runs of TVScout, it'll run through pattern 1 and not find SAAEBB, then pattern 2 and not find AAxBB. Then pattern 3 will find "4000" and identify it as "Season 40, Episode 00". At which point it'll skip pattern 4.
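To make the failure concrete, here is a small Python sketch of what a "pattern 3 anywhere" rule might look like. The regex is an assumption (a standalone run of three or four digits, read as ABB or AABB), and `pattern3_anywhere` is a hypothetical name, not something from the TVScout source.

```python
import re

# Hypothetical "pattern 3 anywhere": a standalone block of 3 or 4 digits,
# interpreted as ABB or AABB wherever it appears in the file name.
ANYWHERE = re.compile(r"(?<!\d)(\d{1,2})(\d{2})(?!\d)")

def pattern3_anywhere(filename):
    """Return (season, episode) from the first 3-4 digit block, or None."""
    m = ANYWHERE.search(filename)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

Calling `pattern3_anywhere("04 - Weight Gain 4000.avi")` returns `(40, 0)`: the leading "04" is too short to match, so the first hit is the "4000" in the episode title.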

So the first question is why not swap the order of the patterns and search for "BB - Episode" and then "Series ABB Episode"? Sure, but now a series like '24' would have every episode get ID'd as episode 24 when named like "24 AABB Episode". So that's no good.

The only thing I can think of is do some more elaborate pattern searches, like looking for "BB - " or specifying some "presumed pattern" preferences or something. In the end, it all sounds like a hell of a lot of work beyond what we've already done so that people can have sloppy or lazily named files. I would rather take the stance that some of the responsibility lies with the users to make sure that their collections have some predictability in their file names. Meaning, do some pre-scrubbing to ensure a pattern match. Add an "x" between the season and episode numbers, run the scan and you'll be all good.
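As a rough idea of what one such "more elaborate" search could be, the Python sketch below only accepts a leading episode number when it is followed by the " - " separator that the default rename convention produces. The regex and the `leading_episode` name are assumptions for illustration; a file deliberately named "24 - something" would still be ambiguous.

```python
import re

# Assumed stricter form of pattern 4: two digits at the very start,
# immediately followed by " - " as in the default "BB - Episode.avi" naming.
LEADING_EPISODE = re.compile(r"^(\d{2}) - ")

def leading_episode(filename):
    """Return the episode number for "BB - Episode" style names, else None."""
    m = LEADING_EPISODE.match(filename)
    return int(m.group(1)) if m else None
```

Checked first, this would catch "04 - Weight Gain 4000.avi" as episode 4 while letting "24 0104 Day 1.avi" fall through to the other patterns.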

Also, to prevent a catastrophe like accidentally letting TVScout rename all your files that it might falsely identify, I've also added a cheap "undo" in my last SVN check-in. It creates a batch script in the root of the series folder that will undo any file renaming and file moves that TVScout last did to that series. From some preliminary testing it seems to be pretty effective. I decided to do it that way because sometimes you might not notice that TVScout incorrectly renamed a file for you and you might have already closed the application, losing any history of the change. With the .cmd file, you can revert these changes well after the fact. The only down-side is it will orphan the metadata.
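The undo idea can be sketched like this in Python. It is a guess at the general shape only: the `build_undo_script` name and the exact script contents are assumptions, and the real TVScout output may look quite different.

```python
def build_undo_script(moves):
    """Build the text of a .cmd file that reverses a list of renames/moves.

    moves: list of (original_path, new_path) tuples in the order they were
    applied. They are undone in reverse order so chained moves unwind cleanly.
    """
    lines = ["@echo off"]
    for original, new in reversed(moves):
        # "move" is the cmd.exe built-in; quoting handles spaces in paths.
        lines.append(f'move "{new}" "{original}"')
    return "\r\n".join(lines) + "\r\n"
```

The returned text would then be written to a .cmd file in the series folder; nothing here touches the metadata files, which matches the note about them being orphaned.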

So, all that being said, I will continue to tweak the episode discovery code, and I am open to discussion on how to better attack the problem (I'll probably be making updates to the code shortly, like making the SAAEBB and AAxBB patterns one step and letting the episode number reach three digits). I don't want anyone to walk away from this thinking I am being a big jerk or something telling people to clean up their messy file collections, because I think TVScout is a great way to get your stuff organized. But at the same time, I think a little pre-screening of the naming of your collections will go a long way towards the smooth operation and use of your Media...well beyond the scope of TVScout.
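A merged single-step version of the first two patterns, with the episode allowed up to three digits, might look something like the Python sketch below. The regex and the `season_episode` name are assumptions for illustration, not the code that was eventually checked in.

```python
import re

# One regex covering both SAAEBB and AAxBB, case-insensitive,
# with the episode number allowed to run to three digits.
MERGED = re.compile(
    r"s(?P<s1>\d{1,2})e(?P<e1>\d{1,3})"
    r"|(?P<s2>\d{1,2})x(?P<e2>\d{1,3})",
    re.IGNORECASE,
)

def season_episode(filename):
    """Return (season, episode) from an SxxEyy or AAxBB block, or None."""
    m = MERGED.search(filename)
    if m is None:
        return None
    # Only one branch of the alternation can have matched.
    season = m.group("s1") or m.group("s2")
    episode = m.group("e1") or m.group("e2")
    return int(season), int(episode)
```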
FreQi
 
Dec 30, 2008 at 1:47 PM
Edited Dec 30, 2008 at 7:01 PM

Hi,

I've been using a naming scheme "Seriesname (S)S.EE - Episodename" for years, e.g. "Stargate Atlantis 1.04 - blablabla blablabla" or "Stargate SG-1 10.02 - Blablabla blablabla".
With "(S)" I mean that the first Season digit is not used when it's zero, i.e. I use 1.02 (not 01.02), 5.22 (not 05.22), 10.06, 11.20.
Problem is, TVScout 0.8 messes up SG1's season 10: it sees every episode as "10.01 - The Quest Part I".
I found a workaround, which is changing the numbering to 10x04 etc., but wouldn't it be possible to allow the "." as a separator between the season number and the episode number?
I'm certainly not the only one using this; I've seen this numbering scheme on many DVD series collections... and it looks nicer than using the "x" or "e" as a separator.

Moreover, any episode name that includes a number (e.g. "300", "2001", "2010"), whether on its own or not, causes TVScout to fail to recognize the episode.
Workaround: put a space between each digit. Maybe this is something that's been fixed in later TVScout/MediaScout releases though.... as the usual forums aren't accessible, I can't look around for more info.

And lastly, I can't get any of the more recent betas to work. I've tried compiling various releases. In the best case, they run, but fail to recognize any episode of any series I've got (while TVScout 0.8 works for the same files). In the worst case, they crash every time, or the "fetch" button is permanently disabled (greyed out), then sometimes mysteriously becomes available for no clear reason after randomly changing some options, etc. etc.

EDIT: OK apparently I need to clear things up, so I've done a build-by-build comparison starting from the "released" TVScout 0.8 and up to the most recent MediaScout build. Here's what I've found.
First, the file & folder setup:

D:\
        TV SERIES
                Stargate Atlantis
                            Season 2
                                - Stargate Atlantis 2.10 - The Lost Boys [c].avi
                                        Extras
                                            - Stargate Atlantis 2.Extra - Mission Directive - The Intruder.avi
                                            - Stargate Atlantis 2.Extra - Profile - David Hewlett.avi
                Stargate SG-1
                            Season 5
                                - Stargate SG-1 5.02 - Threshold [c].avi
                                - Stargate SG-1 5.10 - 2001 [c].avi
                                        Extras
                            Season 10
                                - Stargate SG-1 10.05 - Uninvited [c].avi
                                - Stargate SG-1 10.06 - 200 [c2].avi
                                        Extras
                                                - Stargate SG-1 10.Extra - The Ori - A New Enemy.avi
                                           
I've only provided a few sample filenames here. In case you wonder, the trailing '[c]' means that I've muxed the Commentary track from the DVD as a 2nd audio track into the file. Each season has an 'Extras' folder in which I've put the specials and DVD extras that were released for that season. The extras are not numbered, so I use "Seriesname (S)S.Extra" as a file prefix. All files have a matching .SRT file located in the same folder.

 

1. TVScout Build 0.8 "Released":

Recognizes (S)S.EE episodes correctly, EXCEPT in either of the following cases:

seasons numbered 10 and above (all other seasons are fine)
episodes with numbers in the episode name

It also does not look into the "Extras" folder.

Partial log:
 
Processing 'Stargate SG-1 10.03 - The Pegasus Project [c].avi'
S10E10 Quest (1)
Processing 'Stargate SG-1 10.04 - Insiders [c].avi'
S10E10 Quest (1)
Processing 'Stargate SG-1 10.05 - Uninvited [c].avi'
S10E10 Quest (1)
Wrong Season (2): 'Stargate SG-1 10.06 - 200 [c2].avi'
Processing 'Stargate SG-1 10.07 - Counterstrike [c].avi'
S10E10 Quest (1)
Processing 'Stargate SG-1 10.08 - Memento Mori [c].avi'
S10E10 Quest (1)
Processing 'Stargate SG-1 10.09 - Company of Thieves [c].avi'
S10E10 Quest (1)
Processing 'Stargate SG-1 10.10 - The Quest Part I [c].avi'
S10E10 Quest (1)

 

2. TVScout Build 17767:

does not recognize any episode using (S)S.EE numbering syntax

does not look into the "Extras" folder

Partial log:

Processing 'Stargate Atlantis'
Fetching Series Banner
Found Season 2 (Season 2)
Processing Season 2
Episode not Identified: 'Stargate Atlantis 2.01 - The Siege Part III [c].avi'
Episode not Identified: 'Stargate Atlantis 2.02 - The Intruder [c].avi'
Episode not Identified: 'Stargate Atlantis 2.03 - Runner [c].avi'
Completed Stargate Atlantis
Scanning Stargate SG-1...
Querying TV ID for Stargate SG-1
Fetching Metadata
Metadata retrieved. Processing...
Caching Metadata
Processing 'Stargate SG-1'
Found Season 10 (Season 10)
Found Season 5 (Season 5)
Processing Season 10
Episode not Identified: 'Stargate SG-1 10.01 - Flesh and Blood [c].avi'
Episode not Identified: 'Stargate SG-1 10.02 - Morpheus [c].avi'
Episode not Identified: 'Stargate SG-1 10.03 - The Pegasus Project [c].avi'
Episode not Identified: 'Stargate SG-1 10.04 - Insiders [c].avi'
S10E10 Quest (1)
 

If, however, I rename all files to follow the (S)SxEE number syntax, everything works fine.

3. TVScout Build 17778: same as 17767

4. TVScout Build 17907: same as 17767
 

Note: "Movies" tab does nothing, no Fetch occurs (pointing to a single folder containing a number of different movies)
 

5. MediaScout Build 18257: same as 17907
 

'Browse' button in the 'TV' tab doesn't work
'Fetch Data' button in the 'TV' tab initially greyed out / unavailable (while TV folder pointing to 'd:\series' and batch processing enabled)
'Fetch' in the 'Manage TV' tab doesn't work either, but 'Change Poster' does.
'Manage TV' tab contents don't refresh when changing path through 'Options' tab (must close the app)
'Fetch Selected' in the Movies tab (which is empty, actually) causes app crash
App crashes when starting up if TV folder pointing to the root of my E:\ drive (fixed by editing user config file)
Changed TV folder to the "Stargate SG-1" folder and turned off batch processing: no change

6. Build 20202:

crashes every time I try to 'fetch' through the TV tab (when the button works, which is not every time)

7. Custom Build based on 17907:
I've now taken the build 17907 source and looked in ProcessMetatada.cs to see whether I could tweak the code to accept '.' as a separator between season number and episode number. I've changed the following lines:

LINE 117:
Match m = Regex.Match(fiRoot.Name, "(?<se>[0-9]{1,2})x(?<ep>[0-9]{1,2})|(?<se>[0-9]{1,2}).(?<ep>[0-9]{1,2})|S(?<se>[0-9]{1,2})E(?<ep>[0-9]{1,2})|(?<se>[0-9]{1})(?<ep>[0-9]{2})", RegexOptions.IgnoreCase);

LINE 224:
Match m = Regex.Match(fi.Name, "(?<se>[0-9]{1,2})x(?<ep>[0-9]{1,3})|(?<se>[0-9]{1,2}).(?<ep>[0-9]{1,3})|S(?<se>[0-9]{1,2})E(?<ep>[0-9]{1,3})", RegexOptions.IgnoreCase);
 

Unfortunately, it doesn't work.
In the sample directory structure (way) above, Stargate Atlantis is processed correctly, but Stargate SG-1 (both seasons 5 and 10) returns "wrong season (1)" errors. Here I'm stuck, as I have no idea how it's possible that it gives this error for SG-1 S5 and not for Atlantis...

Would you have any idea how to fix this, or is there a fundamental problem with using "." as a separator here?
If there's no other solution, I'll just have to start renaming my whole collection... :-)
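One possible explanation, offered as a guess rather than a confirmed diagnosis: in a regular expression an unescaped "." matches any character, including a space. With the modified pattern, the "1" in "SG-1" followed by a space and the "5" of "5.02" would already satisfy the new branch, so the file would be read as season 1 rather than season 5. "Stargate Atlantis" has no digit before the season number, which would explain why it is unaffected. The Python sketch below mirrors the LINE 224 pattern to show the difference; Python does not allow the same group name twice, so the groups are numbered (se1, se2, ...), and the `build` and `parse` helpers are hypothetical names for this example only.

```python
import re

def build(separator):
    """Rebuild the LINE 224 alternation with the given separator regex."""
    return re.compile(
        r"(?P<se1>[0-9]{1,2})x(?P<ep1>[0-9]{1,3})"
        r"|(?P<se2>[0-9]{1,2})" + separator + r"(?P<ep2>[0-9]{1,3})"
        r"|S(?P<se3>[0-9]{1,2})E(?P<ep3>[0-9]{1,3})",
        re.IGNORECASE,
    )

UNESCAPED = build(".")    # as posted: "." matches ANY character
ESCAPED = build(r"\.")    # candidate fix: matches a literal dot only

def parse(pattern, filename):
    """Return (season, episode) from whichever branch matched, or None."""
    m = pattern.search(filename)
    if m is None:
        return None
    for i in (1, 2, 3):
        if m.group(f"se{i}") is not None:
            return int(m.group(f"se{i}")), int(m.group(f"ep{i}"))
    return None
```

For 'Stargate SG-1 5.02 - Threshold [c].avi', `parse(UNESCAPED, ...)` gives `(1, 5)` while `parse(ESCAPED, ...)` gives `(5, 2)`. In the C# source the equivalent change would be writing the separator as an escaped dot inside the pattern string, though whether that is the whole story for build 17907 is not something this sketch can confirm.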

P.S. keep up the excellent work, great tool !!

 

Dec 31, 2008 at 12:05 AM
Hi again,
another problem is the handling and naming of the "specials" or extras.
Some specials are included in theTVdb, e.g. Stargate SG-1 "From Stargate to Atlantis".
Then, strangely, "Continuum" and "Ark of Truth" have also been shoved into this category.
But now, how do we need to name the episodes for the matching to work?
The only syntax I could find that works is as follows (using build 17907):
'02 - From Stargate To Atlantis - A Sci-Fi Lowdown Part I.avi'
... so I've had to drop the series and season reference at the front of the filename or it wouldn't work.
Worse, unless I put the file in a folder "Season 0" myself, it's still not recognized. So "move unsorted files" doesn't work. I get the error message:
"Season unknown: '02 - From Stargate To Atlantis - A Sci-Fi Lowdown Part I.avi'"

And finally, and most worrying, apparently none of the builds scans any file in the series folder itself, or in a subfolder of a season folder.
E.g. I'd put "Continuum" in the "Stargate SG-1" 'root' folder (as it's not really any season) and the specials in a folder "Extra" under the "Season" folder.
None of these get scanned.
Using "Season 0" is a possibility, of course, although sometimes annoying because:
1) often specials are for / about events in a specific season; wouldn't it be better to have a way to keep the specials with the season to which they refer, either in the season folder or in a subfolder?
2) having the Stargate movies in a "Season 0" folder isn't very nice.
I must assume that the same problem would arise with the movies from Babylon 5, or with the "ending" movie for FarScape, The PeaceKeeper Wars...
I guess we're probably dependent on how TheTVDB.com decides to name the "season" to which it assigns specials and spin-off movies?
Jan 3, 2009 at 12:17 AM
Edited Jan 3, 2009 at 1:24 AM
Hi aeoth,
thanks for the reply.

First of all: I hope you get better soon, and all the best for 2009.

Well, in the end I've renamed all my files to follow the syntax: "Series Name SSxEE - Episodename [extrastuff]" which works perfectly with build 17907 of TVScout. It also solves the problem with numbers in episode names -- no problem when the SSxEE syntax is used there (handles "Stargate SG-1 10x10 - blablabla" perfectly).
Just in case you wonder why I always put the full series name in every file: it's so that I get better results when doing searches using e.g. Windows Search or any other tool that returns the file name. Just getting "2. episode name" isn't very convenient when searching your whole collection. You'd want the series name and season as well. But again, this doesn't seem to be a problem with any file I've got using the SSxEE syntax.

No idea why the same "SSxEE" syntax didn't work for me for the "season 0" items of Stargate SG-1 before, but now it does (go figure), so I've applied that to all the Season 0 items of all series that I had just been changing to "EE - episode name"... oh well as long as it works :-)

I've been looking around @ thetvdb.com to find out how they handled a number of shows, and obviously the "season 0" trick is a good compromise.

Well, thanks and good luck !