One renaming to rule them all2021-11-20
Often in my day to day life I am faced with a list of (totally legal) files that look like this:
[VOSTFR].Some.Random.Serie.s1e1.(MULTi).1080p.x264.mkv
[VOSTFR].Some.Random.Serie.s1e2.(MULTi).1080p.x264.mkv
[VOSTFR].Some.Random.Serie.s1e3.(MULTi).1080p.x264.mkv
...
Or maybe a camera that's feeling very american yields me pictures with the format month-day-year_number.jpg
And whenever that happens to me I'm like "urgh, I wish these were named differently but I'm too lazy to regex it so I'll just leave them be.". Or sometimes, when there's a small number of them, I'll rename them by hand and wonder "is there really no simple tool that can do it for me ?". The pattern in the file names is usually very obvious, with some parts that are fixed and some that are variable (like episode/season numbers, or dates).
I felt like it should be simple for a program to:
- automatically identify those fixed and variable parts
- provide you a template name in the form of:
> [VOSTFR].Some.Random.Serie.s/a/
e/b/
.(MULTi).1080p.x264.mkv
with/a/
and/b/
representing variables in the file names- allow you to rename it to something like:
> Some Random Serie - S/a/
E/b/
.mkv- apply this renaming on files that follows this template
This hypothetical program would allow anyone to rename a batch of files without using regexes! So I looked for it, thinking this idea was too simple to not have already been implemented.
I couldn't find it.
Sure, there was some batch renaming freewares with tons of functionnalities, but as far as I could see, all of them used Regex, and where way too complex for my purposes. I had a eureka moment when I found the native Windows feature to rename multiples file, quickly followed by disappointement when I realised that it was just giving every file the same name succeeded by a (n) denoting the number.
Thus, in true programmer fashion, I spent 50+ hours making this program just to save myself a dozen seconds of renaming đ.
Going back to the rough plan I outlined, the first and biggest obstacle for our magical program is to automatically identify the fixed and variable parts given a list of file whose names follow a pattern. For the first version, I made a simplifying assumption: assume that all variables in file names are numbers. This is mostly true and covers our base case of episode/season numbers and dates.
With this supposition, the code to identify the fixed and variable parts of file names is simple:
I used Python for that, and made a basic UI with Tkinter to display the template name that was extracted, as well as a text field to edit the template:
I had no screenshot of this version so I had to draw it, sorry đ
The variables are represented with /a/ and /b/, I chose to use slashes to denote variables because it's a forbidden character in file names so there can be no confusion.
The user can then edit the template name, and every file is renamed following the new template, replacing the variables with their original value.
This worked pretty well, I made a cheesy little showcase here, made an .exe with PyInstaller and posted it to reddit, satisfied.
But of course that wasn't the end of it đ. The assumption that all file name variables are numbers broke a use case that I kept running into: it doesn't work when episode titles are in the file names. Additionnaly, a redditor suggested that the program let us edit a real title instead of a template and I liked the idea so a year later I started working on a second version!
So in this version, we want the program to be able to recognize title variables such as:
[VOSTFR] Some Random Serie - S1E01 - The Beginning (MULTi) 1080p x264.mkv
[VOSTFR] Some Random Serie - S1E02 - Trouble starts (MULTi) 1080p x264.mkv
[VOSTFR] Some Random Serie - S1E03 - New Player (MULTi) 1080p x264.mkv
...
But on the other hand, we don't want our program to associate unrelated files together just because they have words in common:
Big Game Hunter - meme.png
Game_of_life_installer.msi
Game of throne - a feast for crows.epub
...
So there is a "sensitivity" balance in what gets recognized as a title variable and what is not. Unrelated but I had to mention it somewhere: I wrote this version in Rust because that's a cool language I'm learning
In order to solve this new problem, I changed the template extraction part to something like this:
Slap some unit tests for debugging after some tweaking it did work!
Now the other big change left to do is allow the user to edit a real file name instead of a template with variables. The first idea that came to my mind was:
However, in a file name such as:
[1080p] Some Random Serie - S1E1 - The Beginning (MULTi.VOSTFR.x264).mkv
If the folder only contains season 1 (as if often the case), the 2 variables are "1" for the episode and "The Beginning" for the title. The title is fine but the value "1" appears 3 times in the file name! So if the user renames this file to this:
Some Random Serie - S1E1 - The Beginning.mkv
the program has to correctly deduce that the "1" coding the episode number is the 2nd one even though it was the 3rd one in the old file name! Simply assuming that the first "1" we see is the episode number won't cut it.
Thankfully, I came up with a relatively simple solution that works well in our case: taking the "fingerprints" of the variables, a count of all the characters that come before the variable value in the old file name. In the new file name we compare this print to the prints of every instance of the value ("1" in our example), and decide that the closest match is our variable.
closest match is based on Manhattan distance here but it's an implementation detail
In our example, the 1 that will be interpreted as our variable in the new file name is the correct one (the 2nd), because if you look at all the characters before it, it's the closest match for all the characters that came before our variable in the old file name, it is only missing the characters for "[1080p]" but so is the 1 from "S1", which is also missing the "E" of "E1".
Of course the fingerprint match is a heuristic, the user is the only one to know what they meant with their file renaming, but it does work well in practice and requires very little computation.
So this is more or less how the second version works. Now here's the cool demo showcasing its power đ
In this new interface I added a live preview of what the new titles would become if the renaming was applied, very useful when the program does not work as intended. đ
More importantly, I'm able to do something that wasn't possible before! In this demo there's 2 episode titles, the first one is in french some of the time (idk why), and the 2nd one (in parenthesis) is always in english. Both are recognized as title variables, and in the demo I'm able to get rid of the first one successfully!
I'm now really happy with the results and can switch on to maintenance mode for this project, thanks for reading :)