bushi blog Let's all love Lain

Subtitle Website Generation

Stable Diffusion, "neon genesis evangelion eva subtitle file"

This is the process I went through to view subtitles on a website, generated from an srt file. I did this so I could watch Evangelion on LaserDisc, which doesn’t have an English sub or dub. You can see the result here.

Extract Subtitles #

Use ffmpeg to extract subs embedded in the video file. Files often have multiple subtitle streams, so we have to figure out which one we want. You can inspect the file with ffmpeg, or just export each stream and check them. In the command below, -map 0:s:1 picks the second subtitle stream of the first input, since the index is zero-based.

The files are mounted on a network drive, so I’m doing this in PowerShell, though the ffmpeg command itself is pretty portable.

Get-ChildItem -Path "./" -File | ForEach-Object { ffmpeg -i ".\$($_.Name)" -map 0:s:1 "./subtitles/$($_.Name).srt" }
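To pick the right N for -map 0:s:N without exporting every stream, ffprobe (which ships with ffmpeg) can list just the subtitle streams. This is a minimal sketch written as a bash function to match the later steps; the function name and the file name in the usage comment are made up for illustration.

```shell
# list_sub_streams FILE
# Print one line per subtitle stream: stream index, codec,
# then the language/title tags when the file has them.
list_sub_streams() {
  ffprobe -v error \
    -select_streams s \
    -show_entries stream=index,codec_name:stream_tags=language,title \
    -of csv=p=0 \
    "$1"
}

# Usage (hypothetical file name):
#   list_sub_streams episode01.mkv
```

The first output line corresponds to 0:s:0, the second to 0:s:1, and so on. The index column that ffprobe prints is the absolute stream index in the file, not the subtitle-relative one.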

Once we have the sub files, we need to prepare them for templating, which means formatting.

Formatting for Jekyll #

I use Jekyll for this blog, so we need to convert the sub files into CSV so we can template off the values. Jekyll loads CSV files placed under _data and exposes their rows through site.data.

Luckily, the srt format is super simple. Basically just index, time range, text, and an empty line.
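As a rough illustration of that layout, here is a made-up two-entry sample and an awk call in paragraph mode, where RS="" turns each blank-line-separated block into one record and FS="\n" turns each line into a field. The sample text and both function names are invented for this sketch.

```shell
# A two-entry sample in SRT layout: index, time range, text, blank line.
sample_srt() {
  cat <<'EOF'
1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
Second entry,
on two lines.
EOF
}

# Paragraph mode: one record per block, one field per line.
# Fields 1 and 2 are the index and time range, the rest is text.
srt_summary() {
  awk 'BEGIN { RS = ""; FS = "\n" } { print $1 ": " (NF - 2) " text line(s)" }'
}

sample_srt | srt_summary
# 1: 1 text line(s)
# 2: 2 text line(s)
```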

After exporting the sub files, I moved them to WSL, so we can use bash from here on. Using sed and awk, we can massage them into the format we want. To prepare the text for storing in a CSV, we need to double up any " marks so they’re treated as escaped quotes inside a quoted field.

for f in ./*.srt; do
	# print column titles
	echo "index,timecode,text" > "${f%.srt}.csv"
	# sed: double every " so it is escaped inside a quoted CSV field
	# awk: one record per blank-line-separated entry, one field per line;
	#      up to three text lines are joined into the last column
	sed 's/"/""/g' "$f" |
	awk 'BEGIN{RS="";FS="\n"}{print $1 "," "\""  $2 "\"" "," "\""$3 $4 $5 "\"" }' \
	>> "${f%.srt}.csv"
done

Or a monster one-liner:

for f in ./*.srt; do echo "index,timecode,text" > "${f%.srt}.csv"; sed 's/"/""/g' "$f" | awk 'BEGIN{RS="";FS="\n"}{print $1 "," "\"" $2 "\"" "," "\""$3 $4 $5 "\"" }' >> "${f%.srt}.csv"; done
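To see what a row looks like, this sketch runs the same sed and awk steps on two made-up entries fed through stdin. The function name and the sample text are invented for illustration.

```shell
# Same sed + awk steps as the loop, reading one SRT stream from stdin.
srt_to_csv_rows() {
  sed 's/"/""/g' |
  awk 'BEGIN { RS = ""; FS = "\n" }
       { print $1 "," "\"" $2 "\"" "," "\"" $3 $4 $5 "\"" }'
}

printf '%s\n' \
  '1' \
  '00:00:01,000 --> 00:00:03,500' \
  'He said "run".' \
  '' \
  '2' \
  '00:00:04,000 --> 00:00:06,000' \
  'Line one' \
  'line two' |
srt_to_csv_rows
# 1,"00:00:01,000 --> 00:00:03,500","He said ""run""."
# 2,"00:00:04,000 --> 00:00:06,000","Line oneline two"
```

Note that multi-line text is joined with no separator, as the second row shows. If that matters for your subs, putting " " between $3, $4 and $5 is one option, at the cost of trailing spaces on shorter entries.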

After running this, I needed to make a few more changes to the original srt files. Some entries have empty lines inside the subtitle text, which screws up this parser, because an empty line is what separates one entry from the next. Instead of trying to work around it, I just found all the places that matched the regex \n\n[^\d] (an empty line followed by something other than a digit) and fixed them manually.
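If you would rather hunt for those spots from the shell than from an editor search, something along these lines should flag the same places. It is only a sketch: the function name is made up, it assumes LF line endings, and like the regex it misses stray text lines that happen to start with a digit.

```shell
# find_stray_blanks FILE...
# Report every line that follows an empty line but does not start
# with a digit, i.e. an empty line sitting inside the subtitle text.
find_stray_blanks() {
  awk 'FNR == 1 { prev_blank = 0 }
       prev_blank && $0 !~ /^[0-9]/ { print FILENAME ":" FNR ": " $0 }
       { prev_blank = ($0 == "") }' "$@"
}

# Usage:
#   find_stray_blanks ./*.srt
```

Each hit is printed as file:line: text, which is enough to jump to it and fix it by hand.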

Jekyll include page #

Next we need to set up a reusable include page that we just feed CSV data to, and it outputs all the subtitles in a table. We just need to pass the episode title and reference the episode CSV data in a way Jekyll likes. I came up with:

<!-- Call with % include subtitles.html title="asd" srtdata=site.data.eva_subs.subname % -->
<details>
<summary>{{ include.title }}</summary>
<table style="font-size: 1em;">
	{% for entry in include.srtdata %}
	<tr>
		<td>{{ entry.timecode }}</td>
		<td>{{ entry.text }}</td>
	</tr>
	{% endfor %}
</table>
</details>

Styling #

A new problem that arose was styling. The subtitles are meant to be displayed on a screen over video, so a lot of the formatting isn’t exactly website friendly. The original .srt formatting used <font></font> tags, which are obsolete in HTML5. I simply used ctrl+shift+f to replace those and fix the styles, including color and font size.

I normalized the styles to a few common values, since they will just be displayed on a web page.
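The same kind of replacement can also be scripted. This is a sketch with sed rather than what I actually ran, and it assumes tags of the form <font color="..."> with no other attributes, so check what your files really contain first. It is meant for the .srt files before the CSV step, since the quotes get doubled after that. The function name is made up.

```shell
# font_to_span: rewrite <font color="X">...</font> as
# <span style="color: X">...</span>. Reads stdin, writes stdout.
# Only handles a font tag whose sole attribute is color.
font_to_span() {
  sed -E \
    -e 's/<font color="([^"]*)">/<span style="color: \1">/g' \
    -e 's#</font>#</span>#g'
}

echo '<font color="#ffff00">Shinji!</font>' | font_to_span
# <span style="color: #ffff00">Shinji!</span>
```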

Subtitle Page #

Now we create a page that uses our include file to display subtitles. I opted to display all the subtitles for a series on a single page, so I only have to create one page per series. There’s no easy way to generate sets of pages in Jekyll, so a page per series isn’t too much to ask.

It ended up just being:

---
layout: default
title: Home
---

{%- include subtitles.html title="01 Angel Attacks" srtdata=site.data.eva_subs.eva_01_Angel_Attacks -%}
{%- include subtitles.html title="02 Unfamiliar Ceiling" srtdata=site.data.eva_subs.eva_02_Unfamiliar_Ceiling -%}
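With a whole series, typing one include line per episode gets tedious, so the lines can be generated from the CSV file names. This is a sketch that assumes the files are named like eva_01_Angel_Attacks.csv and live in _data/eva_subs/, which is inferred from the data keys above. The function name is made up.

```shell
# gen_includes CSV...
# Print one Liquid include line per CSV file. For example,
# eva_01_Angel_Attacks.csv gives the title "01 Angel Attacks"
# and the data key eva_01_Angel_Attacks.
gen_includes() {
  for f in "$@"; do
    key=$(basename "$f" .csv)                 # eva_01_Angel_Attacks
    title=${key#eva_}                         # 01_Angel_Attacks
    title=$(printf '%s' "$title" | tr '_' ' ') # 01 Angel Attacks
    printf '{%%- include subtitles.html title="%s" srtdata=site.data.eva_subs.%s -%%}\n' \
      "$title" "$key"
  done
}

# Usage:
#   gen_includes _data/eva_subs/*.csv
```

The output can be pasted under the front matter of the series page.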