How to translate the strings locally. You can use your favorite editor or tool for the translation if you work on the po file directly. This may not be the majority use case, but you can often see patterns in the text, for example in some of the exercises. In such cases, regex replacement lets you work on the translation efficiently.
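As a minimal sketch of the regex idea, the helper below applies a substitution only to msgstr lines of a po file. The function name, the pattern, and the Japanese replacement are hypothetical examples, not part of any script mentioned here; adapt them to the patterns you actually see in your exercises.

```python
import re

def translate_pattern(po_text, pattern, replacement):
    """Apply a regex replacement, but only on msgstr lines.

    msgid lines must stay untouched, so we skip every line that does
    not start with 'msgstr'.
    """
    out = []
    for line in po_text.splitlines():
        if line.startswith('msgstr'):
            line = re.sub(pattern, replacement, line)
        out.append(line)
    return '\n'.join(out)
```

For example, a repeated exercise phrase like "Round 42 to the nearest ten." can be translated in one pass with a backreference for the number.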
Then, save the po file to your local disk.
Choose the save location.
grep -v ^# learn.math.cc-seventh-grade-math-ja.po > learn.math.cc-seventh-grade-math-ja.po.txt
The result looks like the following.
Remove already translated lines.
The po file consists of pairs of ``msgid'' and ``msgstr'' entries: msgid identifies the string, and msgstr holds the translation for each language. For example, an already translated Japanese entry is:
msgid "Fractions, decimals, and percentages"
msgstr "分数、小数、およびパーセント"
Such entries are already translated, so we can remove them first. The resulting file looks like the following. Also, please compare it with the Crowdin window: you see the same untranslated entries in both.
Updated 2015-8-6 (Thu): I wrote a filter to do this process.
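The filter can be sketched as follows. This is not the author's actual script; it is a minimal sketch that assumes the simple one-line msgid/msgstr layout shown above, and that an untranslated entry is one whose msgstr still equals its msgid (as in the Crowdin export). Multi-line strings and the po header entry would need extra handling.

```python
def strip_translated(po_lines):
    """Keep only entries whose msgstr still equals msgid (untranslated)."""
    out = []
    i = 0
    while i < len(po_lines):
        line = po_lines[i]
        if line.startswith('msgid') and i + 1 < len(po_lines) \
                and po_lines[i + 1].startswith('msgstr'):
            msgid = line[len('msgid'):].strip()
            msgstr = po_lines[i + 1][len('msgstr'):].strip()
            if msgid == msgstr:  # still untranslated: keep the pair
                out.extend([line, po_lines[i + 1], ''])
            i += 2
        else:
            i += 1
    return out
```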
Translate only the msgstr contents.
Now you can translate the strings with your own editor, without the Crowdin editor, so we are free from the current slowness problem.
Here is an example.
Original:
msgid "Expressions, equations, and inequalities"
msgstr "Expressions, equations, and inequalities"
Translated (to Japanese):
msgid "Expressions, equations, and inequalities"
msgstr "式，等式，不等式"
Please note that the msgid line is unchanged.
Below is the translated po file.
Upload the po file.
Note: the translated po file's extension must be .po.
You can also upload the file partially; you only need the header and some of the (msgid, msgstr) entries.
Check the uploaded strings.
Using the proofread mode, you can check whether the uploaded entries are correct. If a double quote has been removed, the upload of that entry will fail, but such entries usually just end up untranslated.
Here is the project page. You can see the learn.math.cc-seventh-grade-math-ja.pot entry is now at 100%.
In this way, the main task, at least the translation part, has no waiting time.
It seems a non-O(n) algorithm is used to update the database: while the data size was smaller it worked OK, but once some data size was exceeded, it suddenly became too slow. (But this is just a hypothesis.)
However, fixing this will take time, I presume. So here is a workaround.
Some of the Khan Academy English subtitles are only available on YouTube, not on Amara. The problem is that we usually need an srt file for our translation workflow. However, YouTube provides a Transcript, as shown in the image below, and you can copy and paste it as text.
If we have a converter from this YouTube Transcript text to the srt file format, we can translate the file and upload it to Amara.
Here is a Python script to do that, under the New BSD License; anyone can use it freely. I suggest saving it as 'txt2srt.py', since the usage below refers to it by that name (to make the explanation easier).
Input (YouTube Transcript format):

0:00
Voiceover: The title of Thomas Piketty's book
0:02
is Capital in the 21st Century.
0:04
It's probably worth having a conversation
0:06
about what capital is.

Output (srt format):

1
00:00:00,000 --> 00:00:02,000
Voiceover: The title of Thomas Piketty's book
2
00:00:02,000 --> 00:00:04,000
is Capital in the 21st Century.
3
00:00:04,000 --> 00:00:06,000
It's probably worth having a conversation
4
00:00:06,000 --> 00:00:10,000
about what capital is.
txt2srt.py --infile trans.txt > trans.en.srt
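The core of such a conversion can be sketched as below. This is a sketch, not the actual txt2srt.py: it assumes the pasted transcript alternates timestamp lines and text lines, takes each entry's end time from the next entry's start, and gives the last entry a fixed duration (an arbitrary choice here, matching the 4-second last entry in the example above).

```python
def parse_timestamp(ts):
    """Convert 'm:ss' or 'h:mm:ss' into total seconds."""
    seconds = 0
    for part in ts.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

def format_srt_time(seconds):
    """Format seconds as the srt 'HH:MM:SS,mmm' timestamp."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return '%02d:%02d:%02d,000' % (h, m, s)

def transcript_to_srt(lines, last_duration=4):
    """Turn alternating timestamp/text lines into srt entries."""
    entries = []
    for i in range(0, len(lines) - 1, 2):
        entries.append((parse_timestamp(lines[i]), lines[i + 1]))
    out = []
    for n, (start, text) in enumerate(entries, start=1):
        # end = start of the next subtitle, or a fixed tail duration
        end = entries[n][0] if n < len(entries) else start + last_duration
        out.append('%d\n%s --> %s\n%s\n' % (
            n, format_srt_time(start), format_srt_time(end), text))
    return '\n'.join(out)
```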
Timing generation for subtitles is time-consuming work. If we can automate it, or at least semi-automate it, it could save time. I tried the following method, and so far it has saved up to half of the time in my workflow.
In my case, I work on srt files. I use Amara or Camtasia Studio for subtitle timing generation, but this is basically manual work. I already have the dubbed video and the translated text. So I discussed subtitle timing generation with some friends, did some research, and got the following idea.
Use YouTube transcript mechanism to generate the srt file.
So the basic technology is YouTube's transcript feature, which accepts a text file. Since YouTube's transcript sometimes cuts the original text lines into multiple lines, the simplest method needs only two filters: an srt-to-text filter, and an srt subtitle line concatenation filter. These are just simple text filters.
srtconv.py -i input.srt -o output.srt --outtype text
srtconv.py input.srt -o output.srt --outtype srtcatline
Tips: If you have any new idea for srt file filtering, this script has a simple srt file parser. You can use that parser as you like. The license is New BSD License, so you can freely use it.
Some srt files have the following two time information problems.
srtconv.py has an option, rm_gap, which automatically fixes the inconsistency and removes gaps.
But there is a limitation: the algorithm only looks at the current
subtitle line and the next one. Therefore, if the inconsistency spans more
than that, the program raises an error. For example,
1
00:00:04,000 --> 00:00:06,000
The first subtitle starts at 4 seconds and ends at 6 seconds.
2
00:00:02,000 --> 00:00:03,498
The second subtitle starts at 2 seconds and ends at 3 seconds.
The program can only change the end timing of the first subtitle, and cannot find a valid time in such a case.
It is possible to write code that keeps consistency globally, but that would change the start timings of the subtitle lines, which is usually not wanted. Thus, a time inversion case like the one in the example should be fixed manually.
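The local look-ahead described above can be sketched as follows. This is a sketch, not srtconv.py's actual implementation: it works on (start, end) pairs in seconds, clamps each entry's end to the next entry's start, and raises an error when even the start times are out of order, since then no valid end time exists.

```python
def remove_gaps(timings):
    """Clamp each subtitle's end time to the next subtitle's start.

    timings: list of (start, end) pairs in seconds, in file order.
    Raises ValueError on a time inversion that a local fix cannot
    repair (next subtitle starts before the current one).
    """
    fixed = []
    for i, (start, end) in enumerate(timings):
        if i + 1 < len(timings):
            next_start = timings[i + 1][0]
            if next_start < start:
                raise ValueError(
                    'time inversion at entry %d: fix manually' % (i + 1))
            end = next_start  # remove the gap (or overlap)
        fixed.append((start, end))
    return fixed
```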
The following Figures 1 and 2 show the result of this process.
When a po file on Crowdin has been updated, you need to extract the untranslated entries. This filter, pofilter, does the work, under the New BSD License; anyone can use it freely. I suggest saving it as 'pofilter.py', since the usage below refers to it by that name (to make the explanation easier).
pofilter.py content.chrome-ja.po content.chrome-ja.po.txt