Internationalizing a Web Application with Half a Million Lines of Code
Have you ever been tasked with translating an existing application into various languages? The common advice is to start early. But what if you have to add internationalization to an existing application with more than half a million lines of code? Here is how we prepared for this challenge.
By the end of last year, I was tasked with exploring the scope of translating our existing application into other languages. The major unknowns were: how many messages in our application need translation, and is such a project feasible at all?
I work as a Staff Engineer for Brandwatch, a leading analytics company. My primary focus is our user-facing web application. This code base is quite large, with more than half a million lines of code. The first commit was made in 2011, and the technology landscape of frontend development has changed a lot since then. The combination of frameworks from different eras, the long history, and the active development make this project even more challenging.
Understanding our text usage
Interestingly, at first glance our application doesn’t look that text-heavy. It’s a web application with a few areas containing some text, and obviously charts all over the place. Even if you dig into the application, it remains data-centric. Should be easy, right? Let’s dive in and see.
Code vs. human text
A human can easily spot the difference between application code and text. However, counting words manually would be very impractical. So I asked myself: “What’s the main difference between code and text?”
Application code is rarely surrounded by quotation marks unless it is a string. That is our starting point. I wrote a regular expression to find text between quotation marks. This was a first step towards our solution, but far from perfect. The false-positive rate was quite high since, surprise surprise, not every string between quotation marks is a message requiring translation.
We adjusted the regular expression to take only strings starting with a capital letter into account. This gave us a useful first list of the text used in our application. The list is far from complete, however: string concatenation, lowercase sentences, and developer-facing messages are just a few of the issues we still need to address. Still, the list is already very insightful.
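The two regular-expression passes described above can be sketched roughly as follows. This is a minimal illustration in Python, not the expression we actually used; the pattern names, the helper `find_strings`, and the sample source are all made up for this example.

```python
import re

# Any single- or double-quoted string on one line (backslash escapes allowed).
ANY_STRING = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")

# Same shape, but the content must start with a capital letter.
CAPITALIZED = re.compile(r"""(["'])([A-Z](?:\\.|(?!\1).)*)\1""")


def find_strings(source: str, pattern: re.Pattern) -> list[str]:
    """Return the content of every quoted string the pattern matches."""
    return [match.group(2) for match in pattern.finditer(source)]


# Hypothetical application code, only for illustration.
SAMPLE = '''
import { track } from "analytics";
const title = "Save your dashboard";
element.classList.add('is-active');
showError('Something went wrong');
'''

all_strings = find_strings(SAMPLE, ANY_STRING)    # includes module and class names
candidates = find_strings(SAMPLE, CAPITALIZED)    # only the capitalized strings
```

On this sample, the first pattern also returns the module name and the CSS class, while the second keeps only the two user-facing messages. It would equally drop a legitimate lowercase message, which is the incompleteness mentioned above.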
At a later stage of the project, I used an AST (Abstract Syntax Tree) to increase coverage. This turned out to be more reliable and stable than the regular-expression strategy. Let us pause here for now and see what HTML templates entail.
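Before moving on, here is a rough sketch of why an AST helps. It is not the tooling we used on our JavaScript code base; it uses Python's standard `ast` module on a small piece of Python source purely to illustrate the idea, and the helper name `string_literals` and the sample are assumptions made for this example.

```python
import ast


def string_literals(source: str) -> list[str]:
    """Collect all string constants from Python source, in source order."""
    tree = ast.parse(source)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    # ast.walk does not guarantee source order, so sort by position.
    nodes.sort(key=lambda node: (node.lineno, node.col_offset))
    return [node.value for node in nodes]


# Hypothetical source, only for illustration.
SAMPLE = '''
# "Not a string, just a comment"
label = "Export report"
message = "Deleted " + "3 items"
'''

literals = string_literals(SAMPLE)
```

Because the parser knows what is a comment and what is a string literal, the quoted text inside the comment is ignored, and both halves of the concatenation show up as separate literals that can be inspected together.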
Find text in HTML templates
In a previous blog post (which is my most popular post so far) I explained how to use Comby to refactor code with ease. Comby is a really neat tool which provides a simple pattern matching API, and I made heavy use of it for the next task.
Text appears differently in HTML: often as a child or sibling element, but also as the value of an attribute. Let’s have a look at a basic HTML template.
This template contains three texts (highlighted in yellow) that need to be translated. What they all have in common is that either the text is surrounded by quotation marks, or it is a child element of another element.
Let’s start simple and use a Comby match template to find text between quotation marks.
Look how easy it is to write a Comby match template. This match template will find any string between quotation marks, regardless of its meaning. Comby is language agnostic, but its key feature is that it understands the relationships between delimiters, strings, and comments.
Similar to our regular expression approach, this Comby match template matches everything, including attribute values that need to remain in English. Let’s narrow this down by matching values for the placeholder attribute only.
Great, this works. Now we have to repeat this for all possible attributes like title, label, aria-label, alt … you get the idea.
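For readers who want to try the attribute step without Comby, the same idea can be sketched with Python's standard `html.parser` module. This is a different technique from the Comby match templates used in this post; the attribute set, the class name, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

# Attributes whose values are usually shown to users (an assumed, partial list).
TRANSLATABLE_ATTRS = {"placeholder", "title", "label", "aria-label", "alt"}


class AttributeTextCollector(HTMLParser):
    """Collects (attribute, value) pairs for attributes that hold visible text."""

    def __init__(self) -> None:
        super().__init__()
        self.found: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip valueless or empty attributes such as a bare `title`.
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.found.append((name, value.strip()))


def collect_attribute_text(html: str) -> list[tuple[str, str]]:
    collector = AttributeTextCollector()
    collector.feed(html)
    collector.close()
    return collector.found


# Hypothetical template, only for illustration.
SAMPLE = (
    '<input type="text" class="search" placeholder="Search mentions">'
    '<img src="logo.png" alt="Company logo">'
)

attribute_text = collect_attribute_text(SAMPLE)
```

Values of attributes such as type, class, and src are skipped, which mirrors the goal of leaving non-translatable attribute values alone.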
Next step: let’s focus on the child elements. An obvious first attempt would be to use a match template like >:[val]<. This would kind of work, but it returns a large number of odd results, because it does not respect the natural boundaries of a phrase when elements are nested.
<span>The quick <strong>brown fox</strong> jumps over the lazy dog.</span>
According to the HTML spec for the span element, only “phrasing content” is allowed inside a <span> tag. With this in mind, we can tweak our match template to only match <span> tags.
<span>The quick <strong>brown fox</strong> jumps over the lazy dog.</span>
There is a caveat with this. As you can see, the matched string contains HTML as well. For our exploration, we are interested in the English phrase only. To not overcomplicate things, we use a post-processing step to strip out such HTML tags.
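Such a post-processing step could look roughly like the sketch below. It is an illustration using Python's standard `html.parser` module, not our actual job; the names `TextOnly` and `strip_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text nodes and drops every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(fragment: str) -> str:
    """Return the plain text of an HTML fragment with whitespace normalized."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


matched = "The quick <strong>brown fox</strong> jumps over the lazy dog."
phrase = strip_tags(matched)
word_count = len(phrase.split())
```

For the example match above, this yields the plain sentence without the <strong> tags, and splitting on whitespace gives a simple word count for it.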
Finally, let’s repeat this for other tags that we are interested in, like <h1>, <h2>, <button>, <label>, <p> …
Conclusion
In the end, we identified over 4,000 individual words and sentences, which together make up more than 10,000 words. There is a lot that we haven’t covered: string concatenation, nested elements, and other complex constructions. Regardless, we got a much better understanding of how much text our application contains, and it was much more than we initially thought.
Thanks for reading. If you have any questions, follow me or send me a message: I’m @buckstefan on Twitter.