Monday, April 22, 2013

Character set encoding in XML



Character encoding is an interesting topic to work on, but it can be really difficult to debug. One issue I came across recently was related to character encoding in OOXML documents.
In my case, we were reading a template OOXML document (a Word 2007/2010 document), replacing a placeholder with a specific value, and writing the result to another file. The template itself opened fine in Word and showed the placeholder.






However, when we read the file and replaced the placeholder with a valid value, the newly created OOXML document wouldn’t open in Word and gave an error instead.


On further analysis, I found that the newly created, invalid XML contained junk characters that were not present in the valid XML when the two were compared side by side.


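Digging deeper, I found that the problem was in how the file was read: no encoding was specified, so the platform default, cp1252, was being used. When bytes written in one encoding are decoded with another, every character outside plain ASCII can turn into junk.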
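To illustrate how this kind of junk appears, here is a minimal sketch in Python (the language of the original code is not shown in this post, so Python is used purely for illustration). The sample text is hypothetical; it assumes the XML is stored as UTF-8 and contains a right single quotation mark (U+2019), which Word often inserts.

```python
# Hypothetical sample text containing a right single quotation mark (U+2019).
original = "Author\u2019s name"

# How the text is stored on disk if the XML file is UTF-8 encoded (assumption).
utf8_bytes = original.encode("utf-8")

# What a reader sees when it decodes those same bytes as cp1252.
misread = utf8_bytes.decode("cp1252")

print(misread)  # the single quotation mark has become three junk characters

# Writing the misread text back out does not restore the original bytes.
assert misread.encode("utf-8") != utf8_bytes
```

The single character U+2019 takes three bytes in UTF-8, and cp1252 treats each byte as a separate character, which is why one character becomes three.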


When I changed the read to use UTF-8 explicitly, it worked like a charm.
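The original code is not reproduced here, so the following is only a sketch of the idea in Python under stated assumptions: the XML is UTF-8 encoded, and the function name `replace_placeholder` and the placeholder syntax are hypothetical. The point is simply that the encoding is named on both the read and the write instead of being left to the platform default.

```python
def replace_placeholder(src_path, dst_path, placeholder, value):
    """Copy an XML file, replacing a placeholder, with explicit UTF-8 I/O.

    Hypothetical helper for illustration; assumes the file is UTF-8 encoded.
    """
    # Name the encoding explicitly; otherwise the platform default
    # (for example cp1252 on Windows) may be used to decode the bytes.
    # newline="" keeps the file's line endings unchanged.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        xml_text = src.read()

    xml_text = xml_text.replace(placeholder, value)

    # Write with the same encoding so what goes in is what comes out.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(xml_text)
```

A call might look like `replace_placeholder("template.xml", "output.xml", "${NAME}", "Zoë")`, where the file names and placeholder are made up for the example.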

We need to be really careful about character encoding when dealing with XML documents, because what goes in doesn’t necessarily come out as it is.

