This is one of the more interesting topics to work on, but it can
sometimes get really difficult to debug. One issue that I came across
recently was related to character encoding in OOXML documents.
In my case, we were reading a template OOXML file (a Word 2007/2010
document), replacing a placeholder with a specific value, and then writing
the result to another file. The template document would open in Word
and show the placeholder.
However, when we read the file and replaced the placeholder
with a valid value, the newly created OOXML document wouldn’t open in Word and gave an error instead.
On further analysis, I found that the newly created, invalid XML
had somehow picked up some junk characters that were not present in the
valid XML.
Digging further, I found that the problem occurred while reading the
file: no encoding was specified, so the default cp1252 character encoding was being used.
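To illustrate where such junk characters come from, here is a minimal sketch in Python (an assumption on my part, since the mechanism is the same in any language). The sample text is made up; it simply contains a curly apostrophe, which takes three bytes in UTF-8.

```python
# Hypothetical sample text containing one non-ASCII character (U+2019).
original = "It’s a test"

# This is what is actually stored on disk in a UTF-8 encoded XML file.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as cp1252 turns each byte of the multi-byte
# character into a separate, unrelated character.
garbled = utf8_bytes.decode("cp1252")

print(garbled)  # Itâ€™s a test
```

The single apostrophe becomes three junk characters, which is the typical signature of UTF-8 data being read with a single-byte encoding.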
When I changed it to UTF-8,
it worked like a charm.
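A rough sketch of that kind of fix in Python is shown here. It is not the original code: the function name `replace_placeholder`, the `${NAME}` placeholder, and the file names are all hypothetical, and it assumes the XML part has already been extracted from the document package (an OOXML file is a ZIP archive of XML parts).

```python
import os
import tempfile


def replace_placeholder(src_path, dst_path, placeholder, value):
    """Copy an XML part from src_path to dst_path, swapping placeholder for value."""
    # Name the encoding on both the read and the write; otherwise Python
    # falls back to the platform default, which may be cp1252 on Windows.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        xml_text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(xml_text.replace(placeholder, value))


# Small round trip with a made-up XML fragment.
with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, "template.xml")
    output = os.path.join(tmp, "output.xml")
    with open(template, "w", encoding="utf-8", newline="") as f:
        f.write("<w:t>“${NAME}”</w:t>")
    replace_placeholder(template, output, "${NAME}", "Zoë")
    with open(output, "rb") as f:
        result = f.read().decode("utf-8")

print(result)  # <w:t>“Zoë”</w:t>
```

The curly quotes in the template and the non-ASCII character in the value both survive the round trip because the same encoding is named at every step. The value is inserted verbatim here; a real implementation would probably also need to XML-escape it.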
We need to be really careful when dealing with XML documents,
especially with respect to character encoding, because what goes
in doesn’t necessarily come out as it is.


