a 'mooh' point

clearly an IBM drone

DII ODF workshop - the good stuff

... continued from DII-workshop in Redmond - round-table discussions.

So - let's get down to what was the real purpose of going to Redmond - apart from the great breakfast I had at Lowell's in Farmer's market in Seattle - to test the pre-alpha version of Microsoft Office 2007 SP2 and its ODF-support.

(Let me start by apologizing for the late post, but I lost the USB-drive with my test-files on it, and I didn't find it until a few days ago.)

I have already listed some of the findings of the day in my previous post, so I'll try to get into more detail here.

What did I do?

Well, we would have some hands-on time with the latest build of Microsoft Office 2007 SP2 (apparently directly from a developer's machine), so I brought a bunch of documents I have worked on before. Some of them were from the application interop-work I participated in during Fall 2007 for the Danish National IT- and Telecommunication Agency. Others I have created myself. I performed the following steps for each file:

  1. Load the ODF-file in OpenOffice.org 2.4
  2. Create a PDF-file of the document using a PDF printer driver (CutePDF)
  3. Load the ODF-file in Microsoft Office 2007 SP2
  4. Do a "Save as ODF" and prefix the original filename with "MSO". According to the Microsoft project managers I talked to, this would ensure I actually saved a version of the ODF-file that had been processed by the internal object model of Microsoft Office 2007 SP2.
  5. Create a PDF-file of the document using a PDF printer driver (CutePDF)

Below I have listed for each document the following data:

Original file: somefile
Original file New file
Generator: SomeApplication PDF Generator: Microsoft Office 2007 SP2 PDF

For each I will include some tech remarks on interesting subjects - if any.

There are a couple of things to note on a general level before we get started. Microsoft has chosen to implement ODF "by the book" in the sense that they have not paid much attention to bugs or "features" in competing applications. This has the peculiar effect that perfectly legitimate ODF-files produced by Microsoft Office 2007 SP2 might not load properly in competing applications. For more general ideas of what they did, you should check out Dennis Hamilton's post from the workshop. It is by far the most comprehensive of the ones posted since last week.

Original file: Testfile_03.odt
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is an ODT-file with an embedded ODS-spreadsheet. Loading this file into Microsoft Office shows a nice red cross and no spreadsheet. An inspection of the ODT-file shows that the content is pretty much preserved including the embedded ODS-spreadsheet. But when looking at the manifest file, the following appears:

[code=xml]<manifest:file-entry
 manifest:media-type="application/x-openoffice-gdimetafile;windows_formatname=&quot;GDIMetaFile&quot;"
 manifest:full-path="ObjectReplacements/Object 1"
/>[/code]

This entry points to the graphical representation of the embedded spreadsheet. The media-type seems to be an old StarView Metafile format (confirm, anyone?) and Microsoft Word doesn't understand this image format - hence the red cross. This example highlights one of the points of bad interoperability: small errors can cause big problems. Everything but the missing image is preserved, yet this "small" error still makes the document useless.
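If you want to check your own files for the same thing, a small script can list the manifest entries and flag the ones that use this media-type. This is only a sketch: the function names are made up for illustration, and the manifest namespace URI is the one I know from the ODF specification, not something shown in the snippet above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of META-INF/manifest.xml (taken from the ODF spec, not from the post)
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def manifest_entries(package):
    """Return (full-path, media-type) pairs from META-INF/manifest.xml."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    entries = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        media_type = entry.get(f"{{{MANIFEST_NS}}}media-type", "")
        entries.append((path, media_type))
    return entries


def gdimetafile_entries(package):
    """Return the paths whose media-type mentions the GDIMetaFile format."""
    return [
        path
        for path, media_type in manifest_entries(package)
        if "gdimetafile" in media_type.lower()
    ]
```

Calling `gdimetafile_entries("Testfile_03.odt")` on a local copy of the file should then point at the replacement images that another application may fail to render.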

Original file: Testfile_07.odt
Original file New file
Generator: OpenOffice.org/2.0$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is included in the "Self-assessment"-package from the Danish National IT- and Telecom Agency. Loading the letter into Microsoft Office 2007 initially appears to produce an identical file, but even though the content itself is preserved, there are still areas with problems.

  1. There is a border around the logo image in the header
  2. The height of the header is not completely preserved
  3. The "right margin" (which is really a stretched text box) is gone since the text box is wrapped around the text instead of being preserved in its full length
  4. Page numbering is gone on the last page
A funny note: if you load the file generated by Microsoft Office 2007 in OOo 2.4, it loads perfectly fine and looks like the original document. This suggests that the problems encountered when loading it in Microsoft Office 2007 are not problems with converting ODF to the internal object model of Microsoft Office 2007 but instead problems in the layout engines.

Original file: Testfile_08.odt
Original file New file
Generator: OpenOffice.org/2.2$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This is another document from the Self-assessment package. It contains a few different features: a TOC, colored text, text boxes, a drawing, an embedded spreadsheet as well as some change-modification. The generated document is kind of messy. The content has been "shuffled" around and again we have the problem with Microsoft Office 2007 SP2 not understanding the GDIMetaFile image format. The embedded objects are fine themselves - the graphical representation of them is not.

Original file: Testfile_10.odt
Original file New file
Generator: Jesper Lund Stocholm PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is another one of my own files that I have created earlier. It contains a mathematical formula in MathML. When loading it in Microsoft Office 2007 SP2, the mathematical formula simply disappears. I am kind of lost on the reason for this. It is not the DOCTYPE-declaration used by OOo (see next file for those details) so maybe it is the construction of my ODT-file that poses an issue for them.

Original file: Testfile_11.odt
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is almost identical to the one above - but it is generated by OOo 2.4 instead of me and carries all the styling and configuration that comes with it. Here the file and the mathematical content load just fine. But an interesting thing happens when saving it again. The MathML-fragment is slightly altered from

[code=xml]<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE math:math PUBLIC "-//OpenOffice.org//DTD Modified W3C MathML 1.01//EN" "math.dtd">
<math:math xmlns:math="http://www.w3.org/1998/Math/MathML">
 <math:semantics>
  <math:mrow>
   <math:mi>cos</math:mi>
   <math:mrow>
    <math:mrow>
     <math:mo math:stretchy="false">(</math:mo>
     <math:mfrac>
      <math:mo math:stretchy="false">π</math:mo>
      <math:mn>4</math:mn>
     </math:mfrac>
     <math:mo math:stretchy="false">)</math:mo>
    </math:mrow>
    <math:mo math:stretchy="false">=</math:mo>
    <math:mfrac>
     <math:msqrt>
      <math:mn>2</math:mn>
     </math:msqrt>
     <math:mn>2</math:mn>
    </math:mfrac>
   </math:mrow>
  </math:mrow>
  <math:annotation math:encoding="StarMath 5.0">
    cos({%pi} over {4} ) = {sqrt{2} } over {2}
  </math:annotation>
 </math:semantics>
</math:math>[/code]

to

[code=xml]<?xml version="1.0" encoding="UTF-8"?>
<mml:math
  xmlns:mml="http://www.w3.org/1998/Math/MathML"
  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
  <mml:mi mathvariant="normal">c</mml:mi>
  <mml:mi mathvariant="normal">o</mml:mi>
  <mml:mi mathvariant="normal">s</mml:mi>
  <mml:mo>(</mml:mo>
  <mml:mfrac>
    <mml:mrow>
      <mml:mi>π</mml:mi>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>4</mml:mn>
    </mml:mrow>
  </mml:mfrac>
  <mml:mo>)</mml:mo>
  <mml:mo>=</mml:mo>
  <mml:mfrac>
    <mml:mrow>
      <mml:msqrt>
        <mml:mn>2</mml:mn>
      </mml:msqrt>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>2</mml:mn>
    </mml:mrow>
  </mml:mfrac>
</mml:math>[/code]

The clever reader will notice that the semantic annotations used by OOo are removed from the MathML-fragment. The MathML is in general altered a bit, but the changes are not that big - most of them are visual things related to styling. The problem is that OOo cannot consume this MathML. The MathML-fragment produced by Microsoft Office 2007 SP2 is valid MathML (validated using Amaya), and even when I add the required !DOCTYPE, it still won't load in OOo.
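A quick way to see which element types got lost in a round trip like this is to compare the sets of element names in the two fragments while ignoring the namespace prefixes (math: versus mml:). The sketch below does just that; the function names are hypothetical, and it only tells you which element types are missing, not how the structure was rearranged.

```python
import xml.etree.ElementTree as ET


def local_names(xml_text):
    """Return the set of element names in a fragment, namespaces stripped."""
    root = ET.fromstring(xml_text)
    # ElementTree stores tags as "{namespace-uri}name"; keep only "name"
    return {element.tag.rsplit("}", 1)[-1] for element in root.iter()}


def dropped_elements(original, resaved):
    """Return element names present in the original but not in the re-saved file."""
    return sorted(local_names(original) - local_names(resaved))
```

Run against the two fragments above, I would expect it to single out the semantics and annotation elements, since every other element type appears in both versions.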

Original file: Testfile_13.odt
Original file New file
Generator: OpenOffice.org/2.0$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

(file has been removed at the request of the originator of the file)

This file is a bit more complex, and as with Testfile_08 it consists of a lot of different parts. Key issues here are failure to read GDIMetaFiles, borders around images, errors in visual presentation of numbered/bulleted lists and lines being much thicker than in the original file. There is really nothing new in this file - it just confirms the problems identified with Testfile_08.

Original file: Testfile_14.odt
Original file New file
Generator: OpenOffice.org/2.3$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is one of those template-files that are used a lot almost everywhere. You know, someone has created a "standard" document with correct header, footer and images, and this file is then distributed in the organisation. The conversion is actually almost error-free. There is a slight error with respect to border around images and rendering of them, but that is just about it.

Original file: Testfile_20.ods
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

(both PDF-files have been created by OOo 2.4/Win32)

I created the file above to illustrate what would happen when working with spreadsheets. I used the infamous CEILING-function, but at that time I was not aware that Microsoft Office 2007 SP2 would throw out formulas from "unknown namespaces". Hence there is very little change - the only difference is that the visible number of decimals has been reduced to two after the trip through Microsoft Office 2007 SP2. If you look in the XML generated, you will find one interesting thing, though:

[code=xml]<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<office:document-content
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
  xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
  xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
  xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
  xmlns:msoxl="http://schemas.microsoft.com/office/excel/formula"
  >
  (...)
</office:document-content>[/code]

Can you see it?
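If you would rather let a script do the looking, the sketch below lists every namespace declared in a content.xml and filters out the ones that belong to the OpenDocument family. The function names and the prefix constant are my own for illustration; the prefix is simply what the OASIS URIs in the listing above have in common.

```python
import io
import xml.etree.ElementTree as ET

# Common start of the OASIS OpenDocument namespace URIs shown in the listing
ODF_PREFIX = "urn:oasis:names:tc:opendocument:xmlns:"


def declared_namespaces(xml_bytes):
    """Return {prefix: uri} for every namespace declaration in the document."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    # Each "start-ns" event carries a (prefix, uri) tuple
    return dict(namespace for _event, namespace in events)


def non_odf_namespaces(xml_bytes):
    """Return the declared namespaces that are not OpenDocument namespaces."""
    return {
        prefix: uri
        for prefix, uri in declared_namespaces(xml_bytes).items()
        if not uri.startswith(ODF_PREFIX)
    }
```

Well-known W3C and Dublin Core namespaces will show up in the result too, so the list still needs a human eye.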

Conclusions

Well, the investigation above was based on about 20 files tested, and they were primarily text documents (and one spreadsheet). Some of them were created by me and some were created by various parts of the public sector in Denmark. I have only looked at about half of the files, but a few other files are also available should you wish to play with them yourself. You can get them here: public.zip (3,02 mb).

Validation

I have made some effort to validate the ODF-files generated by Microsoft Office 2007 SP2. I downloaded the RelaxNG ODF 1.1-schemas from OASIS' website and used Jing to perform the schema-validation. Since there is a known bug in the schemas, I ran Jing with the "-i" flag set. Validating the structure of the package itself is a bit tricky (as reported by Rick Jelliffe) and I have not done that. Instead I have schema-validated the files "content.xml" and "styles.xml", based on the thought that these are the most complex files in the package. The result of the validation is that all files generated by Microsoft Office 2007 SP2 are valid ODF 1.1-files. I piped the result of the validation into an output file available here for your viewing pleasure: output.txt (1,92 kb).
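For anyone who wants to repeat the validation, the procedure can be scripted roughly like this. It is a sketch under assumptions, not a record of what I ran: the function names are made up, the paths to the Jing jar and the schema are whatever you have locally, and Java is assumed to be on the path. The "-i" option is Jing's switch for skipping ID/IDREF checking.

```python
import subprocess
import zipfile
from pathlib import Path

# The two package parts that get schema-validated
PARTS = ("content.xml", "styles.xml")


def extract_parts(package, out_dir):
    """Copy content.xml and styles.xml out of an ODF package, if present."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(package) as zf:
        names = set(zf.namelist())
        for part in PARTS:
            if part in names:
                target = out_dir / part
                target.write_bytes(zf.read(part))
                extracted.append(target)
    return extracted


def jing_command(jing_jar, schema, xml_files):
    """Build the Jing command line; -i disables ID/IDREF checking."""
    return ["java", "-jar", str(jing_jar), "-i", str(schema), *map(str, xml_files)]


def validate(package, jing_jar, schema, out_dir):
    """Run Jing on the extracted parts; return (is_valid, jing_output)."""
    files = extract_parts(package, out_dir)
    result = subprocess.run(
        jing_command(jing_jar, schema, files), capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout
```

Note that this only checks the two XML parts against the schema; it says nothing about the package structure itself.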

All in all I think Microsoft has done a pretty good job. Obviously there is still some way to go until it reaches production quality, but I was pleasantly surprised to see the big difference in conversion results compared with the results of the ODF Converter from SourceForge.net I have worked with earlier. There are a couple of things I would like to note, though:

Graphical representations of embedded objects

Microsoft Office 2007 SP2 has problems with reading the graphical representation of embedded objects if the file is created by OpenOffice. It seems that it simply doesn't support the GDIMetaFile-format used by OpenOffice (and its derivatives). I think the "nice" way to solve this would be to load the object (if supported) and render an image of it again. The dimension of the image is available in the <draw:frame>-element and could be used to determine the size of the image.
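To illustrate where those dimensions live, here is a minimal sketch that pulls the size of every frame holding an embedded object out of a content.xml. The function name is hypothetical; the namespace URIs are the draw and svg ones from the ODF namespace declarations shown earlier in this post.

```python
import xml.etree.ElementTree as ET

DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
SVG = "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"


def object_frame_sizes(content_xml):
    """Return (name, width, height) for each draw:frame that wraps a draw:object."""
    root = ET.fromstring(content_xml)
    sizes = []
    for frame in root.iter(f"{{{DRAW}}}frame"):
        # Skip frames that only hold a plain image or a text box
        if frame.find(f"{{{DRAW}}}object") is None:
            continue
        sizes.append(
            (
                frame.get(f"{{{DRAW}}}name"),
                frame.get(f"{{{SVG}}}width"),
                frame.get(f"{{{SVG}}}height"),
            )
        )
    return sizes
```

The widths and heights come back as the strings stored in the file (for example with a "cm" unit), so a renderer would still have to convert them.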

Embedded objects

I noticed that handling of embedded objects is done using a "don't touch"-approach, which means that when loading an ODF-file with an embedded object, the embedded object is simply copied and not touched by Microsoft Office 2007 SP2 (unless it is activated by the user). I think this is a good approach. Consuming applications should respect the "integrity" of the consumed package and not alter its content unless they have to.
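One way to check the "don't touch" behaviour on your own files is to compare the checksums of the embedded object streams in the original package and in the re-saved one. The sketch below assumes the embedded objects live under folder names starting with "Object " (as OOo names them in my test files); both that prefix and the function name are assumptions for illustration.

```python
import zipfile


def changed_object_streams(original, resaved, prefix="Object "):
    """Return embedded-object entries whose content differs or is missing."""
    with zipfile.ZipFile(original) as first, zipfile.ZipFile(resaved) as second:
        # The zip directory already stores a CRC-32 of each uncompressed entry
        crc_original = {
            info.filename: info.CRC
            for info in first.infolist()
            if info.filename.startswith(prefix)
        }
        crc_resaved = {info.filename: info.CRC for info in second.infolist()}
    return sorted(
        name for name, crc in crc_original.items() if crc_resaved.get(name) != crc
    )
```

An empty result means every embedded stream in the original is also in the re-saved package with identical content.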

mimetype

A funny little thing: The mimetype-file in the ODF-package is created using CAPITAL letters, i.e. the file will be called "MIMETYPE". This causes the OpenDocumentFellowship validator to fail since it cannot find the file (with non-capital letters). I have suggested to Microsoft to generate the file using non-capital letters to enhance interop and validation across platforms where some are "a bit more" case-sensitive than Windows.
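A check for this is easy to script. The sketch below looks for an entry named exactly "mimetype" and reports a wrongly-cased one if it finds that instead. It also checks two things that, as far as I recall from the ODF package specification, are expected of this file: that it is the first entry in the zip and that it is stored uncompressed. The function name is made up for illustration.

```python
import zipfile


def mimetype_problems(package):
    """Return a list of problems with the mimetype entry of an ODF package."""
    with zipfile.ZipFile(package) as zf:
        infos = zf.infolist()
    names = [info.filename for info in infos]
    if "mimetype" not in names:
        # Zip entry names are case-sensitive, so look for a wrongly-cased variant
        wrong_case = [name for name in names if name.lower() == "mimetype"]
        if wrong_case:
            return [f"found {wrong_case[0]!r} instead of 'mimetype'"]
        return ["no mimetype entry"]
    problems = []
    if names[0] != "mimetype":
        problems.append("mimetype is not the first entry")
    if infos[names.index("mimetype")].compress_type != zipfile.ZIP_STORED:
        problems.append("mimetype is compressed")
    return problems
```

An empty list means the entry looks the way a case-sensitive consumer would expect.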

config settings

Microsoft has chosen not to use the configuration elements otherwise so widely used by Lotus Symphony and OpenOffice.org. I am not sure if I think it is a good or a bad idea, but since they do not use the settings.xml-file at all, they should remove the file completely.
