a 'mooh' point

clearly an IBM drone

Document translation sucks (When Rob is right, he's right)

It is very seldom that I read one of Rob's posts and think "That is just so true" - but yesterday was one of those occasions. I was reading through his latest post about the load times of different documents in a couple of applications and I couldn't help but smile when I got to the part where Rob made some observations about possible reasons for the poor load times of ODF-files using Microsoft Office 2003:

What is a file filter? It is like 1/2 of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate directly on. This is far more efficient than translation. This is the untold truth that the layperson does not know. But this is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is the way that Novell is now doing it now, since they discovered that the Microsoft approach is doomed to performance hell.


I have been trying to pitch my idea of "document format channels" for some time now. The basic idea is not to do translations between formats but to support the feature sets of both formats in the major applications.

I remember when I participated in the interop-work for the Danish Government in Fall 2007. While we tried to say something clever about the disappointing translation results we saw, we heard rumours that Novell would skip the XSLT-translation of ODF to OOXML (and vice versa) and instead extend the internal object model of Novell's edition of OpenOffice.org. This was where the idea was born.

The idea was to round-trip documents in the format in which they were born and not to attempt translation (also, how the hell do you translate e.g. a digital signature between an ODF-file and an OOXML-file?). What triggered the "vision" was that 1) the formats are not fully compatible and 2) translation sucks. In every interop-session I have attended and in every piece of interop-work I have participated in, there has been one, crystal clear conclusion:

When you translate, you lose information.

Essentially, translation is a poor man's document consumption, because if you lose information when translating - why would you do it? As Rob so correctly points out - when Microsoft chooses to use translators to enable "support" for ODF in their Microsoft Office suites, it's really another way of saying: "We don't really care about ODF". The same thing naturally goes for OpenOffice.org (and spin-offs). When they insist on implementing just import filters for OOXML and use translators to do so - they are saying exactly the same: "We don't really care about OOXML". In both cases what they are communicating to their users is really

We don't care that you lose information - you'll just have to settle for half of the correct solution

It's the same message I hear when some of my colleagues come to me and say: "Jesper, I finished the piece of code you wanted me to do". Sometimes I am blessed with conversations like:

Colleague: I finished the code piece
Jesper: Cool - does it work all right?
Colleague: Eh well, it compiles just fine ...

Is that good enough?

(and with this friendly post, I can only hope "someone" will accept the LinkedIn-invitation I sent in February just before the BRM in Geneva ... or maybe I should try Diigo instead?)


Challenge (Part II)

A tongue-in-cheek challenge for Mr. Rob Weir.

[code=xml]<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  office:version="1.1">
  <office:body>
    <office:spreadsheet>
      <table:table table:name="Sheet1" table:protected="true" table:protection-key="8A45FB0C33667F9E33ECA007FCE4F6684DC5F242">
        <table:table-column />
        <table:table-row >
          <table:table-cell office:value-type="float" office:value="10">
            <text:p>10</text:p>
          </table:table-cell>
        </table:table-row>
        <table:table-row >
          <table:table-cell office:value-type="string">
            <text:p>
              Dear Rob Weir. Please prove by this example that ODF is an "interoperable"
              document format and tell me how a consuming application should determine if the
              user should be allowed to modify the document. I do not think that it is.
              In fact I think that your statements that ODF is a document format that
              provides interoperability are brash, irresponsible and indefensible
              pieces of bombast that you should retract.
            </text:p>
          </table:table-cell>
        </table:table-row>
      </table:table>
    </office:spreadsheet>
  </office:body>
</office:document-content>[/code] 
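The question in the challenge can be made concrete with a small sketch. The Python below reads the protection attributes from a trimmed-down copy of the fragment above. It can tell that the sheet is protected and what the stored key is, but nothing in these attributes names the algorithm that produced the key. The helper names and the list of candidate algorithms are assumptions made for illustration; the key merely has the length of a SHA-1 digest, which is a guess based on size and not something the markup states.

```python
import hashlib
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Trimmed-down copy of the fragment above (same attributes on table:table).
FRAGMENT = """<office:document-content
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  office:version="1.1">
  <office:body>
    <office:spreadsheet>
      <table:table table:name="Sheet1" table:protected="true"
        table:protection-key="8A45FB0C33667F9E33ECA007FCE4F6684DC5F242"/>
    </office:spreadsheet>
  </office:body>
</office:document-content>"""


def read_protection(xml_text):
    """Return (sheet name, protected flag, key) for the first table:table."""
    root = ET.fromstring(xml_text)
    table = root.find(f".//{{{TABLE_NS}}}table")
    name = table.get(f"{{{TABLE_NS}}}name")
    protected = table.get(f"{{{TABLE_NS}}}protected") == "true"
    key = table.get(f"{{{TABLE_NS}}}protection-key")
    return name, protected, key


def candidate_digests(key_hex):
    """Guess which digests match the key purely by length (an assumption)."""
    key_len = len(bytes.fromhex(key_hex))
    return [alg for alg in ("sha1", "sha256", "sha512")
            if hashlib.new(alg).digest_size == key_len]


name, protected, key = read_protection(FRAGMENT)
print(name, protected, candidate_digests(key))
```

A consumer written this way still has to guess how the password was encoded and hashed before it can check what the user types, which is exactly the gap the challenge points at.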

(and yes, one of the reasons for this post is to show off the cool syntax highlighter of this blog engine)


And could you guys please stop the bickering and let's move on to something a bit more interesting? 

Conformance of ODF-documents

Ever since the now infamous article by Alex Brown, the blogosphere has been filled with interpretations of the, really not so surprising, results - that the OOXML document containing the original ECMA-376 spec does not conform to IS 29500.

The, really not so surprising, conclusions have been "Office 2007 does not even produce valid OOXML" followed closely by statements like "This shows that Microsoft Office 2007 should not be allowed since it does not produce valid OOXML".

Hmmm ... ok.

As some of you might remember, I participated in some lab tests with OOXML/ODF interop in Fall 2007. Basically I sat in a small room with guys from IBM, Microsoft, Novell and some guys from the Danish National IT- and Telecom Agency sifting through documents, converting them and examining the resulting XML generated. The documents we worked on were supplied by different parts of the Danish public sector. They were basically told to use some of their existing documents as basis for the parts of the tests they participated in. So these documents were real-world-documents.

One of the things we tested was whether the documents were in compliance with their respective specs. The original OOXML-documents we tested were all compliant with the ECMA-376 spec ... but it was a different case with the ODF-documents. So the other day I tried to validate all the original ODF-documents supplied to us.

The results are illustrated in the table below:

File name                                                  | Generator          | Conclusion
-----------------------------------------------------------|--------------------|-----------
DFFE_Afgået svar til Jane Doe.odt                          | OpenOffice.org/2.3 | not valid
DFFE_SJ_(1) - 15-06-2007 Foreløbig Høring om forslag.odt   | OpenOffice.org/2.0 | valid
GRIBSKOV_bek-281(BS).odt                                   | OpenOffice.org/2.0 | valid
GRIBSKOV_Standardbrev ifm ITST pilotprojekt.odt            | OpenOffice.org/2.2 | valid
GRIBSKOV_Udkast til Forslag til Lokalplan.odt              | OpenOffice.org/2.1 | not valid
ITST standardbrev ODT.odt                                  | OpenOffice.org/2.0 | valid
ITST Testdokument ODT.odt                                  | OpenOffice.org/2.2 | not valid
RM Kursusmateriale.odt                                     | OpenOffice.org/2.0 | not valid
RM Standardbrev 2s.odt                                     | OpenOffice.org/2.3 | not valid

The table contains information about the file name of the original document, the application that generated it (from the META-file in the ODF-package) and if the document passed the test.
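For readers who want to repeat the generator lookup, here is a minimal sketch of how the value can be read from the meta.xml part of an ODF package with nothing but the Python standard library. The function names and the in-memory sample package are made up for illustration; a real document would be opened from disk instead, and its generator string is typically longer than the short form shown in the table.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def read_generator(odf_file):
    """Return the text of meta:generator from meta.xml, or None if absent."""
    with zipfile.ZipFile(odf_file) as package:
        root = ET.fromstring(package.read("meta.xml"))
    node = root.find(f".//{{{META_NS}}}generator")
    return node.text if node is not None else None


def make_sample_package(generator):
    """Build a tiny in-memory stand-in for an ODF package (hypothetical data)."""
    meta = (
        '<office:document-meta '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:meta="{META_NS}"><office:meta>'
        f'<meta:generator>{generator}</meta:generator>'
        '</office:meta></office:document-meta>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("meta.xml", meta)
    buffer.seek(0)
    return buffer


sample = make_sample_package("OpenOffice.org/2.3")
print(read_generator(sample))
```

Passing a path such as a local .odt file to read_generator works the same way, since zipfile.ZipFile accepts both paths and file-like objects.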

Overall conclusion of this was:

Application        | Creates consistently valid ODF?
-------------------|--------------------------------
OpenOffice.org/2.0 | No
OpenOffice.org/2.1 | No
OpenOffice.org/2   | No
OpenOffice.org/2.3 | No

So should we demand that OOo not be used at all? Of course not, but we should keep the pressure on the OOo-team to fix their code ... just as we should with Microsoft and Microsoft Office.

Object-embedding in OOXML with Microsoft Office 2007

(updated 2008-04-14, added links to external resources) 

Now that the ISO-vote and approval of OOXML is done with, it is time to continue the coverage of implementing OOXML as well as ODF – this time about OOXML, Microsoft Office 2007 and embedded objects.

As I have previously said, there are always quirks when it comes to implementations of any standard in large applications. I have covered a few of these already regarding mathematical content [0], [1] and it is no different with regards to object embedding. I should say that a source of inspiration for this article was Stéphane Rodriguez’ article about binary Parts of an OOXML-file (OPC-package).

Now, embedding objects in an OOXML-file is pretty straight-forward: Simply add the object somewhere in the package and make a reference to the location and specify what kind of file you are embedding. This is very similar to how it is done in ODF.

(note: the specific schema-fragments defining how to do this were dealt with and changed at the BRM, so I will not include these until the final version of IS 29500 is released. I will update this article according to the revised spec).

As I have noted earlier, interoperability happens at application-level, so it is worth pondering a bit on how the specification is implemented in the major implementations of it. So let’s see how Microsoft Office acts when embedding objects.

What I did was this: 

I used Microsoft Office 2007, created a text-document and embedded an object in it – in this case an OpenOffice.org Calc Spreadsheet. The spreadsheet is also inspired by one of Stéphane Rodriguez’ articles, the infamous “OOXML is defective by design”.

 

The object is inserted and displayed in the document. When activating the object, I can edit it as if it was in OOo Calc itself. Actually it is OOo Calc itself. It is invoked using OLE and as a side-note it shows a cool thing about OLE – or similar other object linking techniques. Microsoft Office 2007 does not know anything about OpenOffice.org, yet it is still able to invoke the application and edit the embedded object.

 

Ok – now let’s look at the OOXML-file created. In the file document.xml the following fragment is located:


The <v:shape>-element is part of the nasty VML-dependency that luckily was dealt with at the BRM. This will be replaced by DrawingML in the final IS 29500. The <o:OLEObject>-element specifies the type of the embedded object (“opendocument.CalcDocument.1”) and the location of it (“rId5”). There is really nothing platform dependent here in the OOXML-markup.

What is more interesting, though, is looking at the Calc-object after it is embedded. By navigating through the relationship-model of the OPC-package, the embedded object is located.
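The navigation step can be sketched in a few lines of Python. The relationships for a part live in a "_rels" folder next to it, and each relationship maps an id such as "rId5" to a target part. The sample package below is invented for illustration (in particular the part name "embeddings/oleObject1.bin" is an assumption, not taken from the file discussed here), and the sketch ignores external and absolute targets.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def resolve_relationship(package, source_part, rel_id):
    """Map a relationship id used in source_part to a part name in the package."""
    folder, name = posixpath.split(source_part)
    rels_part = posixpath.join(folder, "_rels", name + ".rels")
    root = ET.fromstring(package.read(rels_part))
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Id") == rel_id:
            # Relative targets are resolved against the folder of the source part.
            return posixpath.normpath(posixpath.join(folder, rel.get("Target")))
    return None


def make_sample_package():
    """Build a tiny in-memory stand-in for an OPC package (hypothetical data)."""
    rels = (
        f'<Relationships xmlns="{RELS_NS}">'
        '<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/'
        'officeDocument/2006/relationships/oleObject" '
        'Target="embeddings/oleObject1.bin"/>'
        '</Relationships>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<placeholder/>")
        package.writestr("word/_rels/document.xml.rels", rels)
        package.writestr("word/embeddings/oleObject1.bin", b"placeholder bytes")
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(make_sample_package()) as package:
    print(resolve_relationship(package, "word/document.xml", "rId5"))
```

Once the part name is known, package.read() on that name returns the raw bytes of the embedded object for further inspection.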

 

One might think that this file was simply the Calc-file renamed, but sadly this is not so. This file is actually the Calc-file wrapped in an OLE2 Compound file (“CF”). The CF-file is basically a stream wrapper which allows a number of streams to be persisted in a file as well as information about these streams. Using one of the many CF-viewers you can get the data of the wrapped file itself as well as the persisted information of it, here “com.sun.star.comp.Calc.SpreadsheetDocument _   Embedded Object _   opendocument.CalcDocument.1”.

 

 

Technically this is really not a big deal – there are well-known ways to manipulate these files on all platforms and most programming languages and extracting the required data should really be a no-brainer. OpenOffice.org is licensed under LGPL, so you can use the source-code from this to figure out how to do it on the platforms supported by OpenOffice.org. It is also pretty evident why Microsoft Office 2007 works this way. Microsoft Office 2007 is the latest incarnation of the Microsoft Office Suite – a suite that has depended on this file format since at least 1999 … and of course on OLE itself as well. So if you want to implement a document consumer, this is simply something to be aware of when consuming OOXML-files.

From the perspective of a developer, however, this is really annoying. I would definitely opt for Microsoft Office 2007 embedding the objects simply as the objects they are – and not wrapping them in a CF-wrapper. This is how it is done in OpenOffice.org. Granted, this suite does other weir(d) things like renaming the files and not being entirely clear how to embed all object types, but the objects are embedded as they are (unless they are OpenDocument objects). This is a benefit to me as a developer when examining OOXML-files, because I can simply extract the object in question from the document package and verify the file.

So this might be the first new post-vote change-modification to IS 29500:

 

When embedding objects an application shall not modify or wrap the embedded object in any way before embedding it in the package. When a document consumer encounters an embedded object, this shall not be converted to another object type without knowledge-based confirmation by the user.

 

This (or similar wording in standard-lingo) would prevent Microsoft Office from wrapping objects in CF-wrappers, but it would also prevent applications like OpenOffice.org on SUSE from converting embedded Excel-objects to Calc-spreadsheets. FYI, this kills interop too.

A final request: Microsoft, please, as you must already be implementing the changes from the BRM for Office 2007, would you be so kind to make this change to the application as well? It should really be a no-brainer, and if there should be any requirements in your code for the CF-files, feel free to load the objects, wrap them in an in-memory CF-file and take it from there.


Interoperability - between what?

What is interoperability, really?

Well, when it comes to document formats, some people seem to think that interoperability is the ability to transform one format to another, and that high-fidelity interoperability can only be achieved when it is possible to perform a complete translation/conversion of format X to format Y.

The basic problem with this premise is that if you were able to do this conversion, it would be the same as being able to make a 1-1 mapping between the functionality and features of format X and format Y (and vice versa). However - this effectively means that format X is actually just a permutation of format Y ... making format X and format Y the same format (pick up your favorite book on mathematical topology to see the details).

When it comes to ODF and OOXML, the case is pretty clear - the two formats are not the same. Sure - they can both define bold text, but there are quite a few differences between the formats. A list of some of them can be found at the ODF-Converter website. I think that the list is the best argument for not being able to do a complete conversion of ODF to OOXML (and back). This was also one of the conclusions of the Fraunhofer/DIN-work in Germany, where they concluded that a full 1-1 mapping between the two formats could not be done.

The key question here is: Is interoperability diminished by this fact?

If you ask Rob's posse, they will almost certainly say "Yes". They will say something like "Microsoft chose not to make OOXML interoperable with the existing ISO-standard ODF and therefore OOXML is a blow to interoperability".

If you ask me, I will say "No". I will say no because the term "interoperability" has been hijacked by the anti-OOXML-lobby in much the same way the SVG-namespace was hijacked by ODF TC. I will say "No" because interoperability means something radically different. The meaning is not rocket science, really ... and usually most people agree with the basic definition of interoperability. A few of those definitions are:

Computer Dictionary Online:

http://www.computer-dictionary-online.org/interoperability.htm?q=interoperability

The ability of software and hardware on multiple machines from multiple vendors to communicate.

IEEE: 

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&isnumber=4683&arnumber=182763&punumber=2267

the ability of two or more systems or components to exchange information and to use the information that has been exchanged

US e-Government Act of 2002:

http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=107_cong_public_laws&docid=f:publ347.107.pdf

ability of different operating and software systems, applications, and services to communicate and exchange data in an accurate, effective, and consistent manner.

If you also look at the enormous list from Google you will see that none of the definitions talk about the ability to convert formats. Instead they talk about communication between machines, platforms and networks. This is very close to my definition of interoperability when it comes to document formats.

The interoperability gained by using a specific document format is based on the possibility of implementing the format on any kind of platform, in any kind of software, using any kind of operating system. It is based on how concise and clear the language of the specification of the format is, and it depends on how well thought out the specification is.

It has nothing, nothing, nothing to do with the possibility of converting the format to any other format. 

What's up with OLE?

A few weeks back I wrote an article about how Microsoft Office 2007 deals with password-protection of an OPC-package, since this feature is not a part of the OOXML-specification. The answer I found was that Microsoft Office 2007 persists the password-protected file as an OLE2 Compound File ... more commonly known as an "OLE-file". I also concluded that using OLE2 Compound Files is not a problem - and certainly not an issue regarding OOXML.

Now - the whole topic around OLE has been at the front row of the worldwide debates regarding OOXML. My personal opinion is that the people jumping up and down screaming about problems with OLE ... really haven't understood what OLE is.

So let me start with a small recap of what it is really all about.

... there is OLE and then there is OLE 

First of all:

there is "OLE" and then there is ... "OLE"

... or put in another way:

there is the "OLE-technology" and then there is the "OLE-file"

or in a third, more correct, way:

there is the "OLE application technology"  and then there are "Compound Files".

The former is the technology that - on the Windows platform - enables a program to use the UI of another program ... without launching the entire application itself. I mostly use this when editing MS Visio-documents in Word, but another use is embedding an Excel spreadsheet in an MS Word document. The OLE-technology itself is a tool on the Windows-platform that all applications can - and do - use to enable "utilizing other applications in their own applications". It is important to understand here that there is (today) nothing really revolutionary about OLE. Another similar technology on the Windows-platform is DDE, and on the Linux-platform it could be KParts and Bonobo. These technologies simply enable one program to communicate with another (simply put).

But what about these OLE-files?

Well, Compound Files are actually not dependent on OLE-technology. Or put in another way: you don't need OLE-technology to read and use the contents of a Compound File. Compound Files are just files. A Compound File is a collection of persisted streams - actually much like a ZIP-archive. Most commonly it is used because it brings the ability to "utilize a file system within a file". Of course you will need to know how to use the contents of the file, be it created by OpenOffice, Corel Draw, Adobe Acrobat or any application that might store its files using Compound Files. But this is separate from being able to read and write the contents of a Compound File.
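To underline that Compound Files are just files, here is a small Python sketch that recognises one by its leading signature bytes, on any platform and without any OLE-technology involved. The function name is made up for illustration, and the sketch only sniffs the header; it does not parse the streams inside.

```python
# Signature bytes at the start of a Compound File and of a ZIP local file header.
CF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_container(data):
    """Classify a byte string by its leading signature bytes."""
    if data.startswith(CF_SIGNATURE):
        return "compound file"
    if data.startswith(ZIP_SIGNATURE):
        return "zip archive"
    return "unknown"


# Fabricated sample data: only the first bytes matter for the check.
print(sniff_container(CF_SIGNATURE + b"\x00" * 504))
print(sniff_container(b"PK\x03\x04rest of archive"))
```

Reading the individual streams takes more work than this, but it is ordinary file parsing and nothing that requires Windows.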

Ok - I will not bother you any more with this. You should check out the original article about OLE and also look into the specification of the binary formats for Microsoft Office95 - Office2007, available from Microsoft. It is actually quite interesting. Just remember that OLE-technology and Compound Files are not the same thing.

And now for something completely different (kind of)

In the lab-tests I have been part of for the Danish Government (National IT and Telecom Agency) we have tested OLE-interoperability. It is important since it is quite normal to embed e.g. a spreadsheet file in a text-processing file. So it is important that the contents of the file are actually usable when receiving it and opening it using another application or on another platform. In this setup we only tested Compound File interop and not interop between OOXML and ODF.

What we did was this:

We created an ODF-file using OpenOffice where we embedded an Excel-spreadsheet (binary .XLS-file) (on the Windows-platform)

We sent this file to a number of different platforms and applications

  • Windows XP using OpenOffice.org 2.3 DA
  • Windows XP using OpenOffice Novell Edition
  • Linux using OpenOffice Novell Edition
  • Linux (SLED) using IBM Lotus Notes 8

We tried to open the file and documented what happened.

# | Setup | What happened?
--|-------|---------------
1 | Windows XP using OpenOffice.org 2.3 DA | OpenOffice.org opened the document and correctly displayed the contents of the spreadsheet. It was possible to edit the spreadsheet and save it back into the ODF-container.
2 | Windows XP using OpenOffice Novell Edition | OpenOffice Novell Edition opened the document and correctly displayed the contents of the spreadsheet. It was possible to activate the spreadsheet but only in "read-only"-mode.
3 | Linux using OpenOffice Novell Edition | OpenOffice Novell Edition opened the document and correctly displayed the contents of the spreadsheet. It was possible to activate the spreadsheet but only in "read-only"-mode.
4 | Linux (SLED) using IBM Lotus Notes 8 | Lotus Notes 8 opened the document and correctly displayed the contents of the spreadsheet. When activating the spreadsheet the user was prompted to convert the spreadsheet. When accepting this it became editable and when saving it back into the ODF-container, the spreadsheet was persisted as an Open Document Spreadsheet.


So what we saw was basically 3 different approaches to handling the embedded object. In general the Excel-object (Compound File) itself was not a problem - regardless of application and platform. No combination had problems with opening the file and displaying the contents - even on platforms without OLE-technology present. The difference was in the applications and their handling of the object. OpenOffice.org presented the approach that most people would expect: it allowed editing the embedded object and saving it back into the container. OpenOffice Novell Edition allowed activating the embedded object but not saving it back into the container, and Lotus Notes 8 took the approach of converting the Excel-object to an Open Document Spreadsheet.

A conclusion?

Well, we took great care not to conclude much - that was not for us to do, we merely provided the technical background for post-lab conclusions. However - the pattern emerging from the description above was similar to a pattern we saw a lot. The problems were not in incompatibility between the formats but instead in how the applications and converters dealt with the formats. We also saw no indications that any of the formats were tied to a specific platform. There were no platform problems with round-tripping - or to put it more clearly: the problems we saw when round-tripping documents were not caused by incompatibilities between the platforms (e.g. Linux and Windows) but by different behaviour in the applications implemented on either platform.

So is this good or bad news? Well, as always, truth lies in the eyes of the beholder ... but I think it is good news. 

What is a conversion, really?

I have been part of some work for the Danish National Telecom and IT Agency (IT- og Telestyrelsen). They have coordinated quite a few projects around the country to evaluate the usage of ODF and OOXML and possible problems with co-existence of the two document standards. The website for this work is at http://dokumentformater.oio.dk .

The basic setup for the projects and tests has been:

How does a particular department handle the two document formats and possible conversion between them?

Which problems will arise given their current software install-base?

Is it possible to provide some guidance to the departments regarding which specific features of a document format to avoid since they cause problems?

In other words it has been a rather pragmatic approach based on trying to answer the question: "Why do you experience the problems you see?"

Observations

The first thing we realized during the very first day was something quite crucial:

We were not testing compatibility between two formats - instead we were testing quality of converter-tools and compatibility between the specific format and the internal object model the format is loaded into.

Converter-tools

Both OOXML and ODF are rather immature document formats in the market today since neither of them has a broad market penetration as such. Despite the document count on Google, ODF is not widely used and most people still save their work in .DOC-files - even though they have Microsoft Office 2007 installed. This means that conversion between them is also rather immature and this affects the quality of the converters and the results of converting between one format and another. The ODF-Converter project has an extensive list of the differences between the formats themselves and also a list of features currently not supported by the converter, and similar lists exist of features not supported by the other tools used. Luckily it seems that the quality of the converters is drastically improving with each incremental release.

We also noted that a converter is not "just a converter". It lives and breathes in the application in which it is installed. This was of particular interest when looking at the ODF-Converter Office Add-In and the SUN OOXML-converter. They are both add-ons to existing Office applications but the application behaviour we saw was in principle the same when using OpenOffice.org, IBM Lotus Notes 8 or OpenOffice Novell Edition.

The problem lies in the fact that every application has an internal object model that determines how a document is held in memory in the application. The binary format for Microsoft Office files was essentially a binary dump of the current memory in the application and this basically goes for a lot of applications with binary file formats. Anyway - regardless of how a document is "converted" or "transformed" using another application than the originator, at the end of the day it has to be loaded into the internal object model of the receiving application. This essentially means that unless there is a 100% air-tight 1-1-mapping between the document format and the internal object model ... information will be lost. This was one part of the problem - the other was the sequence of conversion. Take a look at the sequence listed here:

Step | Sequence 01                                | Sequence 02
-----|--------------------------------------------|------------------------------------------------
1    | Load original format                       | Load original format
2    | Convert format to new format               | Load original format into internal object model
3    | Load new format into internal object model | (make changes)
4    | (make changes)                             | Persist as new document format
5    | Persist as new document format             |

It is not entirely evident that the two sequences will produce the same output, and we have seen no evidence that any of the applications tested actually had a 1-1 mapping between (any) document format and their internal object model. This also goes for Microsoft Office and its corresponding file types and for OpenOffice itself. In short, this was a fact that we had to deal with in our tests.
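A toy model in Python makes the difference between the two sequences easier to see. Everything in it is hypothetical: a "document" is just a dictionary of features, and the feature sets supported by the converter and by the internal object model are invented for illustration, not taken from any of the applications we tested.

```python
# Hypothetical feature sets; none of these names come from a real application.
CONVERTER_SUPPORTS = {"text", "bold", "table"}
MODEL_SUPPORTS = {"text", "bold", "footnote"}


def convert(document):
    """Stand-alone converter: keeps only the features it knows how to map."""
    return {k: v for k, v in document.items() if k in CONVERTER_SUPPORTS}


def load(document):
    """Import filter: keeps only what the internal object model can hold."""
    return {k: v for k, v in document.items() if k in MODEL_SUPPORTS}


def persist(model):
    """Write the in-memory model back out as the new format."""
    return dict(model)


original = {"text": "Hello", "bold": True, "table": "2x2", "footnote": "see p. 3"}

sequence_01 = persist(load(convert(original)))  # convert first, then load
sequence_02 = persist(load(original))           # load directly, then persist

print(sorted(sequence_01))
print(sorted(sequence_02))
```

In this toy setup Sequence 01 loses the footnote because the converter drops it before the application ever sees it, while Sequence 02 keeps it. Both lose the table, because the object model cannot hold it in either case.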

On a funny note:

The conversion tools we used were all based on XSLT-transformation between the document formats. They are both XML-formats, so it is a good choice. However, we heard rumours that Novell would dump their OOXML-converter (based on XSLT) and develop their own converter based on the internal object model. It will be interesting to see, if it brings greater quality to the converters.

On a lighter note:

We saw in our tests that using the binary Microsoft Office file format as a middle-man when converting from OOXML to ODF (and back) actually produced the best results ... by a long shot. Having this step and using the binary Office file format as a type of "Lingua Franca" was more or less the key to "flawless conversion". If you stop and think about it, it makes perfect sense why we saw this. The Microsoft Office binary file format is well established in the market (not thanks to Microsoft, but to reverse engineering) and the format has been around for a long time. Basically, all applications can read it and all applications can write it. But why is this interesting? Well, OOXML is an XML-version of the binary Office file format, so since there are "no problems" with converting from the binary format to ODF, it should be technically relatively easy to convert from OOXML to ODF, since OOXML is an XML version of the binary file format.

It is just a matter of time ... and continuous improvement of the format converters.