a 'mooh' point

clearly an IBM drone

Losing data the silent way - ISO8601-dates

In Prague we spent quite some time discussing how to deal with the fact that applications supporting ECMA-376 1st Ed. not necessarily support ISO/IEC 29500:2008 strict as well. Our talks revolved primarily around how major implementations dealt with the modified functionality of the elements <cell> and <v> in SpreadsheetML now that ISO-dates are allowed as content of the <v>-element. But “dates in spreadsheets” is not the only place where changes occurred. Changes were also made to other areas, including

  • Object embedding
  • Comments in spreadsheets
  • Hash-functions for document protection

This will be the first post in a series of posts evolving around how IS29500 differs from ECMA-376 and how existing applications behave when encountering a document with new content. What I will do here is to create some sample documents and load them in the applications I have access to that supports OOXML the best. In my case these are Microsoft Office 2007 SP2, OpenOffice.org 3.0.1 and NeoOffice for Mac and Apple iWorks. If you want to contribute and you have access to other applications, please let me know the result and I’ll update the article with your findings. If you have access to Microsoft Office 2007 SP1, I'd really like to know. When the series is done I’ll post a bit about MCE and how it might help overcome some of the problems I have highlighted (if we’ll get to change the namespace for the strict edition of IS29500 schemas)

I should also note that as the series progresses, the examples I make will increase in complexity. A consequence of this will be that my examples will be more of a “magic-8-ball-type prediction” than “simple examples of IS29500-strict documents”. Since there is not a single application out there supporting IS29500-strict, the examples will be my “qualified guesses” to how applications might interpret IS29500-strict when they implement it.

ISO-8601 dates in SpreadsheetML

Let me first touch upon the problem with dates in SpreadsheetML since this was the problem we talked about the most. Gareth Horton from the UK national body hand-crafted a spreadsheet document with these new dates. I have modified his example a bit to better illustrate the point. Files are found at the bottom of this post.

In the original submission to ISO dates were persisted in SpreadsheetML as “Julian numbers” (serial representation) and subsequently formatted as dates using number format styles.

[code=xml]<sheetData>
  <row r="1">
    <c r="A1" s="1">
      <v>39904</v>
    </c>
  </row>
  <row r="2">
    <c r="A2" s="1">
      <v>39905</v>
    </c>
  </row>
(…)
  <row r="10">
    <c r="A10" s="1">
      <v>39913</v>
    </c>
  </row>
</sheetData>[/code]

So the above would create a column with 10 rows displaying the dates from April 1st to April 10th.

Let’s change one of the cells to contain a date persisted in ISO-8601 format.

[code=xml]<row r="9">
  <c r="A9" s="1" t="d">
    <v>2009-04-09T01:02:03.04Z</v>
  </c>
</row>[/code]

So the cell contains an ISO-8601 date and it is formatted using the same number format as the other cells. I have added a bit of additional data to the spreadsheet to illustrate the problem with using formulas on these values.

Result

The interesting thing to investigate iswhat happens when this cell is loaded in a popular OOXML-supporting application. Note here that the existing corpus of implementations supporting OOXML supports the initial edition of OOXML, ECMA-376 1st Ed.So they would have no way to look into the specification and see what to do with a cell containing an ISO/IEC 8601 date value.

Microsoft Excel 2007 SP2

As you can see Excel 2007 screws up the content of the cell. And on top of that, should you try to manipulate the content of the cells with formulas, they are also basically useless. The trouble? Well, you are not notified that Excel 2007 does not know how to handle the content of the cell, so chances are that you’ll never find out – until you find yourself in a position where there are real consequences to the faulty data and kittens are killed.

OpenOffice 3.0.1 Calc

 

 

The result here is almost the same. Data is lost and the user is not notified.

NeoOffice for Mac

 

Again we see the same result. This is not so strange, since the latest version of NeoOffice shares the same code base as OOo 3.0.1 so behavious should be the same.

iWorks 09 Numbers

 



Wow, so for iWorks on the Mac, the user is actually warned that something went wrong. Only trouble is - it does not warn you that the content of the cell is not valid - it informs you that the system cannot find the font "Calibri".

Conclusion

It is pretty hard to conclude enything but "this sucks!". None of the applications warn the user that they have lost data - and they all do exactly that - loose data.

Original file: Book1.xlsx (8.82 kb)

Modified file: 

book2.xlsx (8.22 kb)

The complexity of SpreadsheetML - oh the sheer joy of it!

Having a bit of time on my hands while attending the SC34/WG4-meeting in Okinawa, I thought I'd write up a blog post I have wanted to write in quite some time.

The reason for me doing this was a requirement I am often presented by CIBER's customers - export my data to Excel. The data they want us to export are traditionally grouped into three categories:

  • Text (strings)
  • Numbers
  • Dates

Creating cells with numbers and text are really a no-brainer in OOXML. It is a bit more complicated when it comes to dates, because dates in e.g. ISO 8601-format are not as such supported as "built-in cell data types" in SpreadsheetML. Instead, dates are presented by styling content in number-cells. This means that to be able to display a date in SpreadsheetML, you need to be know "a bit" about styling in spreadsheets.

Now, as some of you remember, representation of dates in spreadsheets using OOXML is done in "serial form" meaning that dates are stored as numbers. These numbers are also known as "Julian days" - not to be mistaken with the "Julian Calendar". In even other words a date is represented as the number of days since some starting point in time.

So if I wanted to store the date "December 20nd 2009" in OOXML, I would have to convert it to a "julian representation" - in this case "40167". This is really just a minor annoyance - the conversion is trivial and a no-brainer. However - the fun has not started yet.

If you look at the markup required, it would have to be like this:

[code:xml]<sheetData>
  <row r="1">
    <c r="A1">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

So this will give me a cell with a serial representation of 2009-12-22. However, if I open this in an OOXML-compliant application, it will display "40167". As I mentioned above, it turns out that displaying the serial representation as a "proper date" requires styling of the cell content.

The key is an attribute on the <c>-element I omitted in the example above.

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

The "s"-attribute specified the style for the given cell. The specefication says this for this particular attribute:

The index of this cell's style. Style records are stored in the Styles Part.

Ok - cool so the good thing here is, that we now know what the attribute is used for. The bad thing is that we don't know anything about "how".

Styles for SpreadsheetML are described in section 3.8. The complete section is about 110 pages and it describes at length each element name and attribute but again it more answers "what" than "how".

(I just talked to another delegate about if a standard should describe both the hows and the whats, and it seems that the jury is still out on that one, so these are simply my personal observations of using the specification to solve a concrete problem).

So in figuring out how to do this, a good starting point would be to look at the list of valid child elements. These are defined as

[code:xml]<complexType name="CT_Stylesheet">
  <sequence>
    <element name="numFmts" type="CT_NumFmts" minOccurs="0" maxOccurs="1"/>
    <element name="fonts" type="CT_Fonts" minOccurs="0" maxOccurs="1"/>
    <element name="fills" type="CT_Fills" minOccurs="0" maxOccurs="1"/>
    <element name="borders" type="CT_Borders" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyleXfs" type="CT_CellStyleXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellXfs" type="CT_CellXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyles" type="CT_CellStyles" minOccurs="0" maxOccurs="1"/>
    <element name="dxfs" type="CT_Dxfs" minOccurs="0" maxOccurs="1"/>
    <element name="tableStyles" type="CT_TableStyles" minOccurs="0" maxOccurs="1"/>
    <element name="colors" type="CT_Colors" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
</complexType>[/code]

The elements that should (ahem) draw attention to them are "cellStyles", "cellStyleXfs" and "cellXfs".So, if you want to apply formatting directly to a cell, look at e.g. the element <cellXfs> defined in section 3.8.10. It says (in abstract)

This element contains the master formatting records (xf) which define the formatting applied to cells in this workbook. These records are the starting point for determining the formatting for a cell. Cells in the Sheet Part reference the xf records by zero-based index.

The <cellXfs>-element has a child element called <xf>. The element is defined as

[code:xml]<complexType name="CT_Xf">
  <sequence>
    <element name="alignment" type="CT_CellAlignment" minOccurs="0" maxOccurs="1"/>
    <element name="protection" type="CT_CellProtection" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
  <attribute name="numFmtId" type="ST_NumFmtId" use="optional"/>
  <attribute name="fontId" type="ST_FontId" use="optional"/>
  <attribute name="fillId" type="ST_FillId" use="optional"/>
  <attribute name="borderId" type="ST_BorderId" use="optional"/>
  <attribute name="xfId" type="ST_CellStyleXfId" use="optional"/>
  <attribute name="quotePrefix" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="pivotButton" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="applyNumberFormat" type="xsd:boolean" use="optional"/>
  <attribute name="applyFont" type="xsd:boolean" use="optional"/>
  <attribute name="applyFill" type="xsd:boolean" use="optional"/>
  <attribute name="applyBorder" type="xsd:boolean" use="optional"/>
  <attribute name="applyAlignment" type="xsd:boolean" use="optional"/>
  <attribute name="applyProtection" type="xsd:boolean" use="optional"/>
</complexType>[/code]

The attribute you want here is "numFmtId". The attribute is described as "Id of the number format (numFmt) record used for this cell format".

(are we getting there soon?)

Anywho, going to the reference of numFmt will lead you to paragraph 3.8.30 numFmt (Number Format) and it will tell you, that some of the values of the attribute are implied. That's really just another way of saying "reserved values". 

ID
formatCode
 
 0
 General
 1  0
 2  0.00
 3  #,##0
 4  #,##0.00
 9  0%
 10  0.00%
 11  0.00E+00
 12  # ?/?
 13  # ??/??
 14  mm-dd-yy
 15  d-mmm-yy
 16  d-mmm
 17  mmm-yy
 18  h:mm AM/PM
 19  h:mm:ss AM/PM
 20  h:mm
 21  h:mm:ss
 22  m/d/yy h:mm
 37  #,##0 ;(#,##0)
 38  #,##0 ;[Red](#,##0)
 39  #,##0.00 ;(#,##0.00)
 40  #,##0.00 ;[Red](#,##0.00
 45  mm:ss
 46  [h]:mm:ss
 47  mmss.0
 48  ##0.0E+0
 49  @


It looks like id 15 could be the one we are looking for. So I'm gonna add this number format to the xf-elements's numFmt-attribute and create this xml-fragment:

[code:xml]<cellXfs count="2">
  <xf numFmtId="15" (...)  />
</cellXfs>[/code]

Behold - it actually works. When I load this in Microsoft Office 2007, it will display this:



So what have I learned here (apart from the astounding complexity of this relatively trivial task)? Well, to display a date using SpreadsheetML, you need to know a bit about SpreadsheetML styles. You will also need to do a fair amount of digging in the specification as well as in existing OOXML-files, since I could not find this information anywhere. Luckily for you, the content of this blog is licensed under Creative Commons attribution license, so feel free to use it however you should wish to do so.

To sum it all up, you will need the following items to display a cell in SpreadsheetML:

1. The cell fragment

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

Notice that the cell is styled using the attribute "s" with a value of "0".

2. The style part

[code:xml]<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cellXfs count="1">
    <xf numFmtId="15" (...) />
  </cellXfs>
</styleSheet>[/code]

Notice that index "0" of the <cellXfs>-collection has a numFmt-attribute with the value "15" resulting in displaying the date correctly.

I have created a small test file based on the walk-through above and it is available here: test_dates.xlsx (2.25 kb).

And in other news:

So, you might ask, how is this done using other document formats? Well, it turns out to be drastically less complex.

ODF

[code:xml]<table:table-row>
  <table:table-cell office:value-type="date" office:date-value="2009-12-20">
    <text:p>20-12-09</text:p>
  </table:table-cell>
</table:table-row>[/code]

OOXML IS29500

[code:xml]<sheetData>
  <row r="1">
    <c r="C4" t="d">
      <v>1976-11-22T08:30Z</v>
    </c>
  </row>
</sheetData>[/code] 

Both examples above should require no additional formatting.

You might also ask, if this could have been done in any other way in OOXML? Well, as far as I read the specification, there is no way around the style-part-trouble. But you could create your own number formatting if you should wish so. I would actually prefer this angle, since it would be a step away from pre-determined (implied) values in styles and keep the package content self-contained.

You know, this could actually be the basis for a nice new defect report for WG4: "Remove all implied values in the specification and move them to the transitional Part 4".

Is there an end of it?

I know this was quite a lenghty post - but is it of any value at all - and would you like more of these investigative posts in the future?

Smile