a 'mooh' point

clearly an IBM drone

The complexity of SpreadsheetML - oh the sheer joy of it!

Having a bit of time on my hands while attending the SC34/WG4-meeting in Okinawa, I thought I'd write up a blog post I have wanted to write in quite some time.

The reason for me doing this was a requirement I am often presented by CIBER's customers - export my data to Excel. The data they want us to export are traditionally grouped into three categories:

  • Text (strings)
  • Numbers
  • Dates

Creating cells with numbers and text are really a no-brainer in OOXML. It is a bit more complicated when it comes to dates, because dates in e.g. ISO 8601-format are not as such supported as "built-in cell data types" in SpreadsheetML. Instead, dates are presented by styling content in number-cells. This means that to be able to display a date in SpreadsheetML, you need to be know "a bit" about styling in spreadsheets.

Now, as some of you remember, representation of dates in spreadsheets using OOXML is done in "serial form" meaning that dates are stored as numbers. These numbers are also known as "Julian days" - not to be mistaken with the "Julian Calendar". In even other words a date is represented as the number of days since some starting point in time.

So if I wanted to store the date "December 20nd 2009" in OOXML, I would have to convert it to a "julian representation" - in this case "40167". This is really just a minor annoyance - the conversion is trivial and a no-brainer. However - the fun has not started yet.

If you look at the markup required, it would have to be like this:

[code:xml]<sheetData>
  <row r="1">
    <c r="A1">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

So this will give me a cell with a serial representation of 2009-12-22. However, if I open this in an OOXML-compliant application, it will display "40167". As I mentioned above, it turns out that displaying the serial representation as a "proper date" requires styling of the cell content.

The key is an attribute on the <c>-element I omitted in the example above.

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

The "s"-attribute specified the style for the given cell. The specefication says this for this particular attribute:

The index of this cell's style. Style records are stored in the Styles Part.

Ok - cool so the good thing here is, that we now know what the attribute is used for. The bad thing is that we don't know anything about "how".

Styles for SpreadsheetML are described in section 3.8. The complete section is about 110 pages and it describes at length each element name and attribute but again it more answers "what" than "how".

(I just talked to another delegate about if a standard should describe both the hows and the whats, and it seems that the jury is still out on that one, so these are simply my personal observations of using the specification to solve a concrete problem).

So in figuring out how to do this, a good starting point would be to look at the list of valid child elements. These are defined as

[code:xml]<complexType name="CT_Stylesheet">
  <sequence>
    <element name="numFmts" type="CT_NumFmts" minOccurs="0" maxOccurs="1"/>
    <element name="fonts" type="CT_Fonts" minOccurs="0" maxOccurs="1"/>
    <element name="fills" type="CT_Fills" minOccurs="0" maxOccurs="1"/>
    <element name="borders" type="CT_Borders" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyleXfs" type="CT_CellStyleXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellXfs" type="CT_CellXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyles" type="CT_CellStyles" minOccurs="0" maxOccurs="1"/>
    <element name="dxfs" type="CT_Dxfs" minOccurs="0" maxOccurs="1"/>
    <element name="tableStyles" type="CT_TableStyles" minOccurs="0" maxOccurs="1"/>
    <element name="colors" type="CT_Colors" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
</complexType>[/code]

The elements that should (ahem) draw attention to them are "cellStyles", "cellStyleXfs" and "cellXfs".So, if you want to apply formatting directly to a cell, look at e.g. the element <cellXfs> defined in section 3.8.10. It says (in abstract)

This element contains the master formatting records (xf) which define the formatting applied to cells in this workbook. These records are the starting point for determining the formatting for a cell. Cells in the Sheet Part reference the xf records by zero-based index.

The <cellXfs>-element has a child element called <xf>. The element is defined as

[code:xml]<complexType name="CT_Xf">
  <sequence>
    <element name="alignment" type="CT_CellAlignment" minOccurs="0" maxOccurs="1"/>
    <element name="protection" type="CT_CellProtection" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
  <attribute name="numFmtId" type="ST_NumFmtId" use="optional"/>
  <attribute name="fontId" type="ST_FontId" use="optional"/>
  <attribute name="fillId" type="ST_FillId" use="optional"/>
  <attribute name="borderId" type="ST_BorderId" use="optional"/>
  <attribute name="xfId" type="ST_CellStyleXfId" use="optional"/>
  <attribute name="quotePrefix" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="pivotButton" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="applyNumberFormat" type="xsd:boolean" use="optional"/>
  <attribute name="applyFont" type="xsd:boolean" use="optional"/>
  <attribute name="applyFill" type="xsd:boolean" use="optional"/>
  <attribute name="applyBorder" type="xsd:boolean" use="optional"/>
  <attribute name="applyAlignment" type="xsd:boolean" use="optional"/>
  <attribute name="applyProtection" type="xsd:boolean" use="optional"/>
</complexType>[/code]

The attribute you want here is "numFmtId". The attribute is described as "Id of the number format (numFmt) record used for this cell format".

(are we getting there soon?)

Anywho, going to the reference of numFmt will lead you to paragraph 3.8.30 numFmt (Number Format) and it will tell you, that some of the values of the attribute are implied. That's really just another way of saying "reserved values". 

ID
formatCode
 
 0
 General
 1  0
 2  0.00
 3  #,##0
 4  #,##0.00
 9  0%
 10  0.00%
 11  0.00E+00
 12  # ?/?
 13  # ??/??
 14  mm-dd-yy
 15  d-mmm-yy
 16  d-mmm
 17  mmm-yy
 18  h:mm AM/PM
 19  h:mm:ss AM/PM
 20  h:mm
 21  h:mm:ss
 22  m/d/yy h:mm
 37  #,##0 ;(#,##0)
 38  #,##0 ;[Red](#,##0)
 39  #,##0.00 ;(#,##0.00)
 40  #,##0.00 ;[Red](#,##0.00
 45  mm:ss
 46  [h]:mm:ss
 47  mmss.0
 48  ##0.0E+0
 49  @


It looks like id 15 could be the one we are looking for. So I'm gonna add this number format to the xf-elements's numFmt-attribute and create this xml-fragment:

[code:xml]<cellXfs count="2">
  <xf numFmtId="15" (...)  />
</cellXfs>[/code]

Behold - it actually works. When I load this in Microsoft Office 2007, it will display this:



So what have I learned here (apart from the astounding complexity of this relatively trivial task)? Well, to display a date using SpreadsheetML, you need to know a bit about SpreadsheetML styles. You will also need to do a fair amount of digging in the specification as well as in existing OOXML-files, since I could not find this information anywhere. Luckily for you, the content of this blog is licensed under Creative Commons attribution license, so feel free to use it however you should wish to do so.

To sum it all up, you will need the following items to display a cell in SpreadsheetML:

1. The cell fragment

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

Notice that the cell is styled using the attribute "s" with a value of "0".

2. The style part

[code:xml]<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cellXfs count="1">
    <xf numFmtId="15" (...) />
  </cellXfs>
</styleSheet>[/code]

Notice that index "0" of the <cellXfs>-collection has a numFmt-attribute with the value "15" resulting in displaying the date correctly.

I have created a small test file based on the walk-through above and it is available here: test_dates.xlsx (2.25 kb).

And in other news:

So, you might ask, how is this done using other document formats? Well, it turns out to be drastically less complex.

ODF

[code:xml]<table:table-row>
  <table:table-cell office:value-type="date" office:date-value="2009-12-20">
    <text:p>20-12-09</text:p>
  </table:table-cell>
</table:table-row>[/code]

OOXML IS29500

[code:xml]<sheetData>
  <row r="1">
    <c r="C4" t="d">
      <v>1976-11-22T08:30Z</v>
    </c>
  </row>
</sheetData>[/code] 

Both examples above should require no additional formatting.

You might also ask, if this could have been done in any other way in OOXML? Well, as far as I read the specification, there is no way around the style-part-trouble. But you could create your own number formatting if you should wish so. I would actually prefer this angle, since it would be a step away from pre-determined (implied) values in styles and keep the package content self-contained.

You know, this could actually be the basis for a nice new defect report for WG4: "Remove all implied values in the specification and move them to the transitional Part 4".

Is there an end of it?

I know this was quite a lenghty post - but is it of any value at all - and would you like more of these investigative posts in the future?

Smile

Comments (23) -

Hi Jesper,

Thanks for the writeup. Regarding the s="0" style reference, is this an array with 0 as the first item in the cellXfs list/collection? Its interesting that styles are referred to by their index positions instead of names. I guess Im used to modern CSS ways of doing things ...

[btw, there is a small typo" zcellXfs> above]

yk

Hi Yoon Kit,

Yes - the index points into a collection of cell styles. So you would essentially have a collection like

  <cellXfs count="4">
    <xf numFmtId="15" (...) />
    <xf numFmtId="14" (...) />
    <xf numFmtId="2" (...) />
  </cellXfs>

The spec states:

Cells in the Sheet Part reference the xf records by zero-based index.

(as mentioned above)

Smile

For getting data in and from spreadsheets is why the Open XML SDK is so
blogs.msdn.com/.../...data-from-spreadsheetml.aspx

This is a fascinating post Jesper -- and yes, more please - in fact can we send you our OOXML problems for blog-post solutions in future? Wink

The extent to which ODF and OOXML get traction will depend to some extend on just such details as you have been exploring. As an XML-head, you created your document "as" XML, and it was -- a chore.

You also noted that the ODF equivalent was easier.

So, when Office supports ODF I can well imagine XML-minded developers preferring to create their Office documents using ODF, as it's easier (or more natural), and developers are lazy ... or at least good ones are.

On the other hand some developers (not XML-heads) will say "XML pshuii", and reach for the SDK instead. In that case the complexity of OOXML will not really matter to them ...

I continue to believe the easier-handling of ODF will count in its favour. I've already come across one commercial product where as a first step to processing Word XML documents, the developers used OpenOffice.org to convert them into more tractable ODF. Lazy, see ...

- Alex.


hAl,

For getting data in and from spreadsheets is why the Open XML SDK is so
blogs.msdn.com/.../...data-from-spreadsheetml.aspx


I did know about the article, but ... well, I think Zeyad is "cheating" when it comes to dates. If you download the code and look at line 46, you will find this code fragment:

                        if (definedName.Key.Contains("Date"))
                        {
                            DateTime d = new DateTime(1900, 1, 1);
                            d = d.AddDays(Int32.Parse(cellValue) - 1);
                            cellValue = d.ToShortDateString();
                        }

In other words, Zeyad has defined named ranges that connect cells with date values with cell-names containing the word "Date". So he is not really using OOXML to figure out if he is looking at a cell with a date. He is using the fact that he has created a pre-defined template with matching names ranges that allow him to deduct date-content from the name of the cell.

Smile

But Jesper, programming is all about cheating.

You can always use extra knowledge which you got for what the program is used for.
If I make program for things that contain informational date information like birthdays then I can easily skip the year 1900 and make the application useless for people that are 108 years or older.

hAl,

Yes - that was why I put quotes around the word "cheating". My point was not that the article was useless - quite the contrary. It just didn't help me at all with figuring out how dates are handled in OOXML.

As I wanted to be able to create a spresdsheet programatically without use of Microsoft Office, when I saw the article was based on a template generated by Microsoft Office, the value to me (and my task) went belly-up in a split-second.

(and as you perhaps noticed as well, the article assumes that the base year for this spreadsheet is 1900-01-01. That is also not necessarily always true)

Alex,

Please do send me more "challenges" if you encounter them. The best resource for OOXML-related issues is pages and fora scattered around microsoft.com, but they mostly deal with more "top-level" tasks like "how do you read from a SpreadsheetML-document using LINQ?" or "How do you add a paragraph to a WordpressingML-document?". They do not deal so much with the format itself, so if I can provide some useful information, I'd be happy to do so.

Smile

Jesper,

Thanks for this post, it really helped in a lot of ways. I am trying to figure out general formatting for an OOXML document and have found it to be quite challenging as there arent' samples out there or anything that helps. I have tried to read through the specification but I have abondoned that as I don't have the time to read through all of that.

I have started over the last couple off days to make some progress on it and this post will really help with that!

If you have done any other work on general styling it would help if you could post it here!

Thanks,

Marais

I suppose you post is about the spreadsheet part of pre-ISO OOXML?

Confusing with two rather different formats with the same name

/esni

Eskild,

Yes - my complaint is about pre-ISO OOXML, aka ECMA-376 1st Ed. Examples of how the same task is done in ODF and post-BRM OOXML, aka ECMA-376 2nd Ed, is at the end of my post.

Smile

I wouldn't say though, that the two formats are "rather different", but, true, substantial changes to the format has indeed been made.

Hello Jesper,
This post is quite helpful in understanding how OOXML deal with formats, especially adding specific formats to cells, but I have another concern that is how can we retrieve the cell content with formats. Let's say if the value of attribute "numFmtId" is included in the "reserved values", we can easily know what's the format. but if it's not "reserved" one, but custom one, the value of FormatCode will be a string, indicating the format. In this case, it's really hard for us to understand the format. When we read the FormatCode programmatically, we even have no idea if the cell has a data value or currency value, or others. Do you have any ideas on this issue, thanks very muchSmile

Jesper - thank you for the list of reserved format codes, I've obviously been looking in the wrong place; it all makes sense now!

Nick.

Could I also vote up Goodol's comment re. identifying the type of the value from the number format - how do I know the value is a number or a (Julian) date, do we have to parse each format and try and decipher it?

Well,

I asked Microsoft some time ago about how they do it, and for them it is a "lucky-guess" process or "trial and error".

I do not know of any bullet-proof way to infer a data type from e.g. a custom format, but if the spreadsheet uses one of the pre-defined ones, you could make a better guess. If e.g. "21" is used, it is likely a Time datatype, but if it is using "15", it is probably a Date. Number "48" seems to be a large integer.

Thanks for the reply Jesper - I was hoping that wasn't going to be the answer Wink

I'm happy to code the necessary lookup for the 'reserved' formats to their indicative types, however the non-reserved formats becomes a little more complicated. Apart from trying to convert something and catching an error, I'm not too sure how to approach this - my only concern is that this exception based approach will be a inefficient.

Any thoughts?

Hi Nick,

From looking at the list of number formats, if I were to implement it, this I'd do:

1. Infer the data type from the style index in the list above. This should be relatively easy to group the styles in "Dates", "Numbers", "Dates and times" etc.

2. If I encounter a custom number format, then I'd just bail and display the raw value of the cell. There is no way you could write a robust algorithm that would result in satisfactory conversions, and the people using custom formats would likely know that this is an interop-barrier.

You could, of course, just try to convert and then catch any errors. There are likely not that many different types to check for, so it wouldn't really be a big deal. Also, C# (if you use that) has a "TryParse" method on the most used primitives and that one is a bit more "cheap" to use than throwing an exception.

In all cases you are in muddy waters when it comes to custom formats. Because not only do you have to infer the data type of the cell bases on something fuzzy - you will likely also need to be able to display it correctly afterwards.

PS: Happy election day Smile

Thanks for the quick (informative) reply - that sounds like a good approach.

As for voting, that'll be later today. Fingers crossed we get a 'good' outcome, whatever that may be!

No problem Smile I'm glad I could help.

Hi,

I have a requirements that I need to create an HTML table out of the excel 2007 content using OOXML. The problem is I need to create the exact replica of the Excel in the HTML TABLE. But when I take the values from each cell, i am not able to get the Formats .

Eg: I have a value $22.00 in excel, but when I take that cell value programatically i am getting only 22.  I need to take $22.00 and put in to HTML table. Please help

Thanks in advance.

Ali

Hi Ali,

Note that the formatted string is generated in Excel at run-time and is not as such persisted in the document itself.

So what you need to do it trace/follow the style of the cell in question and find the used number-format-pattern. Then you apply that pattern in your own code (which should result in the formatted string) and then save that string in your  HTML-output.

So a real-world scenario would be this:

* Content in cell is 59

* You follow the style and discover that it uses numFormat 14 in the table above. This means that the number 59 represents a date.

* You convert 59 to a date. in .Net this would be using the method DateTime.FromOADate() and get "1900-02-28".

* This string/date is saved in your HTML-file.

Does this help?

(it's a bit tricky - I know)

Smile

Hey thanks a lot for the reply. But I found out another solution for this problem. I am getting the "FormatCode" of all the Cells which is having the attribute  "StyleIndex ". Then by using the "Microsoft.Office.Interop.Excel", there is a method called WorksheetFunction.Text("Value","FormatCode"), I can convert the value to the corresponding format.

But the problem now I am facing is that, I am not able to convert Percentage and Scientific values, since it does not have a FormatCode.

If you have any idea to solve this, please help me.

Thanks in advance.

Hi Ali,

about your alternative approach: well, just consider my input as the "document-format way to do it". I am glad that Microsoft has made some code that makes it a bit less trouble-some to do.

Smile

About you other questions - let me think about it an get back to you ... I don't have a definitive answer at hand.

Comments are closed