Office Programming in .NET: The Easy Way

Imagine you’re a .NET developer working for WeLikeReports inc. Suddenly you get taken over by BigCompany Ltd, and the new managers want to review all your documentation. Yes, I know, you never got round to writing the documentation. Just pretend you have some. All your documentation is stored as Word documents which happen to be headed with your old company name WeLikeReports, and your line manager thinks it’d really impress his new boss if that was replaced by the text BigCompany Ltd in every document. Of course you’re the one that gets asked to do the work. And being a developer, the idea of manually making the same edits to dozens of documents fills you with horror so you look for an automated solution.

Welcome to the world of Office programming.

The traditional time-honoured way to do Office programming is using the COM interop API that Word exposes.This API exposes the same objects you’d use in Word VBA macros, so they can be called from any external app that understands COM. For a C# or VB developer this seems great because it’s quite simple to invoke COM objects from .NET. And even better, Microsoft have provided primary interop assemblies (don’t ask) for Office to make it even easier. All you have to do is bring up the Add Reference dialog in Visual Studio, select the Office assemblies and you’re ready to get coding!

AddOfficeRef

First you need some code to have your .NET app launch Word (I’m using Word in this example but it’s the same principle for Excel and Powerpoint):

using Microsoft.Office.Interop.Word;
// ...
string wordFilePath = @"C:\Test Documents\SomeDoc.docx";
Application app = new Application();
Document doc = app.Documents.Open(wordFilePath);
// do whatever you want to do here. Be sure to close the doc and quit the word app afterwards.

And now you are about to descend into the World of Pain that is the Office COM interop API.

Don’t do it!

There are several problems with using Office COM interop from .NET, but they all basically stem from the fact the Office COM objects were designed back in the 1990’s to let people write macros in Office. That’s what they were for. And let’s give credit where it’s due, back in those days the idea of using Visual Basic to write macros inside an application was a huge step forward. In the 1990s enterprise architecture was in its infancy, as was the very idea of different apps talking to each other.

Trouble is, what was a great idea 15 years ago isn’t necessarily a great idea today. As an example, something you’ll quickly come up against if you use the COM interop layer is that the objects you invoke run inside a separate Word (or Excel, or Powerpoint) process. They are after all objects for Office macros. Unfortunately these objects also assume you’re sitting at the computer ready, for example, to respond to any dialogs. And yes there will be dialogs. Even if you pass in every conceivable parameter to the API functions to say ‘please don’t show me a dialog. No I don’t care if you’ve just invented the elixir of eternal youth. I don’t want to see a dialog about it,’ Word will occasionally still pop up a dialog. Imagine if your app is actually a Windows service running without UI permissions and it uses the COM interop API to generate thousands of reports at 2am every Sunday morning and one day it crashes because Word tried to pop up a dialog!

Of course, people being people, workarounds will be found. It’s been seriously suggested that you might dismiss some dialogs by programmatically simulating a user responding to them! Doesn’t that sound like madness? Well, actually that was a suggestion by none other than Microsoft.

Add to this the fact that by invoking Office you are going cross-process which of course hits performance. Then you have the documentation which was mostly written in the 1990s when standards of documentation from Microsoft were – shall we say – not as good as today. And an object model that almost defies imagination in places (if you want to manipulate text inside a paragraph, you get a Range object from the paragraph. But then a Range object itself can contain paragraphs. Great object hierarchy!).

I could go on but I think you’re getting the picture that coding with the Office COM API is not really something you want to be doing unless you really have to.

Now here’s the big secret.

There’s a much better way of programming with Office documents.

You see, if you’re reading or writing an Office document, what you’re basically doing is reading or writing a file. You don’t need to load office for that, do you. All you need to do is write a program that can parse the file format – say the .docx file, create a document object model, and you’re away. Seriously.

Hang on. Parse a .docx file? Or a .xlsx file? Has TechieSimon gone insane already, on only the 1st article?

No, listen. You may recall that when Office 2007 came out, Microsoft changed the file format. Suddenly, instead of .doc files there were these new .docx files. Instead of .xls you had .xlsx. The sometimes you had xlsm or docm. Of course at the time you probably muttered dark words about backwards compatibility and why do MS have to keep changing everything.

But there was reason. You see the old .doc files were binary files. They were proprietary. Microsoft did eventually release the format, and Joel Spolsky wrote a great article
about some of the design decisions that had gone into it, including that bits of the file was formatted so they would load straight into memory unchanged – a big performance consideration in the early 1990s. And yes, even though the format of those files was now available, it was complicated. You’d have to be an utter masochist to try to parse them yourself.

However, one reason Microsoft replace .doc with .docx files is that knew this binary format was a problem and moreover, they wanted you to be able to parse the files.

So what is a .docx file? Well, if you try and open one in a text editor (go on – try it on one of your documents you have lying around!) it looks binary too. But that’s only because it’s zipped up.

The trick is to rename it to whatever.docx.zip. Now open it. You’ll see it really is a zip archive file and it has lots of files and folders inside it. Here’s a .docx I opened earlier (Actually this very article).

OfficeDocxZip

Most of these folders contain meta and other information. But of course what you want to see is the document itself. That’s (usually) at word/document.xml, or sometimes at documents/document.xml. If I open this ‘file’ I can see this.

<?xmlversion="1.0"encoding="UTF-8"standalone="true"?>
<w:documentxmlns:wne= (lots of XML namespaces snipped) >
  <w:body>
    <w:p w:rsidP="00611137"w:rsidRDefault="004B0A60"w:rsidR="004B0A60">
      <w:pPr>
        <w:pStylew:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Office Programming in .NET: The Easy Way</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidRDefault="004B0A60"w:rsidR="004B0A60"/>-<w:pw:rsidRDefault="004B0A60"w:rsidR="004B0A60">
      <w:r>
         <w:txml:space="preserve">Imagine you're a .NET developer working for WeLikeReportsinc.  Suddenly you get taken over by BigCompany ltd, and the new managers want to review all your documentation. Yes, I know, you never got round to writing the documentation. Just </w:t>
...

There’s your Word document. And it doesn’t look hard to parse at all, does it! Even better, the format is an industry standard XML schema format (ECMA-376 or ISO-29500). It’s the Office Open XML standard and these .docx and .xlsx etc files are actually Open Office XML files.

Now guess what:

If your word document is in XML format, then you don’t need to use the Word COM interop API to read it. System.XML.Linq will do perfectly well. And if you don’t need the Word COM interop API, then you don’t even need Office installed. The only problem is you need to figure out the XML. And yes, if you want to process every esoteric feature of Office then you have a lot of work. Good luck to you in that case because the Open Office specifications run to well over 6000 pages! But I’m guessing like all of us you’ve bashed your way through XML files before, using common sense to figure out the bits of the schema you need, and if all you’re doing is extracting some information or maybe making some minor change to a document or a spreadsheet, that may well be enough, you won’t need to touch those 6000 page docs.

But it gets even easier than that because Microsoft have produced an API to read and write Open Office XML. It’s called the Open XML SDK (currently version 2.5
). It’s an amazing set of .NET classes that can read (or write) Office Open XML files and represent the documents as strongly typed objects, which means you don’t have to fiddle (much) with the XML. For example, the fact that a document body ( element) contains paragraphs () which in turn contain text runs () translates into an instance of the class DocumentFormat.OpenXml.Wordprocessing.Body, which contains instances of DocumentFormat.OpenXml.Wordprocessing.Paragraph and that in turn contains instances of DocumentFormat.OpenXml.Wordprocessing.Run. This means that you can write code like this to – say – insert a new paragraph into a document:

string wordDocPath = @"C:\Documents\SampleDoc.docx";
using (WordprocessingDocument doc = WordprocessingDocument.Open(wordDocPath, true))
{
	Text text = new Text("This is a new paragraph I'm inserting");
	Run run = new Run(text);
	Paragraph paragraph = newParagraph(run);
	doc.MainDocumentPart.Document.Body.InsertAt(paragraph, 0);
	doc.MainDocumentPart.Document.Save();
}

Easy, huh!

While it’s at it, the SDK also hides away other bits of nastiness like that fact that there have already been several slightly differing versions of the Open Office standard. Altogether, it’s an SDK that really deserves a lot more publicity than it’s been given. There really should be a law compelling .NET developers to learn about it before attempting to manipulate Office documents from code.

So next time you don’t fancy editing 100 documents by hand, you know what to do…

Next week: Visual Studio Monochrome (part 1).

Advertisements

4 thoughts on “Office Programming in .NET: The Easy Way

  1. Hi Simon,

    I tried your above code. It worked well but I have a bookmark in table with two columns. I have to pass a dataset, loop thru and insert multiple rows in table. Please guide me with example. Its very urgent. appreciate your reply.

    Thanks

  2. Thank you for the article! I wonder whether an open document could be edited in the same way (without using the COM interop)?

Comment on this article (your first comment on TechieSimon will be moderated)

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s