Friday 5 June 2015

Extracting Pictures from MS Office (2007)

It extracts the pictures or it gets the hose! Er, Sorry about that ... Python can be a little unco-operative at times ;)


An MS Office (2007) document consists of a group of files zipped together into one archive file. Pictures are stored in a "media" subfolder and are linked to the document via relationships declared in various XML files. A quick Google did not find an existing Python script to extract MS Office (2007) pictures, so this post intends to show how we can create a basic image extraction Python script (msoffice-pic-extractor.py). You can download it from my GitHub page.

This post was inspired after Jared Greenhill (@jared703) retweeted a David Koepi (@davidkoepi) tweet containing this link.

So thanks to them, monkey had a reason to get off the couch ... and sit in front of a PC instead :)

We begin by unzipping the content of the various MS Office files (.docx, .xlsx, .pptx) and noting how they are arranged. You can use 7-zip (in Windows) or the Archive Manager (in Ubuntu) to view an MS Office document's component files/sub-directories.
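If you would rather peek inside from Python than from a GUI archive tool, the standard zipfile module can list the component files. This is only a minimal sketch: the helper name list_office_members is made up for illustration, and the in-memory archive is a dummy stand-in for a real .docx so the snippet runs on its own.

```python
import io
import zipfile


def list_office_members(source):
    """Return the member names inside an Office 2007 (zip) container.

    `source` may be a filename or a file-like object.
    """
    with zipfile.ZipFile(source) as z:
        return z.namelist()


# Build a tiny stand-in archive in memory so the sketch runs without a real .docx.
# The member names mimic a Word document layout; the contents are dummies.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/media/image1.jpeg", b"not a real jpeg")

for name in list_office_members(buf):
    print(name)
```

Passing a real filename (eg "testdoc.docx") instead of buf should list that document's actual components.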

MS Word 2007

Word images are stored under the zip archive's word/media directory and are named generically. eg image1.jpeg

Word images are stored under the word/media directory


Image metadata is stored in word/document.xml using the <wp:docPr> XML element tag.

Word image metadata is stored in word/document.xml

This metadata includes the source picture's filename under the "descr" attribute. For example:
<wp:docPr id="1" name="Picture 0" descr="Hex-and-BADCOFFEE.png"/>

MS Powerpoint 2007

Powerpoint images are stored under the ppt/media directory and are named generically. eg image1.jpeg

Powerpoint images are stored under the ppt/media directory



Image metadata is stored (per slide) under the ppt/slides/ directory. Each slide's XML file is named generically. eg slide1.xml, slide2.xml

Powerpoint image metadata is stored per slide in ppt/slides/


Image metadata for slides is stored using the <p:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<p:cNvPr id="4" name="Picture 3" descr="Hex-and-BADCOFFEE.png"/>

Note: Both "name" and "descr" were set to string values for pictures. Other (non-picture) instances of the <p:cNvPr> element may also exist but they will not typically set both the "name" and "descr" attributes. So this gives us a tentative way of identifying picture metadata.
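The "both attributes set" heuristic can be sketched as follows. The slide XML here is a hypothetical, heavily trimmed stand-in (a real slide has far more nesting), and find_picture_metadata is an illustrative name rather than a function from the script. The namespace URI is the one PresentationML slide files declare for the "p" prefix.

```python
import xml.etree.ElementTree as ET

# Namespace URI declared for the "p" prefix in PresentationML slide files.
NS = {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"}

# Hypothetical, heavily trimmed slide XML for illustration only.
SLIDE_XML = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">'
    '<p:cNvPr id="1" name=""/>'
    '<p:cNvPr id="2" name="Title 1"/>'
    '<p:cNvPr id="4" name="Picture 3" descr="Hex-and-BADCOFFEE.png"/>'
    '</p:sld>'
)


def find_picture_metadata(xml_text):
    """Return (name, descr) pairs for cNvPr elements that set both attributes."""
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.findall(".//p:cNvPr", NS):
        name = elem.get("name")
        descr = elem.get("descr")
        # Tentative picture test: both attributes must be present.
        if name is not None and descr is not None:
            results.append((name, descr))
    return results


print(find_picture_metadata(SLIDE_XML))
```

Only the third element survives the filter because the other two have no "descr" attribute at all.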

MS Excel 2007

Excel images are stored under the xl/media directory and are named generically. eg image1.jpeg

Excel images are stored under the xl/media directory


Image metadata is stored (per worksheet) under the xl/drawings/ directory. Each worksheet's XML file is named generically. eg drawing1.xml, drawing2.xml

Excel image metadata is stored per worksheet in xl/drawings/

Image metadata for worksheets is stored using the <xdr:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<xdr:cNvPr id="2" name="Picture 1" descr="Hex-and-BADCOFFEE.png"/>
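To summarise the three layouts in one place, here is a small lookup table sketch. The table and the is_metadata_member helper are illustrative assumptions, not code from the script. The regular expressions simply encode the generic file naming described above.

```python
import re

# Where each Office 2007 format keeps its picture metadata.
# Each entry: (regex for the XML member, namespace prefix map, qualified tag).
METADATA_LAYOUT = {
    ".docx": (
        r"^word/document\.xml$",
        {"wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"},
        "wp:docPr",
    ),
    ".pptx": (
        r"^ppt/slides/slide\d+\.xml$",
        {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"},
        "p:cNvPr",
    ),
    ".xlsx": (
        r"^xl/drawings/drawing\d+\.xml$",
        {"xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"},
        "xdr:cNvPr",
    ),
}


def is_metadata_member(extension, member_name):
    """Return True if a zip member looks like a picture metadata file for this format."""
    pattern = METADATA_LAYOUT[extension][0]
    return re.match(pattern, member_name) is not None


print(is_metadata_member(".pptx", "ppt/slides/slide2.xml"))
print(is_metadata_member(".pptx", "ppt/slides/_rels/slide2.xml.rels"))
```

A table like this is one way the three near-identical parse functions could share a single code path.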

Other Observations

It was observed that pictures inserted from source .jpg's were then stored in the zip file's media directory as .jpeg.
Pictures inserted from source .bmp and/or .png were stored as .png.
Pictures inserted from clipart .wmf were stored as .wmf. Clipart also had the path to the Clipart source file written to the "descr" attribute. eg descr = "C:\Program Files (x86)\Microsoft Office\MEDIA\CAGCAT10\j0216724.wmf"
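Since a "descr" value can be either a bare filename or a full Windows path (as with the Clipart example), a script may want to reduce both to just the filename. A minimal sketch, assuming a hypothetical helper called source_basename; the standard ntpath module understands backslash-separated paths even when running on Ubuntu.

```python
import ntpath


def source_basename(descr):
    """Return only the filename part of a "descr" value.

    ntpath parses Windows-style paths regardless of the host OS.
    """
    return ntpath.basename(descr)


print(source_basename(r"C:\Program Files (x86)\Microsoft Office\MEDIA\CAGCAT10\j0216724.wmf"))
print(source_basename("Hex-and-BADCOFFEE.png"))
```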


The Script

When first researching/writing any extraction script, Google is your friend :)
Some helpful Python tips were found at StackOverflow by searching for "Python", "zip" and "namespace XML".
This post showed how we can read the files from a zipfile and extract/output selected files.
This post showed how to handle XML namespaces in an XML file. This is relevant because the element tags containing the source picture's filename are declared using XML namespaces.
So for .docx files, the <wp:docPr> tag is used for picture metadata. The "wp" represents the namespace and the "docPr" is the element name. Namespaces are used so that you can have multiple elements with the same name so long as they are in different namespaces. eg domain1:petmonkey_name, domain2:petmonkey_name.
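Here is a small sketch of the namespace idea using the petmonkey example. The example.com URIs and the monkey names are invented for illustration. Note how ElementTree expands each prefix into its full URI inside the tag, which is why a prefix-to-URI dictionary is needed when searching.

```python
import xml.etree.ElementTree as ET

# Made-up XML: two elements share a name but live in different namespaces.
XML = (
    '<root xmlns:domain1="http://example.com/domain1" '
    'xmlns:domain2="http://example.com/domain2">'
    '<domain1:petmonkey_name>monkey-one</domain1:petmonkey_name>'
    '<domain2:petmonkey_name>monkey-two</domain2:petmonkey_name>'
    '</root>'
)

# Prefix -> URI map handed to find() so the prefixes can be resolved.
ns = {
    "domain1": "http://example.com/domain1",
    "domain2": "http://example.com/domain2",
}

root = ET.fromstring(XML)
first = root.find("domain1:petmonkey_name", ns)
second = root.find("domain2:petmonkey_name", ns)

# The stored tag is "{namespace-uri}localname", not "prefix:localname".
print("%s -> %s" % (first.tag, first.text))
print("%s -> %s" % (second.tag, second.text))
```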

The msoffice-pic-extractor.py script takes two arguments:
- the target filename of the MS Office 2007 file (or it can be the name of a single level directory containing multiple MS Office 2007 files)
- the destination directory for extracting the pictures to. The pictures are extracted to a sub-directory with the same name as the source MS Office file. The extracted files will be labelled like image1.jpeg etc.

Here's the script's help text:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py -h
usage: msoffice-pic-extractor.py [-h] target destdir

Extracts pics from given MS Office document

positional arguments:
  target      MS Office document / directory of Office documents to be searched
  destdir     output dir

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$
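Help text of this shape is what Python's argparse module generates for two positional arguments. The following is a guess at how such a parser could be set up, not the script's actual source; build_parser is a hypothetical name and the argument list passed to parse_args is just sample input.

```python
import argparse


def build_parser():
    """Build a parser with the two positional arguments shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="msoffice-pic-extractor.py",
        description="Extracts pics from given MS Office document",
    )
    parser.add_argument(
        "target",
        help="MS Office document / directory of Office documents to be searched",
    )
    parser.add_argument("destdir", help="output dir")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["testdoc.docx", "testdocop"])
print("%s %s" % (args.target, args.destdir))
```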

The script tries to detect whether the "target" argument is a directory. If it's not detected as a directory, it is assumed to be a single file. The file extension is then checked and the parse_docx / parse_xlsx / parse_pptx functions are called as required.
If the "target" is a directory, then the script walks through the files in the directory and calls the appropriate parse functions based on the file extension.

Note: The script does not currently handle nested subdirectories - it ass-umes all files are contained in the root of the directory specified.
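The file-or-directory decision and the extension check might be sketched like this. It is an approximation under assumptions: the PARSERS table maps to function names as plain strings (the real script calls the functions directly), and pick_parser / collect_targets are illustrative helpers. Unlike the script, this sketch also silently skips anything that is not an existing file.

```python
import os
import tempfile

# Hypothetical dispatch table: file extension -> name of the parse function to call.
PARSERS = {".docx": "parse_docx", ".xlsx": "parse_xlsx", ".pptx": "parse_pptx"}


def pick_parser(filename):
    """Return the parse function name for a filename, or None if unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    return PARSERS.get(extension)


def collect_targets(target):
    """List the supported files for a single file or a single-level directory."""
    if os.path.isdir(target):
        # Only the top level is listed; nested subdirectories are not walked.
        candidates = [os.path.join(target, n) for n in sorted(os.listdir(target))]
    else:
        candidates = [target]
    return [c for c in candidates if os.path.isfile(c) and pick_parser(c)]


# Demo on a throwaway directory holding empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("a.docx", "b.txt", "c.pptx"):
    open(os.path.join(demo_dir, name), "w").close()

demo_targets = collect_targets(demo_dir)
for path in demo_targets:
    print("%s -> %s" % (os.path.basename(path), pick_parser(path)))
```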

The parse functions are all very similar - we probably could have had one function and passed it different arguments to indicate the filetype but for initial testing/debugging, it was quicker/simpler to have separate parse functions.
Anyhoo, each parse function follows this basic pattern:
- Checks that the file is a valid zip file using the zipfile.is_zipfile() function
- Creates a zipfile object via the zipfile.ZipFile() function
- Uses the ZipFile object's infolist() method to list the file contents of the zip file. It then checks for the picture metadata XML file (eg word/document.xml) and prints out the relevant metadata. For any pictures stored in the media directory, it also calls the ZipFile object's read() method to retrieve the contents and then writes the contents to a new file in the "destdir" directory.
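The extraction half of that pattern can be sketched as below. This is a simplified stand-in rather than the script's code: extract_media and its media_prefix parameter are hypothetical, the metadata printing is left out, and the demo archive is built in memory with fake picture bytes.

```python
import io
import os
import tempfile
import zipfile


def extract_media(source, destdir, media_prefix="word/media/"):
    """Copy every zip member under media_prefix into destdir.

    Returns the list of filenames written (empty if source is not a zip).
    """
    if not zipfile.is_zipfile(source):
        return []
    if not os.path.isdir(destdir):
        os.makedirs(destdir)
    written = []
    with zipfile.ZipFile(source) as z:
        for info in z.infolist():
            if not info.filename.startswith(media_prefix):
                continue
            # Keep only the last path component so output stays inside destdir.
            outname = os.path.basename(info.filename)
            if not outname:
                continue  # skip directory entries
            with open(os.path.join(destdir, outname), "wb") as out:
                out.write(z.read(info.filename))
            written.append(outname)
    return written


# Dummy .docx-like archive with two fake pictures.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/media/image1.png", b"fake png bytes")
    z.writestr("word/media/image2.jpeg", b"fake jpeg bytes")

outdir = os.path.join(tempfile.mkdtemp(), "testdoc.docx")
written = extract_media(buf, outdir)
print(written)
```

For .pptx or .xlsx files the same function would presumably be called with media_prefix set to "ppt/media/" or "xl/media/".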

Checking for the picture metadata involves calling ElementTree.parse to parse the appropriate XML file and then extracting/printing out any picture elements. Looking at the .docx parsing code, we need to extract the "name" and "descr" attributes from any "wp:docPr" elements.

So the relevant code looks like this:
docdata = z.open(j.filename) # opens the picture metadata xml file using the zipfile library's open function
tree = ET.parse(docdata) # parses the XML file to get to the root/top node
root = tree.getroot()
We then specify that "wp" represents an XML namespace via the following line:
namespace = {"wp" : "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
and now we call the findall function to return a list (called picdatas) of all "<wp:docPr>" elements:
picdatas = root.findall(".//wp:docPr", namespace)
Then we can iterate through each item in the list and print the "name" and "descr" attributes (if they are both set):

for picdata in picdatas: #id="4" name="Picture 3" descr="Penguins.jpg"
    name = picdata.get("name")
    descr = picdata.get("descr")
    if (name is not None) and (descr is not None):
        print(filename + " : " + j.filename + ", name = " + name + ", descr = " + descr)
For more information on parsing XML trees, see my previous post.
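For anyone wanting to try those lines in isolation, here is a self-contained sketch that strings them together. It is written for Python 3, whereas the script itself was tested on Python 2.7. The wrapper function docx_picture_metadata and the trimmed document XML are assumptions for illustration; a real word/document.xml nests <wp:docPr> much deeper.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

WP_NS = {"wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}

# Hypothetical, heavily trimmed stand-in for word/document.xml.
DOCUMENT_XML = (
    '<doc xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">'
    '<wp:docPr id="1" name="Picture 0" descr="Hex-and-BADCOFFEE.png"/>'
    '<wp:docPr id="2" name="Picture 1"/>'
    '</doc>'
)


def docx_picture_metadata(source):
    """Return (name, descr) pairs from word/document.xml inside a .docx."""
    with zipfile.ZipFile(source) as z:
        with z.open("word/document.xml") as docdata:
            root = ET.parse(docdata).getroot()
    found = []
    for picdata in root.findall(".//wp:docPr", WP_NS):
        name = picdata.get("name")
        descr = picdata.get("descr")
        # Only report elements where both attributes are set.
        if name is not None and descr is not None:
            found.append((name, descr))
    return found


# Dummy archive holding just the metadata file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", DOCUMENT_XML)

print(docx_picture_metadata(buf))
```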

Testing

The script has been tested on Win7 x64 & Ubuntu 14.04 x64 with Python 2.7 and MS Office 2007 .docx, .xlsx, .pptx files.

For testing, we created a .docx with pictures inserted in the following order:
Hex-and-BADCOFFEE.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


Script use example for single .docx (testdoc.docx):
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc.docx testdocop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc.docx
Output dir = testdocop

Attempting to open single file testdoc.docx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop/testdoc.docx
Extracting picture image4.jpeg to testdocop/testdoc.docx
Extracting picture image2.png to testdocop/testdoc.docx
Extracting picture image3.png to testdocop/testdoc.docx
cheeky@ubuntu:~$

Note: You can see the "name" attribute gives a general indication of the order in which the pictures were inserted into a .docx file. Also note how the "descr" values show the source image's filename.

Here's the script's output directory contents:

Extracted pictures for the first .docx version


Note: Extracted picture file types may differ from the original source file types

Later, we inserted "winphone-washer.png" after the first picture, so the order became:
Hex-and-BADCOFFEE.png
winphone-washer.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


We then ran the script on the new file (testdoc2.docx) ...
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc2.docx testdocop2

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc2.docx
Output dir = testdocop2
Creating destination directory ...

Attempting to open single file testdoc2.docx

Attempting to parse docx = testdoc2.docx
Input MS Office file testdoc2.docx checked OK!
Processing word/document.xml for picture metadata
testdoc2.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc2.docx : word/document.xml, name = Picture 4, descr = winphone-washer.png
testdoc2.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc2.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc2.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop2/testdoc2.docx
Extracting picture image5.jpeg to testdocop2/testdoc2.docx
Extracting picture image4.png to testdocop2/testdoc2.docx
Extracting picture image3.png to testdocop2/testdoc2.docx
Extracting picture image2.png to testdocop2/testdoc2.docx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the second .docx version (added winphone-washer.png)

We can see from the "name" values that Picture 4 (winphone-washer.png) was added after Pictures 0 to 3.

Script use example for single .pptx (testppt.pptx):
For testing, we created a .pptx with the pictures in the following order -
Hex-and-BADCOFFEE.png (slide1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on slide2)
wp-app-trawling-blk.jpg (slide3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testppt.pptx testppop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testppt.pptx
Output dir = testppop
Creating destination directory ...

Attempting to open single file testppt.pptx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testppop/testppt.pptx
Extracting picture image2.png to testppop/testppt.pptx
Extracting picture image1.png to testppop/testppt.pptx
Extracting picture image4.jpeg to testppop/testppt.pptx
cheeky@ubuntu:~$

In contrast to the .docx file, the "name" values seem to vary depending on the source file type (or perhaps the position on the slide? eg title vs body) so we can't ascertain the order in which they were added. The output file names seem to confirm the order of appearance however. Also note how the "descr" values show the source image's filename.

The output directory looked like:

Extracted pictures for the test .pptx file



Script use example for single .xlsx (testxl.xlsx):
For testing, we created a .xlsx with the pictures in the following order:
Hex-and-BADCOFFEE.png (sheet1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on sheet2)
wp-app-trawling-blk.jpg (sheet3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testxl.xlsx testxlop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testxl.xlsx
Output dir = testxlop
Creating destination directory ...

Attempting to open single file testxl.xlsx

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testxlop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testxlop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testxlop/testxl.xlsx
Extracting picture image3.png to testxlop/testxl.xlsx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the test .xlsx file


The "name" values appear to be reset per Excel worksheet/ XML drawing file but the numbering seems consistent with the order in which they appear. eg "Picture 1" appears before "Picture 2" on worksheet / XML drawing 2. Also note how the "descr" values show the source image's filename.

And now for the bonus party trick - processing all three file types from the same source directory with one command:

Here's what the source directory looked like:


All 3 MS Office file types in the same source directory


Running the script looks like:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testgroup testgroupop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testgroup
Output dir = testgroupop
Creating destination directory ...

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testgroupop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testgroupop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testgroupop/testxl.xlsx
Extracting picture image3.png to testgroupop/testxl.xlsx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testgroupop/testdoc.docx
Extracting picture image4.jpeg to testgroupop/testdoc.docx
Extracting picture image2.png to testgroupop/testdoc.docx
Extracting picture image3.png to testgroupop/testdoc.docx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testgroupop/testppt.pptx
Extracting picture image2.png to testgroupop/testppt.pptx
Extracting picture image1.png to testgroupop/testppt.pptx
Extracting picture image4.jpeg to testgroupop/testppt.pptx

Parsed 3 MS Office files
cheeky@ubuntu:~$

Here are the output files:

Output files after group processing

For giggles, we created a LibreOffice Writer document in Ubuntu, saved it as a Word 2007/2010/2013 .docx and then ran the script. The script extracted the pictures OK but the "descr" and "name" fields did not contain the same level of detail as observed for an official MS Office 2007 .docx. The "name" attribute was consistently set to "Picture" and the "descr" attribute was blank/empty. So while we may not be able to retrieve the source picture's filename, we can still extract the images.
This may also indicate that Word 2010/2013 uses the same file structure as Word 2007. So our script might be able to extract pictures from MS Office 2010/2013 documents. Meh.

Final Thoughts

Currently the msoffice-pic-extractor.py script handles either individual files or multiple MS Office files located under a single level "target" directory. It does not handle nested subdirectories.
Resolving this issue would probably require incorporating the path into the output filename so that 2 MS Office files with the same filename but under different directories could be processed OK. Seemed a little over-complicated for such a quick script. Or maybe I was just feeling like a lazy monkey (again!).
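For the curious, one possible approach is sketched below. It is not implemented in the script and makes a few assumptions: plan_outputs is a hypothetical helper, it walks the tree with os.walk, and it mirrors each file's relative path under the destination so that same-named files in different directories cannot collide.

```python
import os
import tempfile

OFFICE_EXTENSIONS = (".docx", ".xlsx", ".pptx")


def plan_outputs(target, destdir):
    """Map each Office file found under target (recursively) to a unique output dir."""
    plan = {}
    for dirpath, _dirnames, filenames in os.walk(target):
        for name in sorted(filenames):
            if name.lower().endswith(OFFICE_EXTENSIONS):
                source = os.path.join(dirpath, name)
                # The path relative to target keeps same-named files apart.
                relative = os.path.relpath(source, target)
                plan[source] = os.path.join(destdir, relative)
    return plan


# Demo: two empty placeholder files with the same name in different subdirectories.
demo_root = tempfile.mkdtemp()
for sub in ("a", "b"):
    os.makedirs(os.path.join(demo_root, sub))
    open(os.path.join(demo_root, sub, "report.docx"), "w").close()

demo_plan = plan_outputs(demo_root, "outdir")
for source in sorted(demo_plan):
    print("%s -> %s" % (os.path.relpath(source, demo_root), demo_plan[source]))
```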

Because MS Office can convert the inserted source pictures into a different file type for storage, any original EXIF data (eg GPS co-ordinates, camera model) will not be retained (apart from the source filename).

While the forensic uses for this project are somewhat limited (eg possible IP theft / illicit image storage), the project still provided a good learning exercise showing how we can use Python to read zip files and parse XML.
It wasn't overly-complicated (he says thanking StackOverflow profusely) but as with learning any language, practice makes perfect. The alternative view is - throw enough crap on the wall, and some of it is bound to stick to you :)