# MIME Parsing and Manipulation - Python

MIME (Multipurpose Internet Mail Extensions) is a standard for sending multipart multimedia data through Internet mail. This standard exposes mechanisms for specifying and describing the format of Internet message bodies.

A MIME-encoded message looks similar to the following:

Content-Type: multipart/mixed; boundary="====_238659232=="
Date: Mon, 03 Apr 2000 18:30:23 -0400
From: Andre Lessa <alessa@lessaworld.com>
To: Renata Lessa <rlessa@lessaworld.com>
Subject: Python Book

--====_238659232==
Content-Type: text/plain; charset="us-ascii"

Sorry Honey, I am late for dinner. I am still writing Chapter 13. Meanwhile, take a look at the following Cooking material that you've asked me to find in the Internet.

--====_238659232==
Content-Type: application/msword; name="cookmasters.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="cookmasters.doc"

GgjEPgkwIr4G29m1Lawr7GgjEPgkwIr4G29m14tifkAb3qPgGgjEPgkwIr4G29m1La29m14tifkAb
3qPgGgjEPgkwIr4G29m1Law29m14tifkAb3qPgGgjEPgkwIr4G29m1Lawr629m14tifkAb3qPgIr4
G29m1Lawr2GgjEPgkwIr4G29m1Lawr29m14tifkAb3qPg29m14tifkAb3qPgGgjEPgkwIr4G29m1L
awr8Ab3qPgGgjEPgkwIr4G29m1GgjEPgkwIr4G29m1Lawr7GgjEPgkwIr4G29m1Hawr0==
--====_238659232==--

Note that the message is broken into parts, and each part is delimited by a boundary. The boundary itself works like a separator, and its value is defined in the first line of the message, right next to the first content-type.

Every part starts with a boundary line, followed by a set of RFC 822 headers that give the content type and the encoding format of the data for that part. The data itself comes next, separated from the headers by a blank line.

Check out the last line of the message. Do you see the trailing -- after the boundary? That's how the message identifies the final boundary. The next couple of modules are tools for mail and news message processing that use MIME messages.

For more information, check out RFC 1521.
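The boundary structure described above can be inspected programmatically. The following is a minimal sketch that assumes Python 3, where the standard email package does the parsing (the modules covered in the rest of this text belong to Python 2). The shortened message, the attachment name note.bin, and its payload are made up for illustration.

```python
import email

# A shortened version of the sample message; the attachment is a tiny
# made-up payload ("hello" in base64), not the original document.
RAW = (
    'Content-Type: multipart/mixed; boundary="====_238659232=="\n'
    'Subject: Python Book\n'
    '\n'
    '--====_238659232==\n'
    'Content-Type: text/plain; charset="us-ascii"\n'
    '\n'
    'Sorry Honey, I am late for dinner.\n'
    '\n'
    '--====_238659232==\n'
    'Content-Type: application/octet-stream; name="note.bin"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'aGVsbG8=\n'
    '--====_238659232==--\n'
)

msg = email.message_from_string(RAW)
boundary = msg.get_boundary()          # value of the boundary parameter

# Each part carries its own headers and (decoded) data.
parts = [(part.get_content_type(), part.get_payload(decode=True))
         for part in msg.get_payload()]

for content_type, data in parts:
    print(content_type, len(data), "bytes")
```

The parser uses the boundary parameter of the top-level Content-Type header to find the section dividers and the final end marker, so the calling code never has to scan for them itself.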

rfc822

The rfc822 module parses mail headers that are defined by the Internet standard RFC 822. This standard specifies the syntax for text messages that are sent among computer users, within the framework of electronic mail.

These headers are used in a number of contexts, including mail handling and the HTTP protocol. This module defines a class, Message, which represents a collection of email headers and is usually used to read such headers from a file. A Message object behaves like a dictionary in which the message headers are the keys. The module also defines a helper class, AddressList, for parsing RFC 822 addresses.

mimetools

The mimetools module provides utility tools for parsing and manipulating MIME multipart and encoded messages. This module contains a special dictionary-like object called Message that collects some information about MIME-encoded messages. mime-version, content-type, charset, to, date, from, and subject are some examples of dictionary keys that the object possesses. This module also implements some utility functions. The choose_boundary() function creates a unique boundary string.

The next two functions encode and decode file objects based on the encoding format, which can be "quoted-printable", "base64", or "uuencode".

• decode(inputfileobject, outputfileobject, encoding)
• encode(inputfileobject, outputfileobject, encoding)

The functions copyliteral(input, output) and copybinary(input, output) read the input file object until EOF and write its contents to the output file object; copyliteral copies line by line, and copybinary copies in blocks. Note that both objects must already be open. The call message = mimetools.Message(fileobject) returns a Message object derived from the rfc822.Message class. Therefore, it supports all the methods supported by rfc822.Message, plus the following ones:

• message.gettype()— Returns the type/subtype from the content-type header. The default value is text/plain.
• message.getencoding()— Returns the message encoding method. The default value is 7bit.
• message.getplist()— Returns the list of parameters from the content-type header.
• message.getmaintype()— Returns the main type of the content-type header. The default value is text.
• message.getsubtype()— Returns the subtype of the content-type header. The default value is plain.
• message.getparam(name)— Returns the value of the first name parameter found in the content-type header.
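The accessors listed above have close counterparts in the Python 3 email package. The sketch below assumes Python 3 (mimetools itself is a Python 2 module) and uses a made-up two-header message; it is meant only to show the kind of information these methods return, not to reproduce the mimetools API.

```python
from email import message_from_string

# Hypothetical message used only for illustration.
msg = message_from_string(
    'Content-Type: text/html; charset="iso-8859-1"\n'
    'Content-Transfer-Encoding: quoted-printable\n'
    '\n'
    '<p>caf=E9</p>\n'
)

info = {
    "type": msg.get_content_type(),          # like gettype()
    "maintype": msg.get_content_maintype(),  # like getmaintype()
    "subtype": msg.get_content_subtype(),    # like getsubtype()
    "charset": msg.get_param("charset"),     # like getparam("charset")
    # like getencoding(); "7bit" is supplied here as the fallback value
    "encoding": msg.get("Content-Transfer-Encoding", "7bit"),
}

# A message without a Content-Type header falls back to text/plain.
default_type = message_from_string("\nbody\n").get_content_type()
print(info, default_type)
```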

MimeWriter

The MimeWriter module implements a generic file-writing class, also called MimeWriter, that is used to create MIME encoded multipart files (messages).

message = MimeWriter.MimeWriter(fileobject_forwriting)

The following method adds a header line ("key: value") to the MIME message:

message.addheader(key, value [,prefix = 0])

If prefix is 0, the header line is appended to the end; if it is 1, the line is inserted at the start.

Next, you have some other methods that are exposed by the message object.

• message.flushheaders()— Writes all headers to the file.
• message.startbody(ctype [,plist [,prefix = 1]])— Specifies the content-type, and a list of additional parameters to be included in the message body. It returns a file-like object that must be used to write to the message body.
• message.startmultipartbody(subtype [,boundary [,plist [,prefix = 1]]])— Specifies the multipart subtype, a possible user-defined boundary, and a list of additional parameters to be included in the multipart message subtype. It returns a file-like object that must be used to write to the message body.
• message.nextpart()— Returns a new MimeWriter instance that represents an individual part of a multipart message. The startmultipartbody() method must be called before calling this one.
• message.lastpart()— Indicates the last part of a multipart message.

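The following code shows the basic usage of the MimeWriter module, along with other supporting modules.

import MimeWriter
import quopri, base64
import StringIO

msgtext = "This message has 3 images as attachments."
files = ["sun.jpg", "rain.jpg", "beach.jpg"]
mimefile = open("mymessage.msg", "w")

mimemsg = MimeWriter.MimeWriter(mimefile)
mimemsg.addheader("Mime-Version", "1.0")
mimemsg.startmultipartbody("mixed")

# First part: the message text, quoted-printable encoded.
msgpart = mimemsg.nextpart()
msgpart.addheader("Content-Transfer-Encoding", "quoted-printable")
body = msgpart.startbody("text/plain")
quopri.encode(StringIO.StringIO(msgtext), body, 0)

# One base64-encoded part per image file.
for filename in files:
    msgpart = mimemsg.nextpart()
    msgpart.addheader("Content-Transfer-Encoding", "base64")
    body = msgpart.startbody("image/jpeg")
    base64.encode(open(filename, "rb"), body)

mimemsg.lastpart()
mimefile.close()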
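For readers on Python 3, where MimeWriter is not available, roughly the same message can be assembled with email.message.EmailMessage. This is a sketch under that assumption; the image bytes below are a placeholder rather than a real JPEG, and only one attachment is added to keep the example short.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for the contents of sun.jpg.
fake_image = b"\xff\xd8\xff\xe0 placeholder image data"

msg = EmailMessage()
msg["Subject"] = "Python Book"
msg.set_content("This message has 1 image as an attachment.")

# Adding an attachment converts the message to multipart/mixed and
# base64-encodes the binary data.
msg.add_attachment(fake_image, maintype="image", subtype="jpeg",
                   filename="sun.jpg")

attachments = list(msg.iter_attachments())
serialized = msg.as_string()   # headers, boundaries, and encoded parts
print(serialized)
```

The boundary string, the per-part headers, and the transfer encoding are all generated by the library, which is the work that MimeWriter leaves to the caller through addheader(), startbody(), and the encoding modules.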

multifile

The multifile module enables you to treat distinct parts of a text file as file-like input objects. Usually, it is used with text files that are found in MIME-encoded messages. This module works by splitting a file into logical blocks that are delimited by a unique boundary string. The module implements a single class, MultiFile:

MultiFile(fp[, seekable])

Creates a multifile. You must instantiate this class with an input object argument for the MultiFile instance to get lines from, such as a file object returned by open(). MultiFile only looks at the input object's readline(), seek(), and tell() methods, and the latter two are only needed if you want random access to the individual MIME parts. To use MultiFile on a non-seekable stream object, set the optional seekable argument to false; this will prevent using the input object's seek() and tell() methods.

It will be useful to know that in MultiFile's view of the world, text is composed of three kinds of lines: data, section-dividers, and end-markers. MultiFile is designed to support parsing of messages that might have multiple nested message parts, each with its own pattern for section-divider and end-marker lines.

A MultiFile instance has the following methods:

• push(str)— Pushes a boundary string. When an appropriately decorated version of this boundary is found as an input line, it will be interpreted as a section-divider or end-marker. All subsequent reads will return the empty string to indicate end-of-file, until a call to pop() removes the boundary or a next() call re-enables it. It is possible to push more than one boundary. Encountering the most recently pushed boundary will return EOF; encountering any other boundary will raise an error.
• readline(str)— Reads a line. If the line is data (not a section-divider, end-marker, or real EOF), returns it. If the line matches the most recently stacked boundary, returns '' and sets self.last to 1 or 0 according to whether the match is or is not an end-marker. If the line matches any other stacked boundary, raises an error. On encountering end-of-file on the underlying stream object, the method raises Error unless all boundaries have been popped.
• readlines(str)— Returns all lines remaining in this part as a list of strings.
• read()— Reads all lines, up to the next section. Returns them as a single (multiline) string. Note that this doesn't take a size argument.
• next()— Skips lines to the next section (that is, reads lines until a section-divider or end-marker has been consumed). Returns true if there is such a section, false if an end-marker is seen. Re-enables the most recently pushed boundary.
• pop()— Pops a section boundary. This boundary will no longer be interpreted as EOF.
• seek(pos[, whence])— Seeks. Seek indices are relative to the start of the current section. The pos and whence arguments are interpreted as if for a file seek.
• tell()— Returns the file position relative to the start of the current section.
• is_data(str)— Returns true if str is data and false if it might be a section boundary. As written, it tests for a prefix other than -- at the start of a line (which all MIME boundaries have), but it is declared so that it can be overridden in derived classes.

Note: This test is intended as a fast guard for the real boundary tests; if it always returns false, it will merely slow processing, not cause it to fail.

• section_divider(str)— Turns a boundary into a section-divider line. By default, this method prepends -- (which MIME section boundaries have), but it is declared so that it can be overridden in derived classes. This method need not append LF or CR-LF because a comparison with the result ignores trailing whitespace.
• end_marker(str)— Turns a boundary string into an end-marker line. By default, this method prepends -- and appends -- (similar to a MIME-multipart end-of-message marker), but it is declared so that it can be overridden in derived classes. This method need not append LF or CR-LF, because a comparison with the result ignores trailing whitespace.

Finally, MultiFile instances have two public instance variables:

• level— This is the nesting depth of the current part.
• last— True if the last end-of-file was for an end-of-message marker.
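To make the three kinds of lines concrete, here is a small stand-alone sketch of boundary-based splitting. It is not the multifile implementation; split_sections is a hypothetical helper written for this illustration, and it handles only a single, non-nested boundary.

```python
def split_sections(lines, boundary):
    """Return the data lines of each section delimited by boundary."""
    divider = "--" + boundary            # section-divider line
    end_marker = "--" + boundary + "--"  # end-marker line
    sections = []
    current = None                       # None until the first divider
    for line in lines:
        stripped = line.rstrip()         # ignore trailing whitespace
        if stripped == end_marker:
            if current is not None:
                sections.append(current)
            return sections
        if stripped == divider:
            if current is not None:
                sections.append(current)
            current = []
        elif current is not None:
            current.append(line)         # plain data line
    raise ValueError("end-marker not found")


sample = ["preamble\n", "--XX\n", "a\n", "--XX\n",
          "b\n", "c\n", "--XX--\n", "epilogue\n"]
sections = split_sections(sample, "XX")
print(sections)
```

MultiFile does the same classification lazily while reading, and keeps a stack of boundaries so that nested multipart messages can be handled with push() and pop().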

The following code demonstrates the multifile module.

import multifile
import rfc822, cgi

multipart = "multipart/"
filename = open("mymail.msg")
msg = rfc822.Message(filename)

msgtype, args = cgi.parse_header(msg["content-type"])

if msgtype[:10] == multipart:
    multifilehandle = multifile.MultiFile(filename)
    multifilehandle.push(args["boundary"])
    while multifilehandle.next():
        msg = rfc822.Message(multifilehandle)
        print multifilehandle.read()
    multifilehandle.pop()
else:
    print "This is not a multi-part message!"
    print "---------------------------------"
    print filename.read()

• Line 6: msg is a dictionary-like object. You can apply dictionary methods to this object, such as msg.keys(), msg.values(), and msg.items().
• Line 8: Parses the content-type header.
• Lines 11-16: Handles the multipart message.
• Line 15: Prints the multipart message.
• Line 20: Prints the plain message, when necessary.

mailcap

The mailcap module is used to read mailcap files and to configure how MIME-aware applications react to files with different MIME types.

Note: Mailcap files are used to inform applications, including mail readers and Web browsers, how to process files with different MIME types. A small section of a mailcap file looks like this:

image/jpeg; imageviewer %s
application/zip; gzip %s

The next code demonstrates the usage of the mailcap module.

>>> import mailcap
>>> capsdict = mailcap.getcaps()
>>> command, rawentry = mailcap.findmatch(capsdict, "image/jpeg",
...                                       filename="/usr/local/uid213")
>>> print command
imageviewer /usr/local/uid213
>>> print rawentry
image/jpeg; imageviewer %s

The getcaps() function reads the mailcap file and returns a dictionary mapping MIME types to mailcap entries; and the findmatch() function searches the dictionary for a specific MIME entry, returning a command line ready to be executed along with the raw mailcap entry.

mimetypes

The mimetypes module supports conversions between a filename or URL and the MIME type associated with the filename extension. Essentially, it is used to guess the MIME type associated with a file, based on its extension. For example,

Filename extension    MIME type (main type/subtype)
.html                 text/html
.gif                  image/gif
.xml                  application/xml

A complete list of extensions and their associated MIME types can be found by typing

import mimetypes
for EXTENSION in mimetypes.types_map.keys():
    print EXTENSION, "=", mimetypes.types_map[EXTENSION]

Next, you have a list of functions exposed by the mimetypes module.

• mimetypes.guess_type(url_or_filename)— Returns a tuple (type, encoding), such as ('image/jpeg', None) and ('application/zip', None).
• mimetypes.guess_extension(type)— Tries to guess the file extension based on a MIME type.
• mimetypes.init([files])— Initializes the module after reading a file stored in the following format:

type/subtype: extension1, extension2, …

• mimetypes.read_mime_types(filename)— Reads a file and returns a dictionary that maps filename extensions (including the leading dot) to the MIME types associated with them.

The following dictionaries are also exposed by the mimetypes module.

• mimetypes.suffix_map— Dictionary that maps suffixes to suffixes.
• mimetypes.encodings_map— Dictionary that maps filename extensions to encoding types.
• mimetypes.types_map— Dictionary that maps filename extensions to MIME types.
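A short sketch of these functions in use follows. It is written for Python 3 (hence the print() calls), and the filenames are arbitrary examples rather than files that need to exist.

```python
import mimetypes

# guess_type() looks only at the name; the files need not exist.
html_type, html_encoding = mimetypes.guess_type("index.html")
tar_type, tar_encoding = mimetypes.guess_type("archive.tar.gz")

# The reverse lookup: from a MIME type to a filename extension.
extension = mimetypes.guess_extension("image/gif")

# types_map is keyed by extension, including the leading dot.
gif_type = mimetypes.types_map[".gif"]

print(html_type, html_encoding)
print(tar_type, tar_encoding)
print(extension, gif_type)
```

The second lookup shows why guess_type() returns a pair: the .gz suffix is reported as an encoding, while the type is derived from the remaining .tar suffix.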

base64

The base64 module performs base64 encoding and decoding of arbitrary binary strings into text strings that can be safely emailed or posted. This module is commonly used to encode binary data in mail attachments.

The arguments of the next two functions must be file objects. In both cases, the first argument must be open for reading and the second must be open for writing:

base64.encode(messagefilehandle, outputfilehandle)
base64.decode(encodedfilehandle, outputfilehandle)

This module also implements the functions encodestring(stringtoencode) and decodestring(encodedstring), which are built on top of the encode and decode functions. Both internally use the StringIO module in order to enable the use of the base64 module to encode and decode strings. Note that the decodestring() function returns a string that contains the decoded binary data.
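The file-based and string-based interfaces can be compared in a short round trip. This sketch assumes Python 3, where the string helpers are named encodebytes() and decodebytes() and operate on bytes; io.BytesIO objects stand in for real files, and the sample data is arbitrary.

```python
import base64
import io

raw = b"\x00\x01 binary attachment data \xff"

# File-to-file interface: read from one file object, write to another.
encoded_file = io.BytesIO()
base64.encode(io.BytesIO(raw), encoded_file)
encoded = encoded_file.getvalue()

decoded_file = io.BytesIO()
base64.decode(io.BytesIO(encoded), decoded_file)
decoded = decoded_file.getvalue()

# In-memory interface for data that is already held as bytes.
roundtrip = base64.decodebytes(base64.encodebytes(raw))

print(encoded)
```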

quopri

The quopri module performs quoted-printable transport encoding and decoding of MIME quoted-printable data, as defined in RFC 1521: "MIME (Multipurpose Internet Mail Extensions) Part One". The quoted-printable encoding is designed for data in which there are relatively few nonprintable characters; the base64 encoding scheme available via the base64 module is more compact if there are many such characters, as when sending a graphics file. This format is primarily used to encode text files.

• decode(input, output)— Decodes the contents of the input file and writes the resulting decoded binary data to the output file. input and output must either be file objects or objects that mimic the file object interface. input will be read until input.read() returns an empty string.
• encode(input, output, quotetabs)— Encodes the contents of the input file and writes the resulting quoted-printable data to the output file. input and output must either be file objects or objects that mimic the file object interface. input will be read until input.read() returns an empty string.

This module only supports file-to-file conversions. If you need to handle string objects, you need to convert them using the StringIO module.

import quopri, StringIO, sys
infile = StringIO.StringIO("caf\xe9 = coffee")
encoded = StringIO.StringIO()
quopri.encode(infile, encoded, 0)    # third argument is quotetabs
encoded.seek(0)
quopri.decode(encoded, sys.stdout)

This module is purely based on plain U.S. ASCII text. Non-U.S. characters are mapped to an = followed by two hexadecimal digits. The = character itself is encoded as =3D, and whitespace at the end of a line is represented by =20.
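The mapping rules just described can be observed directly. The sketch below assumes Python 3, where quopri works on bytes and also offers encodestring() and decodestring() helpers; the sample text is made up.

```python
import io
import quopri

raw = b"caf\xe9 = coffee"        # one non-ASCII byte and one "="

encoded = quopri.encodestring(raw)
decoded = quopri.decodestring(encoded)

# The same conversion through the file-to-file interface.
out = io.BytesIO()
quopri.encode(io.BytesIO(raw), out, quotetabs=False)

print(encoded)
```

Only the two special bytes are rewritten; the rest of the text stays readable, which is why this encoding suits mostly-ASCII text better than base64.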

mailbox

The mailbox module implements classes that allow easy and uniform access to read various mailbox formats in a UNIX system.

import mailbox

mailboxname = "/tmp/mymailbox"
mbox = mailbox.UnixMailbox(open(mailboxname))
msgcounter = 0
while 1:
    mailmsg = mbox.next()
    if not mailmsg:
        break
    msgcounter = msgcounter + 1
    messagebody = mailmsg.fp.read()    # read the body of this message
    print messagebody
print
print "The message counter is %d" % (msgcounter)
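On Python 3 the UnixMailbox class no longer exists; the closest counterpart is mailbox.mbox. The following sketch assumes Python 3 and builds a throwaway mailbox in a temporary directory so that it can run anywhere; the addresses and subjects are invented.

```python
import mailbox
import os
import tempfile
from email import message_from_string

# Create a small mbox file to iterate over (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "mymailbox")
mbox = mailbox.mbox(path)
for n in range(3):
    mbox.add(message_from_string(
        "From: sender@example.com\nSubject: message %d\n\nbody %d\n" % (n, n)))
mbox.flush()

# Iterating over the mailbox yields one message object per mail.
msgcounter = 0
subjects = []
for mailmsg in mbox:
    msgcounter += 1
    subjects.append(mailmsg["Subject"])
mbox.close()

print("The message counter is %d" % msgcounter)
```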

mimify

The mimify module has functions to convert and process simple and multi-part mail messages to/from MIME format—messages are converted to plain text. This module can be used either as a command line tool, or as a regular Python module.

To encode, you need to type:

$ mimify.py -e raw_message mime_message

or

import mimify, StringIO, sys
msgfilename = "msgfilename.msg"
filename = StringIO.StringIO()
mimify.unmimify(msgfilename, filename, 1)
filename.seek(0)
mimify.mimify(filename, sys.stdout)

To decode, type

$ mimify.py -f mime_message raw_message

or

import mimify, sys
mimify.unmimify(messagefilename, sys.stdout, 1)

Message(file[, seekable])

An rfc822.Message instance is instantiated with an input object as parameter. Message relies only on the input object having a readline() method; in particular, ordinary file objects qualify. Instantiation reads headers from the input object up to a delimiter line (normally a blank line) and stores them in the instance. If the input object has seek and tell capability, the rewindbody() method will work; also, illegal lines will be pushed back onto the input stream. If the input object lacks seek and tell capability but has an unread() method that can push back a line of input, Message will use that to push back illegal lines. Thus, this class can be used to parse messages coming from a buffered stream.

The optional seekable argument is provided as a workaround for certain stdio libraries in which tell() discards buffered data before discovering that the lseek() system call doesn't work. For maximum portability, you should set the seekable argument to zero to prevent that initial tell() when passing in an unseekable object such as a file object created from a socket object.

Input lines as read from the file might either be terminated by CR-LF or by a single linefeed; a terminating CR-LF is replaced by a single linefeed before the line is stored.

All header matching is done independent of upper- or lowercase; for example, m['From'], m['from'], and m['FROM'] all yield the same result.

AddressList(field)— You can instantiate the AddressList helper class using a single string parameter, a comma-separated list of RFC 822 addresses to be parsed. (The parameter None yields an empty list.)

parsedate(date)— Attempts to parse a date according to the rules in RFC 822. However, some mailers don't follow that format as specified, so parsedate() tries to guess correctly in such cases. date is a string containing an RFC 822 date, such as 'Mon, 20 Nov 1995 19:12:08 -0500'. If it succeeds in parsing the date, parsedate() returns a 9-tuple that can be passed directly to time.mktime(); otherwise None will be returned.

parsedate_tz(date)— Performs the same function as parsedate(), but returns either None or a 10-tuple; the first nine elements make up a tuple that can be passed directly to time.mktime(), and the tenth is the offset of the date's timezone from UTC (which is the official term for Greenwich Mean Time). (Note that the sign of the timezone offset is the opposite of the sign of the time.timezone variable for the same timezone; the latter variable follows the POSIX standard, whereas this module follows RFC 822.) If the input string has no timezone, the last element of the tuple returned is None.

mktime_tz(tuple)— Turns a 10-tuple as returned by parsedate_tz() into a UTC timestamp. If the timezone item in the tuple is None, assumes local time. Minor deficiency: this first interprets the first eight elements as a local time and then compensates for the timezone difference; this might yield a slight error around daylight savings time switch dates. It is not enough to worry about for common use.
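The two timezone-aware functions survive in Python 3 as email.utils.parsedate_tz() and email.utils.mktime_tz(). The sketch below assumes Python 3 and reuses the sample date from the text; details such as the handling of a missing timezone may differ from the rfc822 behavior described above.

```python
from email.utils import mktime_tz, parsedate_tz

parsed = parsedate_tz("Mon, 20 Nov 1995 19:12:08 -0500")

date_fields = parsed[:6]     # year, month, day, hour, minute, second
utc_offset = parsed[9]       # tenth element: offset from UTC in seconds

# Seconds since the epoch, already adjusted for the -0500 offset.
timestamp = mktime_tz(parsed)

# A string that cannot be parsed yields None instead of a tuple.
unparsable = parsedate_tz("not a date")

print(date_fields, utc_offset, timestamp)
```

An offset of -0500 comes back as -18000 seconds, so 19:12:08 local time corresponds to 00:12:08 UTC on the following day.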

Message Objects

A Message object behaves much like a dictionary. A Message instance also has the following methods:

rewindbody()— Seeks to the start of the message body. This only works if the file object is seekable.

isheader(line)— Returns a line's canonicalized fieldname (the dictionary key that will be used to index it) if the line is a legal RFC822 header; otherwise returns None (implying that parsing should stop here and the line be pushed back on the input stream). It is sometimes useful to override this method in a subclass.

islast(line)— Returns true if the given line is a delimiter on which Message should stop. The delimiter line is consumed, and the file object's read location is positioned immediately after it. By default, this method just checks that the line is blank, but you can override it in a subclass.

iscomment(line)— Returns true if the given line should be ignored entirely, just skipped. By default, this is a stub that always returns false, but you can override it in a subclass.

getallmatchingheaders(name)— Returns a list of lines consisting of all headers matching name, if any. Each physical line, whether it is a continuation line or not, is a separate list item. Returns the empty list if no header matches name.

getfirstmatchingheader(name)— Returns a list of lines comprising the first header matching name, and its continuation line(s), if any. Returns None if no header matches name.

getrawheader(name)— Returns a single string consisting of the text after the colon in the first header matching name. This includes leading whitespace, the trailing linefeed, and internal linefeeds and whitespace if any continuation line(s) were present. Returns None if no header matches name.

getheader(name[, default])— Similar to getrawheader(name), but strips leading and trailing whitespace. Internal whitespace is not stripped. The optional default argument can be used to specify a different default to be returned when there is no header matching name.

get(name[, default])— An alias for getheader(), to make the interface more compatible with regular dictionaries.

getaddr(name)— Returns a pair (full name, email address) parsed from the string returned by getheader(name). If no header matching name exists, returns (None, None); otherwise both the full name and the address are (possibly empty) strings.

Example: If m's first From header contains the string 'alessa@lessaworld.com (Andre Lessa)', m.getaddr('From') will yield the pair ('Andre Lessa', 'alessa@lessaworld.com'). If the header contained 'Andre Lessa <alessa@lessaworld.com>' instead, it would yield the exact same result.

If multiple headers exist that match the named header (for example, if there are several CC headers), all are parsed for addresses. Any continuation lines that the named headers contain are also parsed.

Note that the current version of this function is not really correct. It yields bogus results if a full name contains a comma.

getdate(name)— Retrieves a header using getheader() and parses it into a 9-tuple compatible with time.mktime(). If no header matches name, or it is unparsable, returns None. Date parsing appears to be a black art, and not all mailers adhere to the standard. Although it has been tested and found correct on a large collection of email from many sources, it is still possible that this function might occasionally yield an incorrect result.

getdate_tz(name)— Retrieves a header using getheader() and parses it into a 10-tuple; the first nine elements will make a tuple compatible with time.mktime(), and the 10th is a number giving the offset of the date's timezone from UTC. Similar to getdate(), if no header matches name, or it is unparsable, it returns None.

Message instances also support a read-only mapping interface. In particular: m[name] is similar to m.getheader(name), but raises KeyError if there is no matching header; and len(m), m.has_key(name), m.keys(), m.values(), and m.items() act as expected (and consistently).
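The header lookups and the address parsing performed by getaddr() can be tried out with the Python 3 email package, which is assumed here because rfc822 is a Python 2 module. The message below is invented, apart from the two From forms taken from the example above, and one behavior differs from the text: a missing header looked up with m[name] gives None rather than raising KeyError.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

msg = message_from_string(
    "From: alessa@lessaworld.com (Andre Lessa)\n"
    "To: Renata Lessa <rlessa@lessaworld.com>\n"
    "Cc: a@example.com, Bob <b@example.com>\n"
    "\n"
    "body\n"
)

# Header names are matched without regard to case.
same_header = msg["FROM"] == msg["from"] == msg["From"]

# Both address notations give the same (full name, address) pair.
sender = parseaddr(msg["From"])
sender_alt = parseaddr("Andre Lessa <alessa@lessaworld.com>")

# All addresses from every Cc header, with a default for a missing one.
copies = getaddresses(msg.get_all("cc", []))
missing = msg.get("X-Missing", "n/a")

print(sender, copies, missing)
```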

Finally, Message instances have two public instance variables:

• headers—A list containing the entire set of header lines, in the order in which they were read (except that __setitem__ calls can disturb this order). Each line contains a trailing newline. The blank line terminating the headers is not contained in the list.
• fp—The file object passed at instantiation time.

An AddressList instance has the following methods:

• __str__()— Returns a string representation of the address list. Addresses are rendered in "name" <host@domain> form, comma separated.
• __sub__(alist)— Returns an AddressList instance that contains every address in the left-hand AddressList operand that is not present in the right-hand address operand (set difference).

Finally, AddressList instances have one public instance variable: addresslist, which is a list of string pairs (tuples), one per address. In each member, the first is the canonicalized name part of the address, the second is the route-address (@-separated host-domain pair).

The following example demonstrates the use of the rfc822 module:

import rfc822

mailbox_filename = "mymailbox.msg"
file_handle = open(mailbox_filename)    # pass the variable, not its name as a string
messagedic = rfc822.Message(file_handle)
content_type = messagedic["content-type"]
from_field = messagedic["From"]