0

Current Problem

I've pieced together a script that downloads attachments from a Gmail mailbox and, for the most part, extracts a set of fields from the email each attachment came from. However, I found that the message ID of an email is sometimes listed as "Message ID" and sometimes as "Message-ID". Because of this I tried to use a regex to allow for anything between "Message" and "ID", but the code raises an error regardless of what I have tried so far with the expression.

Error

> Traceback (most recent call last):
>   File "email-downloader.py", line 64, in <module>
>     msg_id = str(email_message).split("Message+\.*: ", 1)[1].split("\n", 1)[0]
> IndexError: list index out of range

What I've tried

I searched online but could not find an existing answer that resolves this issue. I also tried amending the regex with different placements of + and with \ and [].

Code

        email_from = str(email_message).split("From: ", 1)[1].split("\n", 1)[0]
        subject = str(email_message).split("Subject: ", 1)[1].split("\n", 1)[0]
        ext = os.path.splitext(fileName)[1]
        delivered = str(email_message).split("Date: ", 1)[1].split("\n", 1)[0]
        msg_id = str(email_message).split("Message+\.*: ", 1)[1].split("\n", 1)[0]

        print('File: "{file}".'.format(file=fileName))
        print('Ext: "{ext}".'.format(ext=ext))
        print('Subject: "{subject}".'.format(subject=subject))
        print('From: "{email_from}".'.format(email_from=email_from))
        print('Date Delivered: "{delivered}".'.format(delivered=delivered))
        print('Message ID: "{msg_id}".'.format(msg_id=msg_id))
        print("\n")
        print('"{msg_id}"   "{delivered}"   "{file}"        "{subject}"     "{email_from}"'.format(file=fileName,subject=subject,email_from=email_from,msg_id=msg_id,delivered=delivered), file=open("array/client-ref.tsv", "a"))
        os.rename(os.path.join(dirName,fileName), os.path.join(dirName,msg_id + ext))
El_Birdo

2 Answers

1

To split a string on a regular expression, use the re.split(pattern, string) function from Python's re module; str.split() only accepts a literal separator. The following code should do what you want:

import re
msg_id = re.split("Message.*: ", str(email_message))[1].split("\n", 1)[0]
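A variation on the same idea, as a minimal sketch rather than a definitive fix: `re.search` with a capture group avoids indexing into a split result, so a missing header yields `None` instead of an `IndexError`. The helper name `extract_message_id` and the sample text are hypothetical, and the pattern assumes the only separators are a single space or hyphen.

```python
import re

# Matches "Message-ID: ..." or "Message ID: ..." at the start of a line.
# [ -] allows exactly one space or hyphen between the two words.
MESSAGE_ID_RE = re.compile(
    r"^Message[ -]ID:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE
)


def extract_message_id(text):
    """Return the Message ID value from raw email text, or None if absent."""
    match = MESSAGE_ID_RE.search(text)
    return match.group(1).strip() if match else None


# Hypothetical sample input, for illustration only.
sample = "From: a@example.com\nMessage-ID: <123@example.com>\nSubject: hi\n"
print(extract_message_id(sample))
```

If other spellings turn up later, only the character class in the pattern needs to change.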
0

The str.split() method treats its argument as a literal string, not as a regex. The text "Message+\.*: " never occurs in the email, so the resulting list has only one element and [1] raises the IndexError. You would need to import the regex module re and use re.split() to achieve what you want. If "Message ID" and "Message-ID" are the only two possibilities, you don't have to use regex though. You could first replace one expression with the other, and then split the text:

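# Normalise the hyphenated spelling so that a single literal split works
text = str(email_message).replace('Message-ID', 'Message ID', 1)
# Include ": " in the separator so the result does not start with it
msg_id = text.split("Message ID: ", 1)[1].split("\n", 1)[0]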

As a side note, I don't know what type email_message is, but it makes sense to convert it to a str only once and store the result in another variable, e.g. in case you need email_message in its original type later on. I wouldn't recommend converting it to a str more than once.
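To illustrate the side note, here is a rough sketch that converts the message once and reuses the text. It assumes email_message is an `email.message.Message` from the standard library, which the question does not confirm; the raw sample and the `header_line` helper are made up for the example. Under that assumption, the message object also offers case-insensitive header lookup, which may remove the need for string splitting on well-formed headers.

```python
import email

# Hypothetical raw message, for illustration only.
raw = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "Message-ID: <123@example.com>\n"
    "\n"
    "body\n"
)
email_message = email.message_from_string(raw)

# Convert to str once and reuse the result for every lookup.
text = str(email_message)


def header_line(text, name):
    """Return the text after the first 'name: ' up to the line end, or None."""
    parts = text.split(name + ": ", 1)
    if len(parts) < 2:
        return None
    return parts[1].split("\n", 1)[0]


email_from = header_line(text, "From")
subject = header_line(text, "Subject")

# Header lookup on a Message object ignores case and returns None if missing.
msg_id = email_message["message-id"]

print(email_from, subject, msg_id)
```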

mapf
  • Worked a treat. I'm still only sampling some emails of a larger batch at the moment, so there could be other variations down the line. Do you know if `replace` supports multiple values, or is there a different method one would need to use? This is my first time writing Python, so I'm still trying to get my head around what I can and can't get away with. – El_Birdo May 04 '20 at 08:17
  • `replace` also only handles one search string per call. But if you would like to be more flexible, I recommend you look at [this answer](https://stackoverflow.com/a/59072515/5472354). It also mentions [flashtext](https://www.analyticsvidhya.com/blog/2017/11/flashtext-a-library-faster-than-regular-expressions/), which is apparently even faster than regex, but I haven't tried it myself. – mapf May 04 '20 at 08:23
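For the multiple-values case raised in the comments, one possible approach is `re.sub` with an alternation built from a list of known spellings. This is only a sketch: the `VARIANTS` list (including "Message_ID") and the `normalise` helper are hypothetical, not spellings observed in the question.

```python
import re

# Hypothetical list of spellings to normalise; extend as new ones turn up.
VARIANTS = ["Message-ID", "Message ID", "Message_ID"]

# re.escape keeps any special characters in the variants literal.
VARIANT_RE = re.compile("|".join(re.escape(v) for v in VARIANTS))


def normalise(text, canonical="Message ID"):
    """Replace the first occurrence of any known variant with one spelling."""
    return VARIANT_RE.sub(canonical, text, count=1)


print(normalise("Message_ID: <123@example.com>"))
```

After normalising, the literal `split("Message ID: ", 1)` approach can be applied unchanged.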