Sign in

Extract DataFrame from Email via Gmail API & Python

image from unsplash.com

This is a quick walkthrough to extrace HTML tables from your (google) emails, using Gmail API, Pandas and Python. Part of this article was taken and modified from Qiuckstart Gmail API documentation.

First, you should enable Gmail API and Download the generated credentials.json file from the API docs. Note, you can choose your OAuth client as Desktop Client. Make sure to click ‘DOWNLOAD CLIENT CONFIGURATION’.

https://developers.google.com/gmail/api/quickstart/python#step_1_turn_on_the

Next, install the required libraries:

Then, we are going to authenticate and establish a connection to Gmail API. The python file (or jupyter lab) you are writing this in, should be in the same location as the downloaded ‘credentials.json’. I have written the code below in the file named ‘quickstart.py’.

save and execute the file:

This will open a web browser, where google’s asking you to sign in to your target account. You can ignore the ‘Google hasn’t verified this app’ warning as this will be for testing/development purposes and not deployment. So, click ‘Advanced’ and on the bottom click ‘Go to Quickstart (unsafe)’. Then click ‘Allow’ and ‘Allow’ again.

If the page stuck in loading, while the python script is still running in the background, stop the browser page from loading and refresh the page. This should fix the issue. Monitor your python terminal/output to see list of your Gmail folders. similar to:

Labels:
CHAT
SENT
INBOX
IMPORTANT
TRASH

If you managed to print the folders, you have successfully authenticated Gmail API. Notice the created ‘token.pickle’ in your python directory. This enables you to authenticate without manual login next time you want to use the API. Make sure to keep credentials.json and token.pickle safe. (e.g do not push to git).

besides service.users().labels() function which we used above, we can use other functions to access other endpoints (e.g: messages/emails).

The code below lists the ids of the emails in the INBOX folder, having subject ARK Investment Management Trading Information. Add the followings to the main function or to your jupyter lab.

‘messages’ variable is a list of dictionaries, each only having id and threadId. We will use the id to download the actual content of the first email.

Finally, we can just use pandas to parse tables inside the html. Note, pandas will attempt to parse all possible tables within the html, therefore, read_html will return a list. In this case, there’s only 1 table, hence, I am choosing index [0]:

produced dataframe (image by author)

AI/ML Tutorials

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store