Parsing WhatsApp Messages
Export Chat
The WhatsApp export format may change in the future. The code in this post was tested on version 2.21.14.24 of the WhatsApp client. To export a chat, open the chat window and select More from the options menu. There you will see the Export Chat option. Select it and choose to export without media. Preparing the export takes some time. Finally, you’ll get the option to save or share the exported txt file.
Reading Lines
The exported file is just a text file where some lines are the start of a message and some are the continuation of one. The first step is to read the whole file and keep everything in memory, line by line.
with open('exported.txt', encoding='utf-8') as f:
    all_lines = f.readlines()
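To make the two kinds of lines concrete, here is a small sketch that reads a made-up export from memory instead of from disk. The sender names and message text are invented, and the timestamp layout is assumed to match the version of the client mentioned above.

```python
import io

# Made-up sample standing in for an exported chat; names and text are invented.
sample_export = (
    "7/21/21, 9:15 PM - Alice: Here is the quote\n"
    "It continues on a second line\n"
    "7/21/21, 9:17 PM - Bob: Nice one\n"
)

# io.StringIO behaves like an open text file, so readlines() works the same way.
with io.StringIO(sample_export) as f:
    sample_lines = f.readlines()

# Lines 0 and 2 start a message; line 1 has no timestamp and continues line 0.
for number, line in enumerate(sample_lines):
    print(number, repr(line))
```

Note that readlines() keeps the trailing newline on every line, which is convenient later when lines are glued back together.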
To identify the lines that start a message I use trial and error. I assume that every line is the start of a message and try to parse a date from the text before its first hyphen. If parsing fails I skip the line; otherwise I record its index as the start of a message. It is not an ideal solution, but it works in most cases and for small files.
from datetime import datetime

# indices of the lines that start a message
msg_indices = []
for i, l in enumerate(all_lines):
    d = l.split('-')
    if len(d) >= 2:
        # MM/DD/YY, HH:MM AM/PM
        possible_dt = d[0].strip()
        try:
            datetime.strptime(possible_dt, '%m/%d/%y, %I:%M %p')
            msg_indices.append(i)
        except ValueError:
            # not a timestamp, so this line continues the previous message
            pass
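The same check can be wrapped in a small predicate, which makes it easy to try on a few lines before running it over a whole export. This is only a sketch: is_message_start and TIMESTAMP_FORMAT are names introduced here for illustration, and the sample lines are made up.

```python
from datetime import datetime

# Assumed export format: MM/DD/YY, HH:MM AM/PM
TIMESTAMP_FORMAT = '%m/%d/%y, %I:%M %p'

def is_message_start(line):
    """Return True when the text before the first '-' parses as a timestamp."""
    prefix, sep, _ = line.partition('-')
    if not sep:
        return False
    try:
        datetime.strptime(prefix.strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return False
    return True

# Made-up sample lines; the second one contains a hyphen but no timestamp.
sample_lines = [
    "7/21/21, 9:15 PM - Alice: Here is the quote\n",
    "It continues - on a second line\n",
    "7/21/21, 9:17 PM - Bob: Nice one\n",
]
starts = [i for i, line in enumerate(sample_lines) if is_message_start(line)]
print(starts)  # [0, 2]
```

A continuation line that happens to begin with text in exactly this timestamp layout followed by a hyphen would be misclassified, which is one reason the approach is not ideal.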
Senders
To group all messages by sender I take each message index and extract the sender name from that line. With the name as key, I store the message information, which is a tuple (start_index, end_index, timestamp).
# message spans grouped by sender
msg_sender_indices = {}
for i in range(len(msg_indices)):
    l = all_lines[msg_indices[i]]
    msg_date = datetime.strptime(l.split('-')[0].strip(), '%m/%d/%y, %I:%M %p')
    # split only at the first '-' so a hyphen in the sender name is kept
    msg_sender = l.split('-', 1)[1].split(':')[0].strip()
    if i < len(msg_indices) - 1:
        # the message ends on the line before the next message starts
        r = (msg_indices[i], msg_indices[i+1] - 1, msg_date)
    else:
        # the last message runs to the end of the file
        r = (msg_indices[i], len(all_lines) - 1, msg_date)
    msg_sender_indices.setdefault(msg_sender, []).append(r)
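For a picture of what the grouped structure ends up holding, here is a sketch on made-up lines. It swaps in a namedtuple and a defaultdict for readability; MessageSpan and spans_by_sender are hypothetical names, and the post itself uses a plain tuple in a regular dict.

```python
from collections import defaultdict, namedtuple
from datetime import datetime

# Hypothetical record type; the post stores a plain (start, end, timestamp) tuple.
MessageSpan = namedtuple('MessageSpan', ['start', 'end', 'timestamp'])

# Made-up sample lines and the start indices found for them.
sample_lines = [
    "7/21/21, 9:15 PM - Alice: Here is the quote\n",
    "It continues on a second line\n",
    "7/21/21, 9:17 PM - Bob: Nice one\n",
    "7/21/21, 9:20 PM - Alice: Thanks\n",
]
msg_indices = [0, 2, 3]

spans_by_sender = defaultdict(list)
# Pair each start with the next start; the last message runs to the end of the list.
next_starts = msg_indices[1:] + [len(sample_lines)]
for start, next_start in zip(msg_indices, next_starts):
    header, _, rest = sample_lines[start].partition('-')
    timestamp = datetime.strptime(header.strip(), '%m/%d/%y, %I:%M %p')
    sender = rest.split(':')[0].strip()
    spans_by_sender[sender].append(MessageSpan(start, next_start - 1, timestamp))

for sender, spans in spans_by_sender.items():
    print(sender, [(s.start, s.end) for s in spans])
```

Lines without a "sender:" part, such as notices generated by the app, would end up under a key made of the whole line text, so it may be worth filtering those keys out.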
Messages
With this information I can easily look for a sender’s message and reconstruct it with the information available in the tuple.
msgs = []
# 'sender_name' is a placeholder for one of the keys of msg_sender_indices
for msg_info in msg_sender_indices['sender_name']:
    start_index, end_index, msg_date = msg_info
    msg = msg_date.strftime('%m/%d/%y') + " \n"
    for i in range(start_index, end_index + 1):
        if i == start_index:
            # drop the 'date - sender:' header, cutting only at the first '-' and ':'
            msg += all_lines[i].partition('-')[2].partition(':')[2]
        else:
            msg += all_lines[i]
    msgs.append(msg)
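The reconstruction step can also be written as a small reusable helper. The sketch below assumes the first line has the shape "date - sender: text" and cuts only at the first '-' and the first ':' after it, so hyphens and colons inside the message body survive. message_text is a name made up for this example, and the sample lines are invented.

```python
def message_text(lines, start, end):
    """Join lines[start..end] into one message, dropping the 'date - sender:' header."""
    _, _, after_date = lines[start].partition('-')
    _, _, first_line = after_date.partition(':')
    # Continuation lines are appended unchanged, newlines included.
    return first_line.lstrip() + ''.join(lines[start + 1:end + 1])

# Made-up sample; the first message contains both ':' and '-' in its body.
sample_lines = [
    "7/21/21, 9:15 PM - Alice: Meet at 10:30 - sharp\n",
    "Bring snacks\n",
    "7/21/21, 9:17 PM - Bob: Nice one\n",
]
print(message_text(sample_lines, 0, 1))
print(message_text(sample_lines, 2, 2))
```

Unlike the loop in the post, this helper does not prepend the date; that can be added by the caller from the timestamp stored in the tuple.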
Application
Structured time series data extracted from an exported chat can be used for a lot of interesting NLP analysis, such as word clouds, sentiment analysis, or chat frequency over time. I did it to make a collage of quotes I received in a group chat.
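As one example of the frequency idea, the timestamps stored in the tuples can be bucketed by hour of day with a Counter. This is a minimal sketch with made-up timestamps; in practice they would come from the third element of each message tuple.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for the third element of each message tuple.
timestamps = [
    datetime(2021, 7, 21, 21, 15),
    datetime(2021, 7, 21, 21, 17),
    datetime(2021, 7, 22, 8, 5),
]

# Count how many messages fall into each hour of the day (0-23).
messages_per_hour = Counter(ts.hour for ts in timestamps)
for hour, count in sorted(messages_per_hour.items()):
    print(f"{hour:02d}:00  {count}")
```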