How to Build and Deploy a Lyrics Generation Model — Framework Agnostic

Keyboard

You’ll find tons of articles about how to build a machine learning model. You’ll find somewhat fewer on how to consume it intelligently. And you’ll find almost none about how to serve it from scratch.

I’ll detail the steps that took us to the product you can see above: raplyrics.eu. Some of the repositories are open source and available on GitHub. Update: We propose a more cost-effective and more straightforward way to serve your machine learning project in this post.

What?

A good friend of mine and I really love listening to rap music. Rap is powerful because it can deliver a savage punchline in only a few words. Since it is still hard to generate long texts with an RNN, we believed rap lyrics were a great candidate.

raplyrics final product
Figure 1: The final product

How?

The big picture

raplyrics architecture
Figure 2: Project architecture

I won’t describe the implementation in much detail in this post, since we tried to be exhaustive inside the code repositories (see the READMEs). I will focus on the points that were challenging for us. Basic sysadmin knowledge and Unix proficiency will help.


1- Data Extraction and processing

GitHub repository: RapLyrics-Scraper

— scraping

raplyrics scrapping
Figure 3: Project architecture

First, we need a dataset to train our neural network. Luckily, Genius.com has tons of lyrics available online and even a nice API. The API may not be designed for scraping lyrics, but with some workarounds we managed to build a lyrics scraper on top of it. Check the source code or reach out in the comments if you need technical details. After several attempts, we realized how important a high-quality dataset is for natural language processing. We decided to focus on the 60 most popular songs of 40 US artists.

✔ That’s it for the scraping.

— pre-processing

The scraping part provides us with a .txt dataset. We now have to clean it, i.e. remove non-lyrical content (©, ®, credits) and normalize typos and variant spellings of the same word, such as gettin' and getting. The methodology we followed:
  1. Identify patterns to eliminate
  2. Craft regexes catching those patterns (a resource for regex testing: pythex.org)
  3. Use a text editor to apply those regexes directly to the dataset
If you want to automate the regex cleaning, be aware that it is risky: each substitution changes the text that later patterns see, so you have to think carefully about the order in which you apply your regular expressions.
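For readers who do want to automate the cleaning, here is a minimal Python sketch of an ordered rule list. The rules themselves are illustrative assumptions, not the exact patterns we applied by hand; the point is only that the rules run top to bottom, so each one sees the output of the previous one.

```python
import re

# Hypothetical cleaning rules, applied top to bottom.
# Order matters: an early rule changes the text seen by the later ones.
CLEANING_RULES = [
    # 1. Drop copyright and trademark marks.
    (re.compile(r"[©®]"), ""),
    # 2. Drop whole lines that start with "Credits".
    (re.compile(r"^[ \t]*Credits\b.*$", re.IGNORECASE | re.MULTILINE), ""),
    # 3. Normalize dropped g's: gettin' -> getting (leaves "ain't" alone).
    (re.compile(r"\b(\w+)in'(?!\w)"), r"\1ing"),
    # 4. Strip trailing spaces left behind by the removals above.
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),
    # 5. Collapse runs of spaces or tabs into a single space.
    (re.compile(r"[ \t]{2,}"), " "),
    # 6. Keep at most one blank line between verses.
    (re.compile(r"\n{3,}"), "\n\n"),
]


def clean_lyrics(text: str) -> str:
    """Apply every cleaning rule in order and trim the result."""
    for pattern, replacement in CLEANING_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Rules 4 to 6 run last on purpose: removing marks and credit lines leaves stray spaces and empty lines behind, and the whitespace rules can only tidy them up once they exist.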

— augmenting the dataset

[optional]
We chose only artists with really meaningful lyrics, and we selected their most popular songs. That does not make a huge corpus. Hence, we decided to perform a step of data augmentation to virtually increase the size of our dataset.
📖 Data augmentation means increasing the number of data points. In our context, it means increasing the number of sentences.
We copied our dataset, shuffled all the verses and appended the shuffled copy to the end of the original dataset.
You can find a snippet on how to shuffle paragraphs here.
With this trick we double the size of our dataset. This should have a positive impact on the training of the neural network: because of the shuffling, batches drawn from the copy differ from those drawn from the original, so the network weights are updated with different inputs.
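The copy, shuffle and append trick can be sketched in a few lines of Python. This is an illustration rather than the snippet linked above, and it assumes verses are separated by blank lines; the function name and the optional seed parameter are ours.

```python
import random
import re


def augment_by_shuffling(corpus: str, seed=None) -> str:
    """Return the corpus followed by a copy whose verses are shuffled."""
    # Assumption: a verse is a paragraph, i.e. a block delimited by blank lines.
    verses = [v.strip() for v in re.split(r"\n\s*\n", corpus) if v.strip()]
    shuffled = verses[:]                    # copy, so the original order survives
    random.Random(seed).shuffle(shuffled)   # a seed makes the shuffle reproducible
    return "\n\n".join(verses + shuffled) + "\n"
```

Only the order of the verses changes: the lines inside each verse stay together, so the shuffled copy still contains coherent verses.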

✔ That’s it for the data augmentation.

2- Building a lyrics generative model

GitHub repository: RapLyrics-Back

— dimensioning the text generative model

raplyrics training infra
Figure 4: Training setup
Many neural network implementations are available online. We chose one and fine-tuned it to fit our needs: textgenrnn, a Python project for text generation using neural networks. You can find a basic description of the model’s hyperparameters and the training settings in the READMEs of our code repositories. The purpose of this article is not to deep-dive into neural network design, so the implementation won’t be detailed here. You can check the source code or ping us in the comments.

— training the text generative model

Depending on your dataset and your configuration, you may consider cloud computing to speed up the training. We used AWS (Amazon Web Services). If you train your model locally, you can skip this part. Otherwise, be aware that the following part gets a bit technical. I will describe our training setup in more detail, since it took us time to get right. We launched an AWS EC2 spot instance to reduce our cost. We needed at least 3 GB of RAM, and the default 8 GB SSD was enough. The training was not GPU-accelerated (a point to improve).
raplyrics training infra
Figure 5: 83% savings on a spot instance compared to a classic EC2 instance
How is an EC2 spot instance different from a classic EC2 instance?
You bid for an EC2 instance with certain specs and, as long as your bid is above the current spot market price, you have an instance behaving like a classic EC2 instance.
If your bid falls below the market price, your instance is terminated after a short notice. More info on spot instances.
We made a spot request, which was fulfilled in no time. We then cloned our repo and set up a Python 3 virtual env with all the project requirements.
Note: You need to allow your instance to write to an S3 bucket if you want to save your model checkpoints (see below 👇).
textgenrnn saves a model checkpoint at each epoch. To cope with the risk of instance termination and keep our checkpoints in a safe place, we use the AWS CLI to copy the checkpoints to an S3 bucket. cd to your checkpoint files and copy them to your S3 bucket.
bash

# run `pip install awscli` beforehand
aws s3 cp my-checkpoint-file.ckpt s3://my-s3-bucket/model-saves/
    
Note: To make this possible, you need to grant your EC2 instance write access to S3. To do this, add a role to your EC2 instance with the S3 full access and EC2 full access policies, as described in the screenshot below.
AWS policy
Figure 6: Attach these two policies to the IAM role attached to your EC2 instance.
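Since a spot instance can disappear at short notice, it can be convenient to script the checkpoint copy described in this section instead of typing it by hand. The sketch below wraps the same aws s3 cp command in Python. It assumes the AWS CLI is installed, that the instance role allows writing to the bucket, and that checkpoints use the .ckpt extension from the example above; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_upload_commands(checkpoint_dir, s3_prefix):
    """Build one `aws s3 cp` command per .ckpt file, sorted by file name."""
    prefix = s3_prefix.rstrip("/") + "/"   # trailing slash: copy *into* the prefix
    files = sorted(Path(checkpoint_dir).glob("*.ckpt"))
    return [["aws", "s3", "cp", str(path), prefix] for path in files]


def upload_checkpoints(checkpoint_dir, s3_prefix):
    """Run the copy commands; raises CalledProcessError if one of them fails."""
    for command in build_upload_commands(checkpoint_dir, s3_prefix):
        subprocess.run(command, check=True)
```

Calling upload_checkpoints(".", "s3://my-s3-bucket/model-saves") after each epoch, or from a cron job, would keep the bucket up to date. Building the commands separately from running them makes it easy to print them first and check what would be copied.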

3- Serving the text generative model

To provide users with better lyrics, we use a custom generation function.

AWS policy
Figure 7: Model serving strategy
We serve the Flask app with gunicorn. The idea is not to reload the model at each API call, which would lead to long response times. We restore the session only once, at app initialization, and it persists between API calls.
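The load-once idea can be illustrated with a small sketch. To keep it dependency-free, it uses a bare WSGI callable in place of Flask (gunicorn can serve any WSGI callable) and a fake model in place of the trained network. The class, the "input" and "output" JSON keys and the load counter are assumptions made for the example, not our actual API.

```python
import json
from functools import lru_cache

LOAD_COUNT = 0  # only here to make the "loaded once" behaviour observable


class FakeLyricsModel:
    """Hypothetical stand-in for the trained text-generation model."""

    def generate(self, seed_text: str) -> str:
        return seed_text + " ..."


@lru_cache(maxsize=1)
def get_model():
    """Build the model on the first call and return the same object afterwards."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return FakeLyricsModel()  # a real app would restore the weights from disk here


get_model()  # eager load at import time, so no request pays the loading cost


def app(environ, start_response):
    """WSGI entry point: read a JSON body, generate text, answer in JSON."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
    lyrics = get_model().generate(payload.get("input", ""))  # reuses the cached model
    body = json.dumps({"output": lyrics}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Saved as, say, my_module.py, this could be started with gunicorn my_module:app. Each gunicorn worker process imports the module and therefore loads its own copy of the model once; the --preload option loads the application before the workers are forked.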
Demo of the call to the API and its response.
Inference test
Figure 8: Test backend inference with curl

4- Plugging in the front end

GitHub repository: RapLyrics-Front
This part describes the necessary steps for an Apache web server. If you don’t have it installed:
bash

sudo apt-get install apache2
    
Move all the files to be served from RapLyrics-Front to /var/www/html/. Remember to update the "url" setting of your endpoint in index.html. That’s it, you’re done (kind of). You can now reach the website by entering your server IP in a web browser.

— production set-up

[optional]

These are the next steps if you want to have the front end and back end on the same machine with an HTTPS connection.

  1. Let’s encrypt our website 🔒 → follow the steps in How To Secure Apache with Let’s Encrypt (DigitalOcean has really awesome tutorials)
  2. Our index.html served by Apache calls raplyrics.eu/apiUS when the user submits an input. However, there is no /apiUS route on Apache. We need to redirect this call to the gunicorn server running on this very same machine. This is called reverse proxying.
Let’s handle these two steps. Since this configuration is specific to our production Apache setup, it is not version controlled.
Go to /etc/apache2/sites-available
You should see a 000-default.conf and a 000-default-le-ssl.conf file. They are template files that configure how Apache serves your HTTP and HTTPS (le-ssl) website. We make a copy of them for our website (replace raplyrics.eu with your domain name 👇).
bash

sudo cp 000-default.conf raplyrics.eu.conf
sudo cp 000-default-le-ssl.conf raplyrics.eu-le-ssl.conf
    

i) Redirect traffic from http to https

Edit raplyrics.eu.conf to include the rewrite conditions below:
<VirtualHost *:80>
 
<some code>
RewriteEngine on
RewriteCond %{SERVER_NAME} =raplyrics.eu [OR]
RewriteCond %{SERVER_NAME} =www.raplyrics.eu
RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent]
</VirtualHost>
Rewrite rules to redirect traffic from HTTP to HTTPS. Remember to replace raplyrics.eu with your site name.

ii) Reverse proxy the API call

Edit raplyrics.eu-le-ssl.conf to include the reverse proxy directives.
<IfModule mod_ssl.c>
<VirtualHost *:443>
 <some code>
 ProxyPreserveHost On
 ProxyPass /apiUS http://127.0.0.1:8000/apiUS
 ProxyPassReverse /apiUS http://127.0.0.1:8000/apiUS
<some code>
</VirtualHost>
</IfModule>
Reverse proxy to redirect the API call to the gunicorn app
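This is where we handle the proxy pass from raplyrics.eu/apiUS to the local gunicorn server at 127.0.0.1:8000. These directives need Apache’s proxy modules; if they are not enabled yet, sudo a2enmod proxy proxy_http enables them. Now we tell Apache to enable the new website configuration: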
bash

sudo a2ensite raplyrics.eu.conf
sudo a2ensite raplyrics.eu-le-ssl.conf
    
Finally, run sudo systemctl restart apache2 to apply the changes. That’s it, you’re in production. 🚀 You can check ours on raplyrics.eu

References