Webinar Recap: Scaling Automated Quality Text Generation for Enterprise Sites with Hamlet Batista
For this month’s DeepCrawl webinar we were very excited to be joined by Hamlet Batista, CEO of RankSense, along with DeepCrawl’s CAB Chairman, Jon Myers, to talk about the practical applications of automating neural text generation. In his presentation, Hamlet explored several deep learning approaches that enable us, as SEOs, to create compelling, high-quality content at scale.
We’d like to say a big thank you to Hamlet for his great presentation and for answering our audience’s questions, as well as to Jon for hosting and all those who attended. We hope you enjoyed the webinar!
You can watch the full recording here:
Plus, revisit the slides Hamlet presented here:
Machine learning benefits for SEO
With advancements in machine learning happening every day, Hamlet explained that gaining data science and machine learning skills will not only give SEOs the ability to find solutions that make their work more efficient, but will also make the job a lot more interesting.
Writing quality content and metadata at scale is a big problem for most enterprise sites. When working with large websites there is always the possibility that you will identify missing meta tags, which are key for helping both search engines and users find your website. In addition, creating content and keeping up with the changes that happen daily on a large site can be very tedious and time-consuming.
However, by leveraging the latest advancements in automation you can solve this problem and ensure your site provides high quality content, enabling you to be efficient with other necessary SEO tasks.
Common issues seen on enterprise sites
Hamlet explained the four most common scenarios seen on enterprise websites, before exploring the different text generation approaches which can be used to address these.
With large ecommerce sites you will often find:
- Pages with large images and no text
- Pages with large images and some text
While large publisher sites struggle with the following:
- Pages with lots of quality text and no metadata
- Pages with very little text
This is where the need for automation comes into play. Using Google Colaboratory to create an image captioning and visual question answering model, as well as a state of the art text summarisation model, we will be able to automate the following tasks:
1. Image captioning
2. Visual question answering
3. Text summarisation
4. Question answering from text (short answers)
5. Long-form question answering
6. Full article generation
How to find relevant research and code
Due to the speed at which advancements in deep learning are occurring, there is always something new that will allow what is already working to perform even better. It is therefore important to stay up to date with these advancements, and Hamlet recommends the website Papers with Code for keeping track of them. Not only does this site provide the latest academic papers, it also links each one to the relevant, accessible code in an organised format.
Text generation for Ecommerce sites
With a lot of ecommerce sites, the text used to describe products is often not very enticing or useful, and the page can be dominated by large images. However, with image captioning and visual question answering, we can leverage these images in order to automatically generate a useful description from what is visible in the image.
Using a modular framework called Pythia, which is explained in the Bottom-Up and Top-Down Attention for Image Captioning paper, you will be able to ask standard questions about what is shown in an image. By combining the model's answers with a templated text framework, you end up with a ready-to-use description.
Plus, there is no need to write any code before using the model, as Google Colab will enable you to clone the script to a new notebook to be run straight away.
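The templated-description idea can be sketched in a few lines. Note that `ask_image` below is a hypothetical stand-in for a Pythia-style visual question answering call (the real model runs in the Colab notebook); the template, questions, and canned answers are invented for illustration.

```python
# Sketch: turning visual question answering output into a templated
# product description. `ask_image` is a stub standing in for a real
# VQA model call; the canned answers below are illustrative only.

TEMPLATE = "This {colour} {item} is made of {material}."

def ask_image(image_url, question):
    # In practice, each question is sent to the VQA model with the image.
    canned = {
        "what colour is it?": "blue",
        "what is it?": "dress",
        "what is it made of?": "cotton",
    }
    return canned[question]

def describe(image_url):
    answers = {
        "colour": ask_image(image_url, "what colour is it?"),
        "item": ask_image(image_url, "what is it?"),
        "material": ask_image(image_url, "what is it made of?"),
    }
    return TEMPLATE.format(**answers)

print(describe("https://example.com/product.jpg"))
# -> This blue dress is made of cotton.
```

Keeping the free-form model output confined to short template slots is what makes the generated descriptions safe to publish at scale.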
Getting started with building a captioning model
After selecting the Pythia BUTD captioning demo link, you will be directed to a Google Colab notebook where you can make a copy and save it to your Drive. A copy of the necessary code will then be saved into a new file, and from there you can select to run all cells.
This will automatically perform each step for you, starting with downloading the data sources needed to run the model, and completing all of the setup steps that would typically need to be undertaken manually.
Once this has finished, a prompt will appear where you can input the image you want to caption. Simply input the URL of the image, run it through the Colab notebook and, without the need to write any code, a caption for the image will be automatically generated. This system will enable you to take images from the web and generate relevant, accurate captions.
Making it practical
After running a crawl of your site with DeepCrawl (making sure you have enabled image resource crawling in the advanced settings), export the data relevant to the image URLs and upload it to your Colab notebook. Then simply pass each URL through the text generation function and export the results, to create a list of captions for all of the images on your site.
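The loop over an exported list of image URLs might look something like the sketch below. `caption_image` is a hypothetical placeholder for the captioning model in the Colab notebook, and the `image_url` column name is an assumption about the export format.

```python
import csv
import io

def caption_image(url):
    # Placeholder for the captioning model call (hypothetical).
    return f"caption for {url}"

def caption_export(export_csv_text):
    # Iterate every exported image URL through the captioning function.
    rows = csv.DictReader(io.StringIO(export_csv_text))
    return [(row["image_url"], caption_image(row["image_url"])) for row in rows]

export = "image_url\nhttps://example.com/a.jpg\nhttps://example.com/b.jpg\n"
captions = caption_export(export)
for url, cap in captions:
    print(url, "->", cap)
```

The resulting list of (URL, caption) pairs can then be exported again and matched back against the All Pages report.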
Use the All Pages crawl data generated by DeepCrawl to identify any pages which may be lacking a description or require an updated title, and incorporate your captioned text when creating these.
Ask questions from images
Pythia also enables you to ask questions about the images on your site. Following the same steps as before, add an image URL to the Colab notebook along with a text value containing your question. The model will then make several predictions, with the most confident answer at the top of the list.
How the Pythia system works
Using a concept called embeddings, the Pythia system encodes the input information, in this case the list of questions and images, so it can be run through a neural network.
The system encodes both the questions and the images and combines them to learn a model that generates accurate answers. An attention mechanism is then used to focus on different sections of the input, learning which parts are most important and helping the system decide on the best predictions to make.
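As a toy illustration of the attention idea, the sketch below has a question vector "attend" over a set of image-region feature vectors, weighting the regions most relevant to the question. The dimensions and values are made up; real models use learned, much higher-dimensional embeddings.

```python
import numpy as np

def attend(question_vec, region_feats):
    scores = region_feats @ question_vec             # similarity per region
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over regions
    return weights @ region_feats                    # weighted combination

q = np.array([1.0, 0.0])          # toy question embedding
regions = np.array([[0.9, 0.1],   # region similar to the question
                    [0.0, 1.0]])  # unrelated region
context = attend(q, regions)
print(np.round(context, 2))
# -> [0.64 0.36]
```

The output leans towards the first region's features because it scored higher against the question, which is exactly the "focus on the most important parts" behaviour described above.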
Text generation for web publishers
Text summarisation is the practice of generating summaries from a large volume of text. There are two approaches to text summarisation:
- Abstractive - This approach generates novel sentences. While it is more intuitive, it also makes more mistakes.
- Extractive - A system that ranks the sentences contained within the text based on how effective they would be as a summary of the whole article.
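To make the extractive idea concrete, here is a minimal word-frequency baseline that ranks sentences and keeps the top one. This is a deliberately simple sketch, not BERTSUM: it scores each sentence by the average corpus frequency of its words.

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    # Split into sentences, score each by average word frequency,
    # and return the n highest-scoring sentences as the summary.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    return sorted(sentences, key=score, reverse=True)[:n]

article = ("Search engines crawl pages. "
           "Search engines index pages daily. Cats sleep.")
print(extractive_summary(article))
# -> ['Search engines crawl pages.']
```

Modern extractive models like BERTSUM replace the frequency score with a learned relevance score, but the overall rank-and-select structure is the same.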
When deciding on which approach to put into practice, Hamlet recommends reviewing the results to see how well the model is performing. In the case below, BERTSUM+Transformer is generating the best results for extractive text summarisation.
Building an extractive text summarisation model
For building an extractive text summarisation model, Hamlet explored an approach using BERTSUM, a modified version of the BERT model that has been designed specifically for text summarisation. BERT is a pre-trained Transformer which consistently achieves ground-breaking performance across multiple NLP tasks.
Following the Fine-tune BERT for Extractive Summarisation paper, the first step is to open the GitHub link contained within the paper. Then download the processed data needed for training, before opening the repository within Google Colab. You then need to include the training data within your Colab notebook, along with the Python script contained within the GitHub readme documentation, to start the training of the system.
In order to produce the best results, this model needs to run for 50,000 steps on a single GPU from Google Colab, which will take around two days. It’s also important to note that Colab sessions can disconnect, which could lead to you losing the progress made so far. Hamlet therefore recommends saving the progress to a directory in Google Drive, to ensure it is stored outside of the Colab notebook. This will allow you to resume training from the step where the process disconnected.
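The save-and-resume idea can be sketched as below. The checkpoint directory here is a temporary folder so the sketch is self-contained; in Colab it would be a mounted Google Drive path (e.g. somewhere under `/content/drive/` after mounting Drive), and a real training script would save full model checkpoints, not just a step counter.

```python
import os
import tempfile

# Stand-in for a Drive-backed directory that survives the session.
CKPT_DIR = tempfile.mkdtemp()

def save_step(step):
    # Record the last completed training step.
    with open(os.path.join(CKPT_DIR, "last_step.txt"), "w") as f:
        f.write(str(step))

def resume_step():
    # Return the step to resume from, or 0 on a fresh run.
    path = os.path.join(CKPT_DIR, "last_step.txt")
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read())

start = resume_step()               # 0 on a fresh run
for step in range(start, start + 3):
    save_step(step + 1)             # checkpoint after each "training step"
print(resume_step())
# -> 3
```

Because the counter (and, in practice, the model weights) lives outside the notebook's ephemeral filesystem, a disconnect only costs you the work since the last checkpoint.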
The output will then contain both the gold summary and the candidate summary generated by the machine after the training steps, and these results can be used to scale your efforts with text summarisation. The gold summary is the reference summary, provided as the "ground truth"; the goal is for the generated candidates to match it.
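A quick way to sanity-check candidates against the gold summary is unigram recall, in the spirit of ROUGE-1. This is a simplified sketch (no stemming, no bigram or longest-common-subsequence variants) with invented example strings.

```python
def unigram_recall(gold, candidate):
    # Fraction of the gold summary's words that appear in the candidate.
    gold_words = gold.lower().split()
    cand_words = set(candidate.lower().split())
    overlap = sum(1 for w in gold_words if w in cand_words)
    return overlap / len(gold_words)

gold = "the model generates short summaries"
cand = "the model writes short summaries"
print(round(unigram_recall(gold, cand), 2))
# -> 0.8
```

Tracking a score like this across training runs gives a rough signal of whether the candidates are converging towards the references.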
Question answering
Using a question answering model will enable you to generate metadata based on a small amount of text contained on a page. A state-of-the-art model called XLNet, which uses an approach based on permutations of the word order, is able to intelligently predict the best answers by considering all of the possible combinations.
Loving this @DeepCrawl webinar with @hamletbatista - sharing tips and tricks for using #python, open data sets, pre-trained image classifiers, #BERT for text understanding and generation, #XLNet for Q&A... Early morning nerdgasm happening here. pic.twitter.com/7waEfQ4KUc
— MichelleRobbins (@MichelleRobbins) August 7, 2019
Similar to the approach used for generating text from images, question answering output should be combined with templates, and the answers should be kept short. The next exciting challenge, recently launched by Facebook, is creating algorithms that can answer long-form questions.
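Combining short QA answers with a template might look like the sketch below. `answer` is a hypothetical stand-in for an XLNet-style question answering call over the page text, and the template, questions, and canned answers are invented for illustration.

```python
# Sketch: building a meta description from short QA answers plus a
# template. `answer` stubs out a real QA model call (hypothetical).

TEMPLATE = "{product} from {brand}: {benefit}. Shop now."

def answer(question, page_text):
    canned = {
        "what is the product?": "Running shoes",
        "who makes it?": "Acme",
        "what is the main benefit?": "lightweight cushioning",
    }
    return canned[question]

def meta_description(page_text):
    desc = TEMPLATE.format(
        product=answer("what is the product?", page_text),
        brand=answer("who makes it?", page_text),
        benefit=answer("what is the main benefit?", page_text),
    )
    return desc

print(meta_description("page text here"))
# -> Running shoes from Acme: lightweight cushioning. Shop now.
```

Because each QA answer only fills a short template slot, the output stays concise and predictable, which is exactly why short answers plus templates work better than free-form generation for metadata.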
Hear more from Hamlet in our upcoming Q&A post
The audience asked so many brilliant questions during the webinar that Hamlet wasn’t able to answer them all at the time. Don’t worry if your question wasn’t answered though, we have sent all the remaining questions to Hamlet who will answer them all for a Q&A post which will be coming soon to the DeepCrawl blog.
Get started with DeepCrawl
To learn more about any of the methods for automated text generation discussed in this webinar, you can find Hamlet’s recommended resources in his slides here. Plus, if you’re interested in learning about how DeepCrawl can help by finding missing metadata and crawling images to export for captioning, then you can take advantage of our two week no-obligation free trial to get started.