Feeding the machine: we give an AI some headlines and see what it does


Turning the lens towards ourselves, so to speak.

There is a moment in any foray into new technological territory when you realize you may have embarked on a Sisyphean task. Looking at the multitude of options available for taking on the project, you research your options, read the documentation, and get to work, only to find that merely defining the problem may be more work than finding the actual solution.

Reader, this is where I found myself two weeks into this adventure in machine learning. I familiarized myself with the data, the tools, and the known approaches to problems with this type of data, and I tried several approaches to solving what on the surface appeared to be a simple machine learning problem: based on past performance, could we predict whether any given Ars headline would be a winner in an A/B test?

Things have not gone particularly well. In fact, as I finished this piece, my most recent attempt showed that our algorithm was about as accurate as a coin flip.

But at least that was a start. And in the process of getting there, I learned a great deal about the data cleaning and preprocessing that go into any machine learning project.

Preparing the battlefield

Our data source is a log of the results of more than 5,500 headline A/B tests over the past five years; that’s roughly how long Ars has been running this kind of headline shootout on every story that gets published. Since we have labels for all of this data (that is, we know whether a headline won or lost its A/B test), this would appear to be a supervised learning problem. All I really needed to do to prepare the data was to make sure it was properly formatted for the model I chose to use to create our algorithm.

I’m not a data scientist, so I wasn’t going to build my own model at any point this decade. Fortunately, AWS provides a number of pre-built models suitable for processing text and designed specifically to work within the confines of the Amazon cloud. There are also third-party models, such as those from Hugging Face, that can be used within the SageMaker universe. Each model needs to be fed data in its own particular way.

The choice of model in this case largely comes down to our approach to the problem. Initially, I saw two possible ways to train an algorithm to produce a probability of success for any given headline:

  • Binary classification: We simply determine the probability that a headline will land in the “win” or “lose” column, based on previous winners and losers. We can then compare the probabilities of two candidate headlines and choose the stronger one.
  • Multiple category classification: We try to rank headlines based on their click-through rates across several categories, for example, rating them from 1 to 5 stars. We could then compare the scores of competing headlines.

The second approach is much harder, and there is one overarching concern with either method that makes the second even less tenable: 5,500 tests, with 11,000 headlines, is not a lot of data to work with in the grand AI/ML scheme of things.

So I opted for binary classification for my first attempt, because it seemed the most likely to succeed. It also meant that the only data point I needed for each headline (besides the headline itself) was whether it had won or lost its A/B test. I took my source data and reformatted it into a comma-separated values file with two columns: headlines in one and “yes” or “no” in the other. I also used a script to strip all the HTML markup out of the headlines (mostly a few HTML tags for italics). With the data cut down to nearly the bare essentials, I loaded it into SageMaker Studio so I could use Python tools for the rest of the preparation.
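That cleanup script isn’t reproduced in the piece, but a minimal sketch of the idea might look like this (the file names and the “headline” and “won” column names are assumptions, not the real export format):

```python
import csv
import re

def strip_html(text):
    """Remove HTML markup (mostly italics tags) from a headline."""
    return re.sub(r"<[^>]+>", "", text)

# Assumed file and column names for illustration only.
with open("raw_headlines.csv", newline="") as infile, \
        open("headlines_clean.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["headline", "won"])
    for row in reader:
        writer.writerow([strip_html(row["headline"]), row["won"]])
```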

Next, I needed to choose the model type and prepare the data. Again, much of the data preparation depends on the type of model the data will be fed into; different types of natural language processing models (and problems) require different levels of data preparation.

After that comes “tokenization.” AWS tech evangelist Julien Simon explains it this way: “Data processing first needs to replace words with tokens, individual tokens.” A token is a machine-readable number that stands in for a string of characters. “So ‘ransomware’ would be word one,” he said, “‘criminals’ would be word two, ‘configuration’ would be word three... so a sentence becomes a sequence of tokens, and you can feed that to a learning model and let it learn which ones are the good ones, which ones are the bad ones.”
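As a toy illustration of what Simon describes (real models use trained tokenizers; this only shows the concept), mapping each distinct word to an integer might look like:

```python
# Assign each distinct word an integer ID, in order of first appearance.
sentence = "ransomware criminals change ransomware configuration"

vocab = {}
tokens = []
for word in sentence.split():
    if word not in vocab:
        vocab[word] = len(vocab) + 1  # 'ransomware' -> 1, 'criminals' -> 2, ...
    tokens.append(vocab[word])

print(tokens)  # [1, 2, 3, 1, 4] -- note the repeated word reuses its token
```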

Depending on the particular problem, you may want to discard some of the data. For example, if we were trying to do something like sentiment analysis (that is, determining whether a given Ars headline has a positive or negative tone) or grouping headlines by subject, I would probably want to trim the data down to the most relevant content by removing “stop words”: common words that are important to grammatical structure but that don’t tell you what the text actually says (such as articles like “the” and “a”).

Tokenized headlines without stop words, via Python’s Natural Language Toolkit (nltk). Note that punctuation sometimes gets packaged with words as tokens; this would need to be cleaned up for some use cases.
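For reference, a stop-word pass like the one in that figure takes only a few lines of nltk (the headline below is a made-up example):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop word lists

stop_words = set(stopwords.words("english"))

headline = "the best laptops of the year so far"  # made-up example headline
tokens = [w for w in word_tokenize(headline.lower()) if w not in stop_words]
print(tokens)  # punctuation, if present, would survive as its own tokens
```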

However, in this case, the stop words were potentially important parts of the data; after all, we’re looking for headline structures that attract attention. So I chose to keep all the words. For my first training attempt, I decided to use BlazingText, a text-processing model that AWS demonstrates on a classification problem similar to the one we’re attempting. BlazingText requires that the “label” data (the data that flags a particular piece of text’s classification) be prefixed with “__label__”. And instead of a comma-delimited file, the label data and the text to be processed are placed on a single line in a text file, like this:

Data prepared for the BlazingText model, with headlines forced to lowercase.
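Concretely, each line of the file pairs the label token with the lowercased headline text; a couple of invented rows would look like this:

```
__label__yes our winning headline text goes here
__label__no our losing headline text goes here
```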

Another part of data preprocessing for supervised machine learning is splitting the data into two sets: one for training the algorithm and one for validating its results. The training set is usually the larger one; validation data is generally drawn from around 10 to 20 percent of the total.

There has been a lot of research into what the right amount of validation data actually is; some of that research suggests that the sweet spot has more to do with the number of parameters in the model being used to create the algorithm than with the overall size of the data. In this case, given that the model had relatively little data to process, I figured my validation data would be 10 percent.

In some cases, you may want to hold back another small pool of data to test the algorithm after it has been validated. But our plan here is to eventually test against live Ars headlines, so I skipped that step.

To do my final data preparation, I used a Jupyter notebook (an interactive web interface to a Python instance) to convert my two-column CSV into a data structure and process it. Python has some decent data manipulation and data science toolkits that make these tasks pretty straightforward, and I used four in particular here:

  • pandas, a popular data manipulation and analysis module that does wonders at slicing and dicing CSV files and other common data formats.
  • sklearn (or scikit-learn), a data science module that takes much of the heavy lifting out of machine learning data preprocessing.
  • nltk, the Natural Language Toolkit, and specifically the Punkt sentence tokenizer, to process the text of our headlines.
  • The csv module to read and write CSV files.

Here is some of the code in the notebook that I used to create my training and validation sets from our CSV data:
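A sketch of that notebook code, reconstructed from the description that follows (the file name and the “headline” and “won” column names are assumptions), would run roughly like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Pull the cleaned two-column CSV into a DataFrame.
dataset = pd.read_csv("headlines_clean.csv")
dataset.head()  # in a notebook, this displays the column headers and first rows

# Bulk-prepend the "__label__" string BlazingText expects to the label column.
dataset["won"] = "__label__" + dataset["won"].astype(str)

# Use a lambda to force every headline to lowercase.
dataset["headline"] = dataset["headline"].apply(lambda h: h.lower())

# Split the data 90/10 into training and validation sets.
train, validation = train_test_split(dataset, test_size=0.1)

# BlazingText wants "__label__x headline text" on each line of a plain
# text file, so write the two sets out by hand.
def write_blazingtext(df, path):
    with open(path, "w") as f:
        for _, row in df.iterrows():
            f.write(f"{row['won']} {row['headline']}\n")

write_blazingtext(train, "headlines.train")
write_blazingtext(validation, "headlines.validation")
```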

I started by using pandas to import the CSV of the initially cleaned and formatted data into a data structure, calling the resulting object “dataset”. Running the dataset.head() command gave me a look at the headers for each column that had been pulled in from the CSV, along with a peek at some of the data.

The pandas module allowed me to bulk-add the string “__label__” to all the values in the label column, as required by BlazingText, and I used a lambda function to process the headlines and force all the words to lowercase. Finally, I used the sklearn module to split the data into the two files I would feed to BlazingText.

