End of the Ars AI headline experiment: We came, we saw, we used a lot of compute time



Aurich Lawson | Getty Images

We may have bitten off more than we could chew, folks.

An Amazon engineer told me that when he heard what I was trying to do with the Ars headlines, his first thought was that we had chosen a deceptively hard problem. He warned me that I needed to be careful about setting my expectations. If it were a real business problem … well, the best he could do was suggest rephrasing the problem from "good or bad headline" to something less concrete.

That statement was the most concise and tactful way to frame the outcome of my intensive four-week, part-time crash course in machine learning. At this point, my PyTorch kernels are not so much torches as they are garbage fires. Accuracy has improved slightly, thanks to professional intervention, but I am nowhere near a working, deployed solution. Today, while supposedly on vacation visiting my parents for the first time in over a year, I sat on a couch in their living room working on this project and accidentally launched a model training job locally on the Dell laptop I had brought, with its 2.4GHz Intel Core i3-7100U CPU, rather than in the SageMaker copy of the same Jupyter notebook. The Dell crashed so hard that I had to pull the battery to restart it.

But hey, if the machine isn’t necessarily learning, at least I am. We are almost at the end, but if this were a class assignment, my grade on the transcript would probably be “Incomplete”.

The gang tries some machine learning

Bottom line: I was given the headline pairs used for Ars articles over the past five years, along with data on the A/B test winners and their relative click-through rates. I was then asked to use Amazon Web Services' SageMaker to create a machine-learning algorithm to predict the winner in future pairs of headlines. I ended up going down some ML dead ends before consulting various Amazon sources for much-needed help.
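To make the task a bit more concrete, here is a minimal sketch of one way the problem can be framed: flattening each A/B pair into individually labeled headlines for a binary "won its test / lost its test" classifier. The column names and sample rows are assumptions for illustration, not the actual Ars dataset schema.

import pandas as pd

# Hypothetical schema for an A/B test export; these field names are assumed,
# not taken from the real dataset.
pairs = pd.DataFrame({
    "headline_a": ["First candidate headline", "Another candidate headline"],
    "headline_b": ["Second candidate headline", "Yet another candidate headline"],
    "winner": ["a", "b"],  # which headline won the A/B test
})

# Flatten each pair into two labeled rows: 1 = won its test, 0 = lost.
rows = []
for _, p in pairs.iterrows():
    rows.append({"text": p["headline_a"], "label": int(p["winner"] == "a")})
    rows.append({"text": p["headline_b"], "label": int(p["winner"] == "b")})

dataset = pd.DataFrame(rows)
print(dataset)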

Most of the pieces are in place to finish this project. We (or, more accurately, my "phone a friend at AWS" lifeline) had some success with different modeling approaches, although the accuracy (just north of 70 percent) was not as definitive as one would like. I have enough to work with to produce, with some extra effort, a deployed model and code to run predictions on headline pairs, provided I follow their notes and use the algorithms they created as a result.

But I have to be honest: my efforts to reproduce that work, both on my own local server and in SageMaker, have failed. In the process of feeling my way through the intricacies of SageMaker (including forgetting to shut down notebooks, running automated machine-learning processes that I was later told were meant for "enterprise customers," and other missteps), I have burned through more AWS budget than I would be comfortable spending on an unfunded adventure. And while I understand intellectually how to deploy the models that resulted from all of this, I am still debugging the actual execution of that deployment.

At the very least, this project has turned out to be a very interesting lesson in all the ways machine-learning projects (and the people behind them) can fail. And this time, the failure started with the data itself, or even with the question we decided to ask of it.

I may still get a working solution out of this effort. But in the meantime, I am going to share the dataset I worked with on my GitHub page to give this adventure a more interactive component. If you can get better results, be sure to join us next week to poke fun at me in the live wrap-up of this series. (More details on that at the end.)

Transformer glue

After several iterations of tuning the SqueezeBERT model used in our earlier attempt to train against the headlines, the resulting set consistently hit 66 percent accuracy in testing, slightly below the above-70-percent accuracy suggested earlier.

This included efforts to reduce the size of the steps taken between learning cycles as the model adjusts to the inputs: the "learning rate" hyperparameter, which is used to avoid overfitting or underfitting the model. We reduced the learning rate substantially, because when you have a small amount of data (as we do here) and the learning rate is set too high, the model will essentially make bigger leaps of assumption about the structure and syntax of the dataset. Reducing the rate forces the model to shrink those leaps to baby steps. Our original learning rate was set at 2×10⁻⁵ (2e-5); we lowered it to 1e-5.
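For reference, here is a minimal sketch of what that kind of learning-rate adjustment looks like with Hugging Face's Trainer API. It assumes the public squeezebert/squeezebert-uncased checkpoint and uses a tiny stand-in dataset so the sketch is self-contained; neither is the actual project code or data.

import numpy as np
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Public SqueezeBERT checkpoint; labels: 1 = headline won its A/B test, 0 = lost.
checkpoint = "squeezebert/squeezebert-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in data; the real training set is the Ars headline corpus.
raw = Dataset.from_dict({
    "text": ["Example winning headline", "Example losing headline"],
    "label": [1, 0],
})
tokenized = raw.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    ),
    batched=True,
)

def compute_metrics(eval_pred):
    # Report plain accuracy, the metric discussed throughout the article.
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="headline-model",
    learning_rate=1e-5,               # lowered from 2e-5 to take smaller steps
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,           # reusing the toy set just for the sketch
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())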

We also tested a much larger model that had been pretrained on a large amount of text: DeBERTa (Decoding-enhanced BERT with Disentangled Attention). DeBERTa is a very sophisticated model: 48 transformer layers with 1.5 billion parameters.

DeBERTa is so fancy that it has surpassed humans on natural-language understanding tasks in the SuperGLUE benchmark, the first model to do so.

The resulting deployment package is also quite substantial: 2.9 gigabytes. With all that added machine-learning heft, we got up to 72 percent accuracy. Considering that DeBERTa is supposedly better than a human at detecting meaning in text, this accuracy is, as a famous nuclear power plant operator once said, "not great, not terrible."
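To give a sense of where that bulk comes from, here is a short sketch that loads a publicly available 48-layer, roughly 1.5-billion-parameter DeBERTa checkpoint and counts its weights. The specific checkpoint name is an assumption based on the sizes quoted above, not a confirmed detail of the project.

from transformers import AutoModelForSequenceClassification

# Public 48-layer DeBERTa variant; assumed here based on the layer and
# parameter counts mentioned in the article.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v2-xxlarge", num_labels=2
)

total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total / 1e9:.2f} billion")
print(f"Transformer layers: {model.config.num_hidden_layers}")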

Deployment death spiral

On top of that, the clock kept ticking. I needed to get my own version up and running to test against real data.

An attempt at a local deployment did not go well, particularly from a performance perspective. Without a good GPU available, the PyTorch jobs running the model and the endpoint literally brought my system to a halt.
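A simple guard at the top of a script can prevent exactly this kind of accident, refusing to run heavy PyTorch work on a CPU-only machine. This is a generic sketch, not part of the project's actual code.

import torch

# Refuse to launch a heavy training/inference job on a CPU-only machine,
# which is what ground the i3 laptop described above to a halt.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU found; refusing to run this job locally.")

device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
# model.to(device) would follow once a suitable machine is confirmed.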

So I tried again to deploy in SageMaker. I attempted to run the smaller SqueezeBERT modeling job in SageMaker on my own, but it quickly got more complicated. Training requires PyTorch, the Python machine-learning framework, as well as a collection of other modules. But when I imported the various required Python modules into my SageMaker PyTorch kernel, they didn't cleanly match up, even after updates.

As a result, parts of the code that worked on my local server crashed, and my efforts got bogged down in a tangle of dependencies. It turned out to be a problem with a version of the NumPy library, except that when I forced a reinstall (pip uninstall numpy, pip install numpy --no-cache-dir), the version stayed the same and the error persisted. I finally fixed it, but then I ran into another error that stopped me from running the training job and told me to contact customer service:
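A quick way to see this kind of mismatch is to print the versions the notebook kernel actually imports, since pip can install into a different environment than the one the kernel uses. This is a generic diagnostic sketch; the pinned version shown in the comment is a placeholder, not the actual fix.

# Check which versions the kernel actually sees; mismatches between the kernel
# and the environment pip installs into are a common source of this tangle.
import numpy
import torch
import transformers

print("numpy:", numpy.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)

# In a Jupyter/SageMaker notebook, the %pip magic installs into the active
# kernel's environment. The pinned version is a placeholder:
#   %pip install --force-reinstall --no-cache-dir "numpy==<known-good-version>"
# Restart the kernel afterward so the new version is actually picked up.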

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

To fully complete this effort, I needed to get Amazon to increase my quota, something I hadn't anticipated when I started. It's an easy fix, but troubleshooting the module conflicts ate up most of a day. And I ran out of time while attempting an end run around the problem: deploying the prebuilt model my expert help had given me as a SageMaker endpoint.
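For what it's worth, here is a minimal sketch of what that endpoint deployment looks like with the SageMaker Python SDK. The S3 path, entry script, framework version, and instance type are placeholders chosen for illustration, not the project's actual values, and hosting instances carry their own quotas separate from training quotas.

import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()  # works inside a SageMaker notebook

# model_data and entry_point are placeholders; inference.py would define the
# model_fn/predict_fn hooks that load the fine-tuned model and score headlines.
model = PyTorchModel(
    model_data="s3://<your-bucket>/headline-model/model.tar.gz",
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    entry_point="inference.py",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
print(predictor.endpoint_name)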

This effort is now in overtime. This is where I would have been discussing how the model performed in tests against recent headline pairs, if I had ever gotten the model to that point. If I can finally pull it off, I will post the result in the comments and in a note on my GitHub page.

