Mapping Medium’s Tags
This was originally published Jan 18, 2018 on Hatch, Medium’s internal instance, to explain a hack week project to the company.
When you publish a post on Medium, you’re prompted to add labels to your post that describe what your post is about. These tags are mostly free-form. Authors can write whatever they think describes their post.
Adding some tags to a Medium post.
As far as data goes, these tags are a gold mine. Authors are labelling their posts with a succinct word or phrase that other people understand. We can use (and have used) these tags to inform our algorithms for showing and organizing content.
However, there are big issues with tags that limit their usefulness. One of the issues is that tags are scattered. At this point, authors have defined over 1 million unique tags. Many tags are essentially duplicates of other tags, or are so close that they have the same audience. Here are some examples:
- Global Warming = Climate Change
- Hillary Clinton = Hilary Clinton (common misspelling)
- Poetry = Poem = Poems = Poetry on Medium
- Startup = Entrepreneurship = Startup Lessons = Founder Stories
To a computer, each tag is just a string of text. By default, tags have no meaning and no relatedness to one another. This makes it hard for us to wield them cohesively in algorithmic battles.
The “Climate Change” tag is asked about the “Global Warming” tag
Tags as multi-dimensional characters
Instead of each tag just being represented by a string, what if we could represent it by its qualities and how it relates to other tags? When we talk about people, we don’t compare them by their names, rather we describe and compare them by the many qualities they have. People are “multidimensional” to us. What if tags were too?
We’re going to take the word “multidimensional” literally, and represent each tag by a vector of numbers in a multi-dimensional vector space.
First of all, what is a vector? For the sake of this project’s Python code, a vector is just a fixed-length array of numbers. However, you can interpret this list of numbers in different ways. You could interpret it as a point in space (e.g. (5,3) is the point that is offset 5 along the x-axis and 3 along the y-axis in 2-dimensional space). However, sometimes it’s more useful to interpret it as a vector with direction and magnitude, the ones you might remember from Physics class. It’s confusing. I recommend this short video explanation.
If we could represent tags with vectors then we could compare them by distance or plot them to visualize the clusters they form. Spoiler: we’re going to do just that.
Training the tag vectors
But how do we find the meaning of these thousands of tags in a way that can be represented by vectors of numbers? We do this by training a machine learning model using our tag data.
The training data I used was the tags of 500,000 “reliable” public English Medium posts. I pretended each post’s tag list was a “sentence” (where each tag is a “word”), and fed those into a training algorithm which usually takes real sentences and learns vector representations of words. I used gensim’s word2vec implementation for this, and I specified that these vectors should have 100 dimensions.
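As a rough illustration of the “tag list as sentence” idea, here is a minimal Python sketch. The post records, the `tag_sentences` helper, and the `min_tags` filter are all hypothetical and are not taken from the original project; the commented-out gensim call assumes the gensim 4.x parameter names.

```python
# Hypothetical posts; the real input was the tag lists of about 500,000 posts.
posts = [
    {"id": "a1", "tags": ["Programming", "Paper Review", "Computer Science"]},
    {"id": "b2", "tags": ["Climate Change", "Environment", "Pollution"]},
    {"id": "c3", "tags": ["Poetry"]},
]


def tag_sentences(posts, min_tags=2):
    """Turn each post's tag list into one 'sentence' of tag 'words'."""
    sentences = []
    for post in posts:
        tags = [tag.strip() for tag in post["tags"] if tag.strip()]
        # Assumption: a post with a single tag has no co-occurring tags
        # to learn from, so it is skipped.
        if len(tags) >= min_tags:
            sentences.append(tags)
    return sentences


sentences = tag_sentences(posts)
print(sentences)

# With gensim 4.x installed, training could then look roughly like this
# (older releases use `size` instead of `vector_size`):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=sentences, vector_size=100, min_count=1)
```

Note that a multi-word tag such as “Paper Review” stays a single token here, since each tag plays the role of one word.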
I won’t go into details about how the training algorithm works, but essentially it figures out a tag’s vector values by looking at the tags that are used along with it on posts. You can read more about the algorithm here.
One post’s set of tags. To figure out the vector for “Programming”, the algorithm takes into account that it was used with “Paper Review” and “Computer Science”, among many other examples.
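To make “looking at the tags that are used along with it” a little more concrete, the sketch below lists the (target, context) pairs that one post would contribute. It assumes a context window wide enough to cover the whole tag list, and it leaves out everything else the real algorithm does (sampling, the neural network itself). The `cooccurrence_pairs` helper is hypothetical.

```python
from itertools import permutations


def cooccurrence_pairs(tags):
    """Return every ordered (target, context) pair of tags on one post."""
    return list(permutations(tags, 2))


pairs = cooccurrence_pairs(["Programming", "Paper Review", "Computer Science"])

# The context tags that would influence the vector for "Programming".
contexts_for_programming = [
    context for target, context in pairs if target == "Programming"
]
print(contexts_for_programming)
```

Repeated over many posts, tags that keep appearing with the same neighbors end up with similar vectors.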
Algorithms such as word2vec are said to “embed” entities (like tags or words) in a multi-dimensional vector space, and as such, these kinds of vectors are also known as “embeddings”. If you’re googling for more information about all this, you’ll want to search for “embeddings”.
Examining the vectors
After a few minutes of training, we get the vectors for every tag at our disposal. Let’s check one out. Here’s the vector for “Climate Change”:
The 100-dimensional vector that represents the “Climate Change” tag
Great. Don’t worry, you’re not supposed to understand what these numbers mean. I couldn’t even tell you myself. Unfortunately, it’s not even as simple as saying “the Xth dimension represents Y quality and the value is how much the tag expresses that quality”. Rather, the dimensions work in concert to represent information about the tags.
It’s easier to see what’s going on by comparing vectors. One thing we can do is find tag vectors that are close to each other. Here, we interpret them as vectors with direction and magnitude in order to compute the cosine similarity between them:
The most similar tag vectors to “Climate Change” are Climates, Environmental Issues, Pollution, Environmental Justice, and Public Lands
The most similar to “Science Fiction” are SciFi, Dystopia, Star Trek, Aliens, and Time Travel
The most similar to “Education” are Higher Education, Teachers, Education Reform, Teaching, and Schools
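For readers who want to see the comparison spelled out, here is a small pure-Python sketch of cosine similarity and a nearest-tag lookup. The 3-dimensional vectors are made up for illustration (the real ones have 100 dimensions), and `most_similar` is a hypothetical helper, not gensim’s method of the same name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real trained values.
tag_vectors = {
    "Climate Change": [0.9, 0.1, 0.0],
    "Pollution": [0.8, 0.2, 0.1],
    "Science Fiction": [0.0, 0.9, 0.3],
    "Education": [0.1, 0.2, 0.9],
}


def most_similar(tag, vectors, topn=3):
    """Rank every other tag by cosine similarity to the given tag."""
    query = vectors[tag]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("Climate Change", tag_vectors)
print([name for name, _ in ranked])
```

Cosine similarity ignores vector length and only compares direction, which is why the vectors are treated as having direction and magnitude here rather than as points.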
Combining Vectors
We can also do arithmetic on our vectors to jump around the vector space. Here we average the “Tech” vector with “Education” to land in the vicinity of EdTech tags:
Averaging the “Education” and “Tech” vectors creates a vector which is closest to those of EdTech tags. Computer Science is among the closest and is a different interpretation of combining “Tech” and “Education”.
Now we can try “Fine Art” + “Cities” to get “Graffiti”:
Averaging the “Fine Art” and “Cities” vectors creates a vector which is closest to Graffiti, Street Art, Sculpture, Public Space, Placemaking.
“Programming” + “Gaming” is “Game Development”:
Averaging the “Programming” and “Gaming” vectors creates a vector which is closest to Game Development, Games, Indiedev, Gamedev, and Minecraft.
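The averaging step can be sketched in a few lines of Python. The vectors below are invented 3-dimensional stand-ins chosen so the example works, and `nearest_to_average` is a hypothetical helper. This sketch takes a plain element-wise mean; some libraries scale each vector to unit length before averaging, which can change the ranking slightly.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Made-up toy vectors, not real trained values.
tag_vectors = {
    "Tech": [1.0, 0.0, 0.0],
    "Education": [0.0, 1.0, 0.0],
    "EdTech": [0.7, 0.7, 0.1],
    "Computer Science": [0.9, 0.4, 0.0],
    "Poetry": [0.0, 0.0, 1.0],
}


def nearest_to_average(tags, vectors, topn=2):
    """Average the given tags' vectors and rank the remaining tags against it."""
    query = average([vectors[tag] for tag in tags])
    candidates = [
        (name, cosine_similarity(query, vec))
        for name, vec in vectors.items()
        if name not in tags  # the input tags themselves are excluded
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


closest = nearest_to_average(["Tech", "Education"], tag_vectors)
print([name for name, _ in closest])
```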
We can also solve analogies, like this one, which is essentially “Education” is to “EdTech” as “Agriculture” is to ___:
Averaging the “Agriculture” and “EdTech” vectors along with the negative of “Education” yields a vector which is closest to that of AgTech tags
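The analogy uses the same arithmetic with one vector negated. Below is a minimal sketch under the same caveats as before: the 3-dimensional vectors are invented so that the toy example works, and `solve_analogy` is a hypothetical helper rather than the code used for the figure.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real trained values.
tag_vectors = {
    "Education": [0.0, 1.0, 0.0],
    "EdTech": [1.0, 1.0, 0.0],
    "Agriculture": [0.0, 0.0, 1.0],
    "AgTech": [1.0, 0.0, 1.0],
    "Farming": [0.0, 0.1, 0.9],
    "Teaching": [0.1, 0.9, 0.0],
}


def solve_analogy(a, b, c, vectors, topn=1):
    """Answer 'a is to b as c is to ?' by averaging c, b, and the negative of a."""
    negated_a = [-x for x in vectors[a]]
    terms = [vectors[c], vectors[b], negated_a]
    query = [sum(values) / len(terms) for values in zip(*terms)]
    candidates = [
        (name, cosine_similarity(query, vec))
        for name, vec in vectors.items()
        if name not in (a, b, c)  # the three input tags are excluded
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


answer = solve_analogy("Education", "EdTech", "Agriculture", tag_vectors)
print(answer[0][0])
```

Subtracting “Education” from “EdTech” leaves roughly the “tech” direction, and adding that to “Agriculture” points toward the AgTech region of the space.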
The fact that we can perform this basic arithmetic and reliably get these results indicates that we’ve learned some “linear regularities” of the tags. People have found similar, but more linguistic, regularities in word embeddings obtained by training word2vec on real words and language.
Visualizing the vectors
We can also plot the tag vectors. Now, we switch to interpreting them as points in space. However, since we can’t visualize points in 100-dimensional space, we’ll have to reduce them to two dimensions.
To do so, we don’t just take the first two dimensions of the vectors and call it a day. Rather, we try to preserve some information from all of the dimensions, and keep points which are close in the 100-dimensional space close in the 2d space. There are a myriad of ways to do this “dimensionality reduction”. One that is particularly good for visualization is t-SNE.
The plot of this 2d tag space is large, even if we’ve limited it to only tags which have more frequent usage, so we’ll take a closer look at some smaller regions of it.
The Medium Tag Universe:
A 2d plot of ~5000 tags and the regions we’ll be examining circled in red.
The “Pets” Solar System:
“Pets”, “Dog Training”, and “Animal Rescue” form a cluster along with related tags.
The “Music” Spiral Galaxy with “Podcast” Arm:
A cluster of tags related to all aspects of music, including “Beyonce”, “Jazz”, and “Spotify”. Podcast tags are near.