Data Science, General, Uncategorized

Your First Text Mining Project with Python in 3 steps

Every day, we generate huge amounts of text online, creating vast quantities of data about what is happening in the world and what people think. All of this text data is an invaluable resource that can be mined in order to generate meaningful business insights for analysts and organizations. However, analyzing all of this content isn’t easy, since converting text produced by people into structured information to analyze with a machine is a complex task. In recent years though, Natural Language Processing and Text Mining has become a lot more accessible for data scientists, analysts, and developers alike.

There is a massive amount of resources, code libraries, services, and APIs out there which can all help you embark on your first NLP project. For this how-to post, we thought we’d put together a three-step, end-to-end guide to your first introductory NLP project. We’ll start from scratch by showing you how to build a corpus of language data and how to analyze this text, and then we’ll finish by visualizing the results.

We’ve split this post into 3 steps. Each of these steps will do two things: show a core task that will get you familiar with NLP basics, and also introduce you to some common APIs and code libraries for each of the tasks. The tasks we’ve selected are:

  1. Building a corpus — using Tweepy to gather sample text data from Twitter’s API.
  2. Analyzing text — analyzing the sentiment of a piece of text with our own SDK.
  3. Visualizing results — how to use Pandas and matplotlib to see the results of your work.

Please note: This guide is aimed at developers who are new to NLP and anyone with a basic knowledge of how to run a script in Python. If you don’t want to write code, take a look at the blog posts we’ve put together on how to use our RapidMiner extension or our Google Sheets Add-on to analyze text.

Step 1. Build a Corpus

You can build your corpus from anywhere — maybe you have a large collection of emails you want to analyze, a collection of customer feedback in NPS surveys that you want to dive into, or maybe you want to focus on the voice of your customers online. There are lots of options open to you, but for the purpose of this post we’re going to use Twitter as our focus for building a corpus. Twitter is a very useful source of textual content: it’s easily accessible, it’s public, and it offers an insight into a huge volume of text that contains public opinion.

Accessing the Twitter Search API using Python is pretty easy. There are lots of libraries available, but our favourite option is Tweepy. In this step, we’re going to use Tweepy to ask the Twitter API for 500 of the most recent Tweets that contain our search term, and then we’ll write the Tweets to a text file, with each Tweet on its own line. This will make it easy for us to analyze each Tweet separately in the next step.

You can install Tweepy using pip:

pip install tweepy

Once completed, open a Python shell to double-check that it’s been installed correctly:

>>> import tweepy

First, we need to get permission from Twitter to gather Tweets from the Search API, so you need to sign up as a developer to get your consumer keys and access tokens, which should take you three or four minutes. Next, you need to build your search query by adding your search term to the q = ‘’ field. You will also need to add some further parameters like the language, the amount of results you want returned, and the time period to search in. You can get very specific about what you want to search for on Twitter; to make a more complicated query, take a look at the list of operators you can use the API to search with in the Search API introduction.

Fill your credentials and your query into this script:

## import the libraries
import tweepy, codecs

## fill in your Twitter credentials 
consumer_key = ‘your consumer key here’
consumer_secret = ‘your consumer secret key here’
access_token = ‘your access token here’
access_token_secret = ‘your access token secret here’

## let Tweepy set up an instance of the REST API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## fill in your search query and store your results in a variable
results = = "your search term here", lang = "en", result_type = "recent", count = 1000)

## use the codecs library to write the text of the Tweets to a .txt file
file ="your text file name here.txt", "w", "utf-8")
for result in results:

You can see in the script that we are writing result.text to a .txt file and not simply the result, which is what the API is returning to us. APIs that return language data from social media or online journalism sites usually return lots of metadata along with your results. To do this, they format their output in JSON, which is easy for machines to read.

For example, in the script above, every “result” is its own JSON object, with “text” being just one field — the one that contains the Tweet text. Other fields in the JSON file contain metadata like the location or timestamp of the Tweet, which you can extract for a more detailed analysis.

To access the rest of the metadata, we’d need to write to a JSON file, but for this project we’re just going to analyze the text of people’s Tweets. So in this case, a .txt file is fine, and our script will just forget the rest of the metadata once it finishes. If you want to take a look at the full JSON results, print everything the API returns to you instead:

This is also why we used codecs module, to avoid any formatting issues when the script reads the JSON results and writes utf-8 text.

Step 2. Analyze Sentiment

So once we’ve collected the text of the Tweets that you want to analyze, we can use more advanced NLP tools to start extracting information from it. Sentiment analysis is a great example of this, since it tells us whether people were expressing positive, negative, or neutral sentiment in the text that we have.

For sentiment analysis, we’re going to use our own AYLIEN Text API. Just like with the Twitter Search API, you’ll need to sign up for the free plan to grab your API key (don’t worry — free means free permanently. There’s no credit card required, and we don’t harass you with promotional stuff!). This plan gives you 1,000 calls to the API per month free of charge.

Again, you can install using pip:

pip install aylien-apiclient

Then make sure the SDK has installed correctly from your Python shell:

>>>from aylienapiclient import textapi

Once you’ve got your App key and Application ID, insert them into the code below to get started with your first call to the API from the Python shell (we also have extensive documentation in 7 popular languages). Our API lets you make your first call to the API with just four lines of code:

>>>from aylienapiclient import textapi
>>>client = (‘Your_app_ID’, ‘Your_application_key’)
>>>sentiment = client.Sentiment({'text': 'enter some of your own text here'})

This will return JSON results to you with metadata, just like our results from the Twitter API.

So now we need to analyze our corpus from step 1. To do this, we need to analyze every Tweet separately. The script below uses the io module to open up a new .csv file and write the column headers “Tweet” and “Sentiment”, and then it opens and reads the .txt file containing our Tweets. Then, for each Tweet in the .txt file it sends the text to the AYLIEN API, extracts the sentiment prediction from the JSON that the AYLIEN API returns, and writes this to the .csv file beside the Tweet itself.

This will give us a .csv file with two columns — the text of a Tweet and the sentiment of the Tweet, as predicted by the AYLIEN API. We can look through this file to verify the results, and also visualize our results to see some metrics on how people felt about whatever our search query was.

from aylienapiclient import textapi
import csv, io

## Initialize a new client of AYLIEN Text API
client = textapi.Client("your_app_ID", "your_app_key")

with'Trump_Tweets.csv', 'w', encoding='utf8', newline='') as csvfile:
	csv_writer = csv.writer(csvfile)
	csv_writer("Tweet", " Sentiment")
	with"Trump.txt", 'r', encoding='utf8') as f:
	    for tweet in f.readlines():
	    	## Remove extra spaces or newlines around the text
	    	tweet = tweet.strip()

	    	## Reject tweets which are empty so you don’t waste your API credits
	    	if len(tweet) == 0:

	    	## Make call to AYLIEN Text API
	    	sentiment = client.Sentiment({'text': tweet})

	    	## Write the sentiment result into csv file
	    	csv_writer.writerow([sentiment['text'], sentiment['polarity']])

You might notice on the final line of the script that when the script goes to write the Tweet text to the file, we’re actually writing the Tweet as it is returned by the AYLIEN API, rather than the Tweet from the .txt file. They are both identical pieces of text, but we’ve chosen to write the text from the API just to make sure we’re reading the exact text that the API analyzed. This is just to make it clearer if we’ve made an error somehow.

Step 3. Visualize your Results

So far we’ve used an API to gather text from Twitter, and used our Text Analysis API to analyze whether people were speaking positively or negatively in their Tweet. At this point, you have a couple of options with what you do with the results. You can feed this structured information about sentiment into whatever solution you’re building, which could be anything from a simple social listening app or a even an automated report on the public reaction to a campaign. You could also use the data to build informative visualizations, which is what we’ll do in this final step.

For this step, we’re going to use matplotlib to visualize our data and Pandas to read the .csv file, two Python libraries that are easy to get up and running. You’ll be able to create a visualization from the command line or save it as a .png file.

Install both using pip:

pip install matplotlib
pip install pandas

The script below opens up our .csv file, and then uses Pandas to read the column titled “Sentiment”. It uses Counter to count how many times each sentiment appears, and then matplotlib plots Counter’s results to a color-coded pie chart (you’ll need to enter your search query to the “yourtext” variable for presentation reasons).

## import the libraries
import matplotlib.pyplot as plt 
import pandas as pd
from collections import Counter
import csv 

## open up your csv file with the sentiment results
with open('your_csv_file_from_step_3', 'r', encoding = 'utf8') as csvfile:
	## use Pandas to read the “Sentiment” column,
df = pd.read_csv(csvfile)
	sent = df["Sentiment"]

## use Counter to count how many times each sentiment appears
## and save each as a variable
	counter = Counter(sent)
	positive = counter['positive']
	negative = counter['negative']
	neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for “sizes”
labels = 'Positive', 'Negative', 'Neutral'
sizes = [positive, negative, neutral]
colors = ['green', 'red', 'grey']
yourtext = "Your Search Query from Step 2"

## use matplotlib to plot the chart
plt.pie(sizes, labels = labels, colors = colors, shadow = True, startangle = 90)
plt.title("Sentiment of 200 Tweets about "+yourtext)

If you want to save your chart to a .png file instead of just showing it, replace on the last line with savefig(‘your chart name.png’). Below is the visualization we ended up with (we searched “Trump” in step 1).

Screenshot (261)

If you run into any issues with these scripts, big or small, please leave a comment below and we’ll look into it. We always try to anticipate any problems our own users might run into, so be sure to let us know!

That concludes our introductory Text Mining project with Python. We hope it gets you up and running with the libraries and APIs, and that it gives you some ideas about subjects that would interest you. With the world producing content on such a large scale, the only obstacle holding you back from an interesting project is your own imagination!

Happy coding!



Will Gannon

Content Marketing Intern @ AYLIEN A Classics graduate from UCD, Will is on our Content Marketing Team here at AYLIEN. Before joining us, Will worked in research before completing a Master’s in Digital Humanities at Trinity College, where he used NLP methods to index where Latin terms appear in English Literature.