Text mining is the application of natural language processing techniques and analytical methods to text data in order to derive relevant information. This post shows how to collect data from Twitter with the Twitter Streaming API, which lets us capture tweets in real time through a filter.
In this case study I use #WorldCup data to compare the popularity of four of the most popular football players during the 2018 FIFA World Cup in Russia (Cristiano Ronaldo, Neymar, Lionel Messi and Luis Suarez) and to retrieve links to news resources such as tweets, websites and videos, for example on YouTube. In Part 1, I explain how to connect to the Twitter Streaming API and how to get the data. In Part 2, I explain and show how to structure the data for analysis. Finally, in Part 3, I explain how to filter the data and extract links from tweets.
— Part 1: Getting Started with the Twitter API —
Understanding the Twitter API
- Create a Twitter account if you do not already have one.
- Go to https://apps.twitter.com/ and log in with your Twitter credentials.
- Click “Create New App”.
- Fill out the form, agree to the terms, and click “Create your Twitter application”.
- On the next page, click the “API keys” tab, and copy your “API key” and “API secret”.
- Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”.
Create a Twitter Streaming API file to show the results of real-time filtered streaming
Next, create a file called twitter_streaming.py and copy the code below into it. Make sure to enter your credentials into access_token, access_token_secret, consumer_key, and consumer_secret.
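A minimal sketch of what twitter_streaming.py could look like. It assumes the Tweepy library in its 3.x line, which provides StreamListener; the post does not name the client library, and later Tweepy releases changed this interface. The track term "#WorldCup" and the helper name format_tweet are also my assumptions.

```python
import sys

# Enter the four credentials copied from the Twitter application page.
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""

# Assumption: the post only mentions "#WorldCup data", so this filter is a guess.
TRACK_TERMS = ["#WorldCup"]


def format_tweet(raw_tweet):
    """Return the raw JSON tweet as exactly one line of output."""
    return raw_tweet.strip() + "\n"


def main():
    if not all([access_token, access_token_secret, consumer_key, consumer_secret]):
        print("Enter your Twitter credentials first.", file=sys.stderr)
        return

    # Imported here so the credential check above runs even without Tweepy.
    # Assumption: Tweepy 3.x, where StreamListener is available.
    from tweepy import OAuthHandler, Stream
    from tweepy.streaming import StreamListener

    class StdOutListener(StreamListener):
        def on_data(self, data):
            # Print every received tweet to standard output.
            sys.stdout.write(format_tweet(data))
            return True

        def on_error(self, status):
            print(status, file=sys.stderr)
            return False  # stop streaming on an API error

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, StdOutListener())
    stream.filter(track=TRACK_TERMS)


if __name__ == "__main__":
    main()
```

Errors go to standard error so that standard output contains only tweets, one JSON object per line.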
Running the script above prints each captured tweet as it arrives.
The output is in JSON format, and a single tweet contains more than 100 keys. I streamed for 2 hours to collect data from Twitter.
Capturing and Reading the Data
To capture the data for the analysis, I ran a command that stores the streamed output in a text file.
The retrieved data is stored in worldcup2018_twitter_data.txt in JSON format. Each tweet carries a lot of additional information besides its text, for example:
"text":"Lionel Messi, Marcus Rojo\u2019s goals as Argentina best Nigeria
— Part 2: Structured data and analysis —
Import the necessary libraries
Read the captured data from the text file
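As a rough sketch of this step, the file can be read line by line and each line parsed as one JSON tweet. The function names parse_tweets and load_tweets are mine, and skipping blank or unparsable lines is an assumption about how the capture file looks.

```python
import json


def parse_tweets(lines):
    """Return a list of tweet dicts, one per valid JSON line."""
    tweets_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between tweets
        try:
            tweets_data.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip lines cut off when the capture was stopped
    return tweets_data


def load_tweets(path):
    """Read a capture file that holds one JSON tweet per line."""
    with open(path, "r", encoding="utf-8") as tweets_file:
        return parse_tweets(tweets_file)
```

With the capture file from this post, `tweets_data = load_tweets("data/worldcup2018_twitter_data.txt")` followed by `print(len(tweets_data))` gives the number of tweets read.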
Show the total number of captured tweets
Use the print() and len() functions to count all the tweets that were captured.
In data/worldcup2018_twitter_data.txt we captured a total of 6008 tweets from Twitter.
Map the captured tweets from the JSON-formatted text file into a data frame
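One way to do this mapping with pandas is sketched below. The choice of the columns text, lang and country is my assumption, based on the fields used later in this post; the sample tweets are made up for illustration.

```python
import pandas as pd


def tweets_to_frame(tweets_data):
    """Build a DataFrame with one row per tweet and three columns."""
    rows = []
    for tweet in tweets_data:
        place = tweet.get("place") or {}  # "place" is null for most tweets
        rows.append({
            "text": tweet.get("text"),
            "lang": tweet.get("lang"),
            "country": place.get("country"),
        })
    return pd.DataFrame(rows, columns=["text", "lang", "country"])


# Hypothetical sample standing in for the parsed capture file.
sample = [
    {"text": "Messi scores!", "lang": "en", "place": {"country": "Argentina"}},
    {"text": "Gol de Neymar", "lang": "pt", "place": None},
]
tweets = tweets_to_frame(sample)
print(tweets)
```

Using `.get()` keeps the function from failing on stream messages that are not tweets and therefore lack these keys.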
This shows the top 5 languages of the captured tweets: most are in English (en), with Portuguese (pt) second and Spanish (es) third, followed by French (fr) and Japanese (ja).
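The language ranking described above can be reproduced with the pandas value_counts() method on the lang column. This is a small sketch on made-up data, not the real capture.

```python
import pandas as pd

# Hypothetical sample standing in for the captured tweets.
tweets = pd.DataFrame({"lang": ["en", "pt", "en", "es", "en", "pt", "fr", "ja"]})

# value_counts() sorts the languages from most to least frequent.
tweets_by_lang = tweets["lang"].value_counts()
print(tweets_by_lang.head(5))
```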
Drawing the Graph
To draw the graphs we use the Matplotlib library, which offers many kinds of charts. As a simple implementation, I use a bar chart to show the counts from above and find the top 10 languages among the 6008 captured tweets.
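A bar chart of this kind could be drawn roughly as follows. The sample data, labels and colour are my assumptions, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the captured tweets.
langs = ["en"] * 5 + ["pt"] * 3 + ["es"] * 2 + ["fr", "ja"]
tweets = pd.DataFrame({"lang": langs})

# Keep at most the 10 most frequent languages.
top_langs = tweets["lang"].value_counts()[:10]

fig, ax = plt.subplots()
ax.bar(list(top_langs.index), list(top_langs.values), color="red")
ax.set_xlabel("Languages")
ax.set_ylabel("Number of tweets")
ax.set_title("Top 10 languages")
# Call fig.savefig("top_languages.png") or plt.show() to see the result.
```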
Showing the same result in a different kind of chart: a pie chart
Drawing a graph of the 10 countries that tweeted about ‘#WorldCup2’
— Part 3: Text Mining and Extracting Links —
Our main goals in these text mining tasks are to compare the popularity of Cristiano Ronaldo, Luis Suarez, Neymar and Lionel Messi, and to retrieve links to related news resources. We will do this in 3 steps:
- We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
- Target tweets that have “WorldCup” or “Fifa” keywords.
- Extract links from the relevant tweets.
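The third step relies on pulling URLs out of the tweet text. A minimal sketch with a regular expression is shown here. The pattern and the function name extract_links are my assumptions, and the pattern is deliberately simple: it takes everything up to the next whitespace, so punctuation directly after a link is included.

```python
import re

# Matches http:// or https:// followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(text):
    """Return every http(s) URL found in the tweet text, in order."""
    return URL_PATTERN.findall(text or "")


print(extract_links("Messi goal https://t.co/abc123 via http://example.com/x"))
```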
Define a function that converts text containing capital or mixed-case letters to lower case and uses a search function to find a word in the text column
Show the popularity ranking of the football players from above
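The two steps above could look roughly like this. The function name word_in_text, the use of surnames as search terms and the sample tweets are all assumptions for illustration.

```python
import re
import pandas as pd


def word_in_text(word, text):
    """Case-insensitive search for `word` inside `text`."""
    if not isinstance(text, str):
        return False  # missing text cannot match
    return re.search(re.escape(word.lower()), text.lower()) is not None


# Assumption: each player is matched by surname only.
PLAYERS = ["Ronaldo", "Neymar", "Messi", "Suarez"]

# Hypothetical sample standing in for the captured tweets.
tweets = pd.DataFrame({"text": [
    "Ronaldo hat-trick against Spain!",
    "MESSI and neymar both on target",
    "What a free kick by Messi",
    None,
]})

# Add one True/False tag column per player.
for player in PLAYERS:
    tweets[player] = tweets["text"].apply(
        lambda text, p=player: word_in_text(p, text)
    )

# Count the tagged tweets per player, most mentioned first.
ranking = tweets[PLAYERS].sum().sort_values(ascending=False)
print(ranking)
```

A plain substring match like this misses accented spellings such as "Suárez", so the real counts for that player may be too low.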
Specifying Relevant Tweet text
In this part I specify keywords in order to match the football players who were mentioned during the 2018 World Cup with the keywords ‘FiFa2018’, ‘World Cup’ or ‘WorldCup’.
Map the Fifa2018 and WorldCup keywords that appear in the text to a relevant column, which takes the value True if the tweet contains any of these keywords and False otherwise.
Create the relevant column by applying the keyword check to the text of each tweet
Show the matching keyword and football player name values that appear in the captured tweets
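Putting the relevant column and the player counts together, a sketch on made-up tweets might look like this. The keywords come from the text above; the helper word_in_text, the surname search terms and the sample data are assumptions.

```python
import re
import pandas as pd

KEYWORDS = ["fifa2018", "world cup", "worldcup"]
PLAYERS = ["Ronaldo", "Neymar", "Messi", "Suarez"]  # assumed search terms


def word_in_text(word, text):
    """Case-insensitive search for `word` inside `text`."""
    if not isinstance(text, str):
        return False
    return re.search(re.escape(word.lower()), text.lower()) is not None


# Hypothetical sample standing in for the captured tweets.
tweets = pd.DataFrame({"text": [
    "Ronaldo is unstoppable at the #WorldCup",
    "Messi training video",
    "FIFA2018: Neymar and Messi through to the next round",
    "World Cup fever everywhere",
]})

# True when the tweet contains at least one of the keywords.
tweets["relevant"] = tweets["text"].apply(
    lambda text: any(word_in_text(keyword, text) for keyword in KEYWORDS)
)

# Count player mentions among the relevant tweets only.
relevant = tweets[tweets["relevant"]]
mentions = {
    player: int(relevant["text"].apply(
        lambda text, p=player: word_in_text(p, text)).sum())
    for player in PLAYERS
}
print(mentions)
```

Restricting the counts to relevant tweets drops mentions that have nothing to do with the tournament, such as the second sample tweet.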