Episode 2: solving an Aspect Based Sentiment Analysis task - Exploratory analysis
📂 Previous episodes
On the same topic:
- 'The pizza was delicious but the service was terrible': the origins of aspect-based sentiment analysis
- Modeling in Aspect Based Sentiment Analysis
Series: solving an Aspect Based Sentiment Analysis task:
- Episode 1: solving an Aspect Based Sentiment Analysis task - Preliminaries
- Episode 2: solving an Aspect Based Sentiment Analysis task - Exploratory analysis
🌴 Introduction
After a good two weeks off spent lounging by the pool of a 5-star hotel in the Bahamas,
we can finally get back to serious business.
On today's menu: a first exploratory analysis. The goal?
Nothing fancy: computing a few basic statistics on the topics. But I won't spoil everything right away…
All the work below was done in a Jupyter environment.
🚀 Let's go
For this study we only need two libraries: pandas (to load the data and compute statistics) and altair (for data visualization). Let's define the path to the training data.
import pandas as pd
import altair as alt
TRAIN_PATH = "SemEval2017-task4-dev.subtask-BD.english.INPUT.txt"
What does the data look like?
data = pd.read_csv(TRAIN_PATH, sep="\t", header=None)
data.head()

Let's skip quickly over the initial preprocessing, which gives us the following data:
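The preprocessing step is not detailed here. As a minimal sketch, assuming the raw file holds four tab-separated columns in the order id, topic, polarity, text (an assumption about the file layout, not something stated above), the column names used in the rest of this post could be assigned at load time. The two rows below are copied from the sample shown later in the article, and an in-memory buffer stands in for the real file.

```python
import csv
import io

import pandas as pd

# Two raw lines in the ASSUMED layout: id <TAB> topic <TAB> polarity <TAB> text
raw = (
    "640641873808334849\tfoo fighters\tpositive\t"
    "Could not be happier about the fact that I'm seeing the Foo Fighters on Tuesday.\n"
    "633482813401092096\tice cube\tpositive\tIce Cube may be the realest rapper alive\n"
)

# Assumed column order; the names match those used in the rest of the post
COLUMNS = ["tweet_id", "topic", "polarity", "tweet_text"]

data = pd.read_csv(
    io.StringIO(raw),         # stand-in for the real file path
    sep="\t",
    header=None,
    names=COLUMNS,
    quoting=csv.QUOTE_NONE,   # tweets may contain unbalanced quote characters
    dtype={"tweet_id": str},  # keep the long IDs as strings
)
print(data)
```

Reading the IDs as strings is a design choice: they are identifiers, not quantities, and this avoids any accidental numeric conversion.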

Just like when going grocery shopping, we start by writing down a list of questions:
- How many topics are there? Do some of them resemble each other?
- Do they appear explicitly in each tweet?
- Are there any multi-topic tweets?
- How many examples of each class (positive/negative) are there overall? And for each topic?
- Do the tweets share common characteristics, or have anything peculiar about them?
Number and nature of the topics
>>> topics = data["topic"].unique()
>>> len(topics)
100
>>> topics
array(['amy schumer', 'ant-man', 'bad blood', 'bee gees', 'big brother',
'boko haram', 'briana', 'brock lesnar', 'caitlyn jenner',
'calibraska', 'carly fiorina', 'cate blanchett', 'charlie hebdo',
'chris evans', 'christians', 'chuck norris', 'curtis',
'dana white', 'dark souls', 'david bowie', 'david price',
'david wright', 'dean ambrose', 'dunkin', 'dustin johnson',
'ed sheeran', 'eid', 'floyd mayweather', 'foo fighters',
'frank gifford', 'frank ocean', 'gay', 'george harrison',
'george osborne', 'gucci', 'hulk hogan', 'ice cube', 'ira', 'iran',
'iron maiden', 'islam', 'israel', 'janet jackson', 'jason aldean',
'john cena', 'john kasich', 'josh hamilton', 'justin bieber',
'kane', 'kanye west', 'katy perry', 'kendrick', 'kendrick lamar',
'kim kardashian', 'kpop', 'kris bryant', 'lady gaga', 'milan',
'miss usa', 'moto g', 'murray', 'muslims', 'naruto',
'national hot dog day', 'national ice cream day', 'niall', 'nicki',
'nirvana', 'paper towns', 'paul dunne', 'paul mccartney',
'prince george', 'ps4', 'rahul gandhi', 'randy orton',
'real madrid', 'red sox', 'rolling stone', 'rousey', 'ryan braun',
'sam smith', 'saudi arabia', 'scott walker', 'seth rollins',
'sharknado', 'shawn', 'star wars day', 'super eagles', 'the vamps',
'thor', 'tom brady', 'tony blair', 'twilight', 'u2', 'watchman',
'white sox', 'yakub', 'yoga', 'zac brown band', 'zayn'],
dtype=object)
- There are 100 topics
- They fall into several categories:
  - Real or fictional personalities (Amy Schumer, Naruto…)
  - Bands (Bee Gees, Foo Fighters…)
  - Social and religious themes (gay, islam, christians)
  - Unclassifiable ones (Star Wars Day, Moto G, PS4…)
- Two topics share the same subject: islam & muslims
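To go a bit further on the question of topics that resemble each other, here is a small sketch that flags pairs of topics where all the words of one appear in the other. The `shares_all_words` helper is hypothetical (it is not part of the original analysis), and the list is a handful of topics copied from the array above. Note that a purely lexical check like this one cannot detect a semantic pair such as islam & muslims.

```python
from itertools import combinations

# A handful of topics copied from the list above
topics = [
    "islam", "muslims", "kendrick", "kendrick lamar",
    "ira", "iran", "david bowie", "david price",
]


def shares_all_words(a: str, b: str) -> bool:
    """True when every word of one topic also appears in the other."""
    words_a, words_b = set(a.split()), set(b.split())
    return words_a <= words_b or words_b <= words_a


# Compare every unordered pair of topics once
overlapping = [(a, b) for a, b in combinations(topics, 2) if shares_all_words(a, b)]
print(overlapping)
```

On this subset, only the pair kendrick / kendrick lamar is flagged; whether two flagged topics really refer to the same subject still has to be checked by hand.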
Are the topics explicit?
>>> data.apply(lambda row: row.topic in row.tweet_text.lower(), axis=1).sum() / len(data)
0.9997
- Yes: in 99.97% of the tweets, the topic appears in the text (once lowercased). Identifying the topic is therefore not a challenge.
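One caveat with a plain substring test is that a short topic such as 'ira' is also found inside 'iran'. The sketch below compares the substring test with a stricter whole-word test based on a regular expression. The helper names are hypothetical, and the first row deliberately pairs the topic 'ira' with a tweet about Iran (a pairing built for illustration, not taken from the dataset).

```python
import re

import pandas as pd

data = pd.DataFrame(
    {
        # First row: artificial pairing, built only to show the difference
        "topic": ["ira", "iran", "nirvana"],
        "tweet_text": [
            "Iran ranks 1st in liver surgeries, Allah bless the country.",
            "Iran ranks 1st in liver surgeries, Allah bless the country.",
            "i'm listening to nirvana and drinking wine from a mug",
        ],
    }
)


def substring_match(row) -> bool:
    """Same test as in the post: topic found anywhere in the lowercased text."""
    return row.topic in row.tweet_text.lower()


def whole_word_match(row) -> bool:
    """Stricter test: the topic must be delimited by word boundaries."""
    pattern = r"\b" + re.escape(row.topic) + r"\b"
    return re.search(pattern, row.tweet_text.lower()) is not None


data["substring"] = data.apply(substring_match, axis=1)
data["whole_word"] = data.apply(whole_word_match, axis=1)
print(data[["topic", "substring", "whole_word"]])
```

The substring test accepts all three rows, while the whole-word test rejects the artificial 'ira' row.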
Are there any multi-topic tweets?
>>> (data.groupby("tweet_id").topic.count() > 1).sum()
16
- Very few (16 out of roughly 10k tweets)
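To actually look at those tweets rather than just count them, one possible approach is to collect the IDs that carry more than one distinct topic and filter on them. This is a sketch on toy data: the IDs are shortened placeholders, and `nunique` is used instead of `count` so that a tweet repeated with the same topic would not be flagged.

```python
import pandas as pd

# Toy data: the IDs are placeholders, not real tweet IDs
data = pd.DataFrame(
    {
        "tweet_id": ["1", "1", "2", "3"],
        "topic": ["islam", "muslims", "nirvana", "ps4"],
    }
)

# Number of distinct topics attached to each tweet ID
topics_per_tweet = data.groupby("tweet_id").topic.nunique()

# Keep only the IDs linked to more than one topic, then show their rows
multi_topic_ids = topics_per_tweet[topics_per_tweet > 1].index
multi_topic = data[data.tweet_id.isin(multi_topic_ids)].sort_values(["tweet_id", "topic"])
print(multi_topic)
```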
How many examples of each class (positive/negative) are there overall? And for each topic?
a. Overall
aggregated_data = data.groupby("polarity", as_index=False).tweet_id.count()
count_positive = aggregated_data.loc[aggregated_data.polarity == "positive", "tweet_id"].values[0]
count_negative = aggregated_data.loc[aggregated_data.polarity == "negative", "tweet_id"].values[0]
alt.Chart(
    aggregated_data,
    title=f"Negative to positive class ratio: {count_negative / count_positive:.2f}"
).encode(
    x=alt.X("tweet_id:Q", title="Count of records"),
    color=alt.Color("polarity:N", scale=alt.Scale(range=["red", "green"]), title="Polarity", legend=None),
    row=alt.Row("polarity:N", title="Polarity"),
    tooltip=[alt.Tooltip("tweet_id:Q", title="Count"), alt.Tooltip("polarity:N", title="Polarity")]
).mark_bar()
Overall, here is the polarity distribution (interactive figure):
The polarity is imbalanced in favor of the positive class: the number of negative tweets is about 30% of the number of positive tweets.
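A negative-to-positive ratio of about 0.30 can also be read as class shares: 1 / (1 + 0.30) ≈ 77% positive and about 23% negative. The sketch below shows that conversion with pandas on a toy column; the counts (10 positive, 3 negative) are made up and chosen only to reproduce a ratio of 0.30.

```python
import pandas as pd

# Toy polarity column: counts are invented, picked to give a 0.30 ratio
polarity = pd.Series(["positive"] * 10 + ["negative"] * 3, name="polarity")

counts = polarity.value_counts()                # absolute counts per class
shares = polarity.value_counts(normalize=True)  # same, as fractions of the total
ratio = counts["negative"] / counts["positive"]

print(shares.round(2))
print(f"Negative to positive ratio: {ratio:.2f}")
```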
b. Per topic
alt.Chart(
    data.groupby(["topic", "polarity"], as_index=False).tweet_id.count(),
    title=alt.TitleParams(
        f"Tweet count by polarity for the {data.topic.nunique()} topics in the dataset",
        anchor="start"
    )
).encode(
    x=alt.X("topic:N", title="Topic", sort="-y"),
    y=alt.Y("tweet_id:Q", title="Tweet count", stack=True),
    color=alt.Color("polarity:N", scale=alt.Scale(range=["red", "green"]), title="Polarity"),
    tooltip=[alt.Tooltip("tweet_id", title="Count"), "topic:N"]
).mark_bar(
    opacity=0.7
)
Let's look at these results at a finer granularity by displaying the statistics for each topic:
💡 This is an interactive chart: hover over it with the cursor and scroll to see all the results
The positive/negative split varies widely from topic to topic. In particular, we find:
- Topics that are almost unanimously positive (e.g. foo fighters) or negative (e.g. boko haram)
- Balanced topics (e.g. kanye west)
- Topics leaning positive or negative without being completely imbalanced (e.g. seth rollins)
The number of tweets per topic also varies a lot, ranging from 25 to 231. So there are large disparities between topics as well.
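The same per-topic reading can be done in tabular form, which makes it easy to sort topics by how positive they are. Here is a minimal sketch using `pd.crosstab` on toy data: the topic names come from the post, but the counts are invented for the example.

```python
import pandas as pd

# Toy data: topic names from the post, counts invented for illustration
data = pd.DataFrame(
    {
        "topic": ["foo fighters"] * 4 + ["boko haram"] * 3 + ["kanye west"] * 4,
        "polarity": ["positive"] * 4 + ["negative"] * 3 + ["positive", "negative"] * 2,
    }
)

# One row per topic, one column per polarity, cells hold tweet counts
per_topic = pd.crosstab(data.topic, data.polarity)
per_topic["total"] = per_topic["negative"] + per_topic["positive"]
per_topic["positive_share"] = per_topic["positive"] / per_topic["total"]

# Most negative topics first, most positive last
print(per_topic.sort_values("positive_share"))
```

The `total` column gives the topic size, so the same table covers both the polarity split and the disparity in tweet counts.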
Do the tweets share common characteristics, or have anything peculiar about them?
- Let's take a 0.5% sample of the data (about 50 rows). We get the following table:
ID | Topic | Polarity | Text |
---|---|---|---|
637783304431792128 | hulk hogan | negative | Seeing Hulk Hogan is just coming off a racist tirade, endorsing #Trump may not be his wisest move. |
677504063219494916 | saudi arabia | negative | They Still Have Slaves - My 1st &only Thought - When a So-Called Prince from Saudi Arabia - Says ANYTHING about Donald Trump - |
637468707145543680 | jason aldean | positive | Seeing Big Green tractor be performed by Jason Aldean tonight threw it back to 7th grade homeroom with @legendracer16 |
638481421846319105 | muslims | positive | Fort Wayne had 3rd Interfaith Prayers 4 the City. Hindu dancing, Miami Nation drums and chanting, Muslims, Jews, Christians all 2gether as 1 |
632466111682863104 | sam smith | positive | All I know is I'm going to see Sam Smith tomorrow.. Like I honestly do not care about anything else in this world.. |
677932796929662976 | iran | negative | Giving Iran nuclear missile technology will create global warming this may make even more terrorist lose their jobs so its ISIS i guess. |
631095444844716037 | janet jackson | positive | Have you ever been told you look like Janet Jackson? Win the chance to watch Janet from a VIP Suite, October 29th... http://t.co/Id9xsuBQoC |
640641873808334849 | foo fighters | positive | Could not be happier about the fact that I'm seeing the Foo Fighters on Tuesday. |
640993112941137920 | niall | positive | it's going to be an year since my wwa concert on sunday, Niall's birthday )): |
636639645435080704 | nirvana | positive | i'm listening to nirvana and drinking wine from a mug how's your wednesday night going |
663344328677437440 | ira | negative | Tho I respect his decision, I don't agree with it. James McClean may not necessarily be a supporter of IRA. He said the poppys stand for ... |
641026836223275008 | david wright | positive | David Wright really showed up in the 7th inning of the @Mets win today. Had the go-ahead RBI and a crucial run scored. Easy guy to root for. |
678381028629741568 | kendrick lamar | positive | Kendrick Lamar's Black Friday is prolly the nicest thing he's done recently |
635148391878631424 | kane | positive | no pace or creativity,defenders know how to play against kane and lloris has saved us too many times, 3rd week in a row with only 1 CF #THFC |
681616963886514176 | israel | positive | I love Israel and I will always support you, come what may. https://t.co/hGyUOOyhqG |
628688337931472896 | floyd mayweather | positive | It's official Floyd Mayweather vs Andre Berto September 14th MGM Grand, it is what it is I guess #boxing |
681879579129200640 | iran | positive | Iran ranks 1st in liver surgeries, Allah bless the country. |
624283749132468225 | national hot dog day | positive | Today is National Hot Dog Day. I feel that since it is also two-slice Thursday, @VinnysPizza594 should've had a chili cheese hot dog pizza. |
634735567444406272 | rousey | positive | @gma broke the story for @RondaRousey fight. As if Rousey couldn't get cooler, she is bringing MMA mainstream https://t.co/liMDO9XJho |
637610578219892736 | islam | negative | @EchoOfIndia Also, his anger against Hindus are justified but couldn't get why he was so anti Islam..may be he was just fed up of religions |
633414773028159488 | caitlyn jenner | negative | What if we make a celebrity death match between Bruce Jenner vs Caitlyn Jenner??..... #OfficeTalk #Monday #Hollywood |
640702783193247745 | red sox | positive | Best day watching some Baseball in the sun. Go Red Sox! #fenwaypark #boston #redsox @ Fenway Park https://t.co/LLE8g1Yb4T |
681153118592176128 | iran | negative | Either live within the laws of Nigeria or you take a hike to Iran where you have publicly pledged your allegiance to https://t.co/hvi0DBVZia |
630482719814909954 | frank gifford | positive | We will miss you Frank Gifford, NFL Legend & hubby of @KathieLGifford Now walking in Glory & Wonder with God Himself https://t.co/VYQgAPYlPF |
637643688835944448 | foo fighters | positive | It's Saturday!!! Bring on the Foo Fighters tonight!!! https://t.co/4VgTALJRTY |
640821210041835520 | kane | positive | @THNMattLarkin Kane, Karlsson, Getzlaf, Couture available for my 1st pick. Who would you take? I want Forsberg and/or Buff. |
667497271806816256 | curtis | positive | @Nug33ent @RoccYaWorld he told us the summer BEFORE 9th grade. Nice try though Curtis |
630822206897938433 | sam smith | positive | I can't even remotely believe that I'm going to see Sam Smith in person, with my own two eyes, this Friday. |
639910156495622145 | milan | positive | @HoldTheMilan Wow Kovacic and Witsel in January? Can see Milan the Scudetto winners already. |
640950792115982336 | red sox | positive | Blue Jays 3RD Josh Donaldson deserves the AL MVP for sure. And that's coming from a Red Sox fan. That man is fun to watch. |
673891764844015616 | kendrick | positive | Super-happy for Kendrick today but the reality is he'll lose 90% of those nominations in February and then it's back to hating the Grammys. |
676973345695510528 | twilight | positive | Are you using your smart phone or tablet in the late evening? Twilight may be a solution for you! R https://t.co/MRFlABJuDf |
678242021614727168 | milan | positive | Nightlife in Milan, Italy: Aperitif and evening at Fiat Open Lounge - english page - https://t.co/Xg195rDYps For the ones tomorrow sunda... |
640257334124658688 | david price | positive | David Price pitched the #BlueJays past the #Orioles to earn his 100th career win on Saturday. http://t.co/EwbfUxnqdd http://t.co/Q6gHNb9yXl |
639167797436809216 | janet jackson | positive | This the 4th time someone said I look like Janet Jackson , lol |
640729032447803393 | janet jackson | positive | Janet Jackson released more information about her new album "Unbreakable," which will be out on Oct 2, on her... http://t.co/kIMVLeR4oO |
677563786950152192 | star wars day | positive | Tomorrow! Finally! Here are 8 money lessons from the movie Star Wars Day: 8 Financial Lessons for May the Fourth https://t.co/w9vsyEBwXR |
625825394420260864 | dean ambrose | negative | Even if he wasn't a wrestler, I figure Dean Ambrose would still be drunkenly fighting someone every Monday night. |
674038805507080194 | dana white | negative | If he gets the crap beat out of him Saturday, Dana White may die. https://t.co/qUL45Gd5Lx |
640021351122649088 | ed sheeran | positive | Ed Sheeran concert at 7 PM tomorrow. Getting in line at like 9 lol. Need to get close to the stage bruh |
633482813401092096 | ice cube | positive | Ice Cube may be the realest rapper alive |
641429128739201025 | milan | positive | Tomorrow is the day, see you all soon! Event: FH Club x MYMILAN MILAN Luxury Clearance Sales Date: 10th-12th... http://t.co/ox4t9fPpse |
663115995167350785 | briana | negative | This is upsetting! you may not like Briana but spreading these false rumors that could ruin her life is horrible https://t.co/41Ufbrn4ON |
638045448696111105 | ed sheeran | positive | I can't wait to see Ed Sheeran on the 10th!!!!!!! |
634637060230500352 | nirvana | positive | Starting Friday on a great note. Just saw "Montage of Heck." I love the fact that "Nirvana " was my generation. #Nirvana |
636832107776618496 | ps4 | positive | Getting super hype for the tension filled #UntilDawn, PS4's adventure survival horror thriller which is out tomorrow. http://t.co/KxoNueBKQB |
634233332583149568 | sam smith | positive | I don't understand why they scheduled the Sam Smith concert on the 30th omg that's so long from now :-/ |
638313062865469440 | gucci | positive | If anyone can do my hair and makeup for homecoming the 12th that would be Gucci. I'll pay. |
640195601192366080 | carly fiorina | positive | @sunshinesplat Leaning Scott Walker actually. Liked some of Carly Fiorina. Ted Cruz would be my 3rd (albeit over the top personality) |
632383016292106240 | david price | negative | David Price lost a late lead against the Yankees in Friday's 4-3 defeat, allowing three runs on 11 hits over 7 1/3 innings. |
635874987941982208 | paul mccartney | positive | The big announcement tomorrow is not going to be anywhere near as huge as they're letting on. I think Paul McCartney probably makes sense. |
622932931896561664 | paul dunne | positive | Really hope Paul Dunne has a good day tomorrow. Fantastic player and great guy whatever happens tomorrow to lead the open is awesome! |
665465878381727744 | charlie hebdo | negative | @I_amSasquatch @Cairo67Unedited he tweeted that after the attack in Paris in January- the Charlie Hebdo thing |
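A sample like the one above could be drawn with `DataFrame.sample`. The sketch below uses a stand-in frame of 10,000 rows (the post mentions roughly 10k tweets); the `random_state` value is an arbitrary seed, fixed only so that the draw is repeatable, and is not the one used for the table above.

```python
import pandas as pd

# Stand-in frame with 10,000 rows in place of the real dataset
data = pd.DataFrame({"tweet_id": range(10_000)})

# frac=0.005 keeps 0.5% of the rows; the seed is an arbitrary choice
sample = data.sample(frac=0.005, random_state=0)
print(len(sample))
```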
We notice the following in this sample:
- Many URLs
- Encoding of the ampersand (&) and of smileys (:-/). Other smileys show up in the full dataset.
- Hashtags (#<topic>) and mentions (@<twitter_handle>)
- Abbreviations (“R” -> “?”, “2gether” -> “together”)
- Positive polarity despite negatively phrased sentences (“Could not be happier”)
- Questionable labelling: "@EchoOfIndia Also, his anger against Hindus are justified but couldn’t get why he was so anti Islam..may be he was just fed up of religions" is labelled negative towards the topic “islam”, whereas the author of the tweet is wondering why a certain person would be against this religion
🧐 What's next?
I don't know about you, but so far I'm fairly happy with our understanding of the dataset:
- We know we're facing a substantial imbalance, both across topics and in polarity
- In the overwhelming majority of cases, a tweet concerns a single topic
- The tweet text probably needs cleaning: removing URLs, replacing smileys and oddly encoded characters, and perhaps some work on hashtags and mentions
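As a rough idea of what that cleaning could look like, here is a minimal sketch using only the standard library. The `clean_tweet` function, the `[URL]` and `[USER]` placeholder tokens, and the decision to keep the hashtag word while dropping the `#` are all assumptions made for illustration, not choices made in this series. Smileys are left untouched. The example tweet is hypothetical, assembled from fragments seen in the sample.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text: str) -> str:
    """Apply a few simple normalizations to a raw tweet."""
    text = html.unescape(text)              # decode HTML entities such as "&amp;"
    text = URL_RE.sub("[URL]", text)        # replace links with a placeholder
    text = MENTION_RE.sub("[USER]", text)   # replace @handles with a placeholder
    text = text.replace("#", "")            # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


# Hypothetical tweet built for the example
example = "@gma broke the story &amp; more #boxing https://t.co/liMDO9XJho"
print(clean_tweet(example))
```

Entities are decoded first so that the later substitutions work on the text as a reader would see it.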
In the next article, we'll build a first model under a few broad assumptions, with the aim of setting up the end-to-end preprocessing / modeling / evaluation pipeline.
Stay tuned!