
I've collected around half a million unmarked comments from a news page. The site has an anti-foreigner slant, and because of that a relatively high number of the comments contain hate speech.

Do you have an idea how to start creating a dictionary out of the comments? The dictionary should contain words that are connected to hate. With that dictionary I then want to perform a text analysis for hate speech.

Because there is no such dictionary for German, I tried a few things. For instance, I checked how often emotion-related words appear in the comments, but it turned out that they seem to occur about as often as any other words. I also thought about a bag-of-words model and some other approaches, but I'm stuck.

I'm doing the analysis with Python 3 and R.

Kind regards and thanks in advance!

So S

3 Answers


When looking at raw word-occurrence probabilities, you will mostly get stop words and other generally popular words. What you are interested in are words that appear more often in the comments (which you assume to be hate-laden) than in normal usage.

Get a neutral resource (e.g., a German newspaper, the German Wikipedia, maybe Google Ngrams for German). Compute the probability of each word in the neutral source, $P_{neutral}(word)$, and in the comments, $P_{comments}(word)$, and look for words with lift $\frac{P_{comments}(word)}{P_{neutral}(word)} > 1$. These are the words that are disproportionately popular in the comments.
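A minimal sketch of this lift computation in plain Python (the toy token lists stand in for your real tokenized corpora, and the `min_lift` threshold is something you would tune):

```python
from collections import Counter

def word_probabilities(tokens):
    """Relative frequency of each token in a corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def lift(comment_tokens, neutral_tokens, min_lift=1.0):
    """Words whose probability in the comments exceeds that in the neutral corpus."""
    p_comments = word_probabilities(comment_tokens)
    p_neutral = word_probabilities(neutral_tokens)
    scores = {}
    for word, p_c in p_comments.items():
        p_n = p_neutral.get(word)
        if p_n:  # skip words absent from the neutral corpus (lift undefined)
            score = p_c / p_n
            if score > min_lift:
                scores[word] = score
    # highest-lift words first
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

comments = "der hass hass böse der die".split()
neutral = "der die das hass der die".split()
print(lift(comments, neutral))
```

Words that never occur in the neutral corpus get skipped here; with real data you would rather apply smoothing (e.g., add-one counts), since unseen-in-neutral words can be exactly the ones you want.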

As @chi wrote, many repositories can both give you a head start and help you tune the required lift threshold (you might want only words that appear much more often in the comments).

After this phase you might need to do a finer analysis. For example, I guess that some politicians' names will appear more often in the comments. See here for a possible approach.

DaL

You could restate your problem as text classification: "hate" vs. "neutral or compassion". The standard text classification methods then apply. Get yourself a neutral or "compassion" corpus and label its elements as such, then run a classification learner pipeline. Its feature dictionary for the "hate" category will be what you are looking for.

If that does not work out of the box, or you don't have a contrasting corpus, you could try to emulate the classifier and do the selection manually. Run the texts through a vectorizer with German stop words removed; try both TfidfVectorizer and CountVectorizer. Then sort the resulting dictionary by weight in descending order and collect the words manually.

Diego

It is not completely clear whether your dataset has any kind of markup (like 'comment', 'neutral', 'positive'). In my view and experience, to get a reasonably precise dictionary of any kind, you should take human insight as the source and stick with supervised learning algorithms.

If your dataset does contain such information, you may use Dan Levin's approach, which seems quite promising and comprehensive in probabilistic terms.

As an alternative, you may use advanced vector-space representations of words (word2vec) in the following manner:

  • train a word2vec model on a large German text corpus. That text needn't be of any specific character, though you would benefit from having all of your comment words present in that dataset.
  • then, for particular words in your hate comments, find similar words using the word2vec model. That way, provided your word2vec corpus is representative enough, you may end up with a dictionary even richer than the one people tend to use in comments.
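The second step is just a nearest-neighbour lookup in the embedding space (with gensim you would call `most_similar` on a trained model). A sketch of the idea with hand-made toy vectors in place of real word2vec embeddings:

```python
import numpy as np

# Toy 3-dimensional word vectors standing in for a trained word2vec model;
# real embeddings would have hundreds of dimensions and a large vocabulary.
vectors = {
    "hass":  np.array([0.9, 0.1, 0.0]),
    "hetze": np.array([0.8, 0.2, 0.1]),
    "wut":   np.array([0.7, 0.3, 0.0]),
    "blume": np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to the seed word."""
    v = vectors[word]
    sims = {}
    for other, u in vectors.items():
        if other == word:
            continue
        sims[other] = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
    return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("hass"))
```

Starting from a handful of seed words from the comments and expanding each via its nearest neighbours is how the dictionary grows beyond the vocabulary the commenters actually used.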

Anyway, keep us posted on achieved results :)

chewpakabra