
I would like to understand how to build a machine learning algorithm from scratch, using my own model based on boolean features, for example the number of words in a text, the number of punctuation marks, the number of capital letters, and so on, to determine whether a text is formal or informal. For instance, I have:

Text
there is a new major in this town
WTF?!?
you're a great person. Really glad to have met you
I don't know what to say
BYE BYE BABY

I created some rules to assign a label to this (small) training dataset, but I would need to understand how to apply these rules to a new (test) dataset:

  • if there is an upper-case word, then I;
  • if there is a short expression, like don't, 'm, 's, ..., then I;
  • if there are two punctuation symbols next to each other, then I;
  • if a word is in a list of extra words, then I;
  • otherwise F.
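The rules above could be sketched roughly as follows (a minimal sketch: the regular expressions and the extra-word list are illustrative assumptions, not a definitive implementation):

```python
import re

# hypothetical extra-word list; adjust to your data
LIST_EXTRA = {"u", "hey"}

def label(text):
    """Return 'I' (informal) if any rule fires, else 'F' (formal)."""
    words = re.findall(r"[A-Za-z']+", text)
    if any(w.isupper() and len(w) > 1 for w in words):   # upper-case word (ignore bare "I")
        return "I"
    if re.search(r"\w'(t|s|m|re|ve|ll|d)\b", text):      # short expression like don't, 'm, 's
        return "I"
    if re.search(r"[!?.,;:]{2}", text):                  # two consecutive punctuation symbols
        return "I"
    if any(w.lower() in LIST_EXTRA for w in words):      # word from the extra list
        return "I"
    return "F"

for t in ["FREEDOM!!! I don't need to go to school anymore",
          "What are u thinking?",
          "Hey men!",
          "I am glad to hear that."]:
    print(t, "->", label(t))
```

Applied to the four test sentences this yields I, I, I, F, matching the expected output below.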

Suppose that I have a dataframe to test, to which I want to assign these labels (I or F):

FREEDOM!!! I don't need to go to school anymore
What are u thinking?
Hey men!
I am glad to hear that. 

How could I apply my model to this new dataset, adding the labels?

Test                                                  Output
FREEDOM!!! I don't need to go to school anymore       I
What are u thinking?                                  I
Hey men!                                              I
I am glad to hear that.                               F

Update after mnm's comment:

Would the following be considered a machine learning problem?

import pandas as pd
import numpy as np

data = {"ID": [1, 2, 3, 4],
        "Text": ["FREEDOM!!! I don't need to go to school anymore",
                 "What are u thinking?",
                 "Hey men!",
                 "I am glad to hear that."]}
df = pd.DataFrame(data)

Here there should be the modelling part:

df['upper'] = np.where(df['Text'].str.contains(r"\b[A-Z]{2,}\b"), 'I', '')    # if there is an upper-case word then "I"
df['short_exp'] = np.where(df['Text'].str.contains(r"\w'\w"), 'I', '')        # if there is a short expression then "I"
df['two_cons'] = np.where(df['Text'].str.contains(r"[!?.,;:]{2}"), 'I', '')   # if there are two consecutive symbols then "I"

list_extra = ['u', 'hey']
pattern = r"\b(" + "|".join(list_extra) + r")\b"
df['extra'] = np.where(df['Text'].str.lower().str.contains(pattern), 'I', '')  # if the row contains a word from list_extra then "I"

Append the columns to the original dataframe:

df_new = df.copy()
df_new['upper'] = df['upper']
df_new['short_exp'] = df['short_exp']

and similarly for the others.

What is not clear to me, however, is the last part, the one based on the conditions. How can I predict the labels for the new texts?

LdM

1 Answer


What you are proposing is a heuristic method, because you define the rules manually in advance. From a Machine Learning (ML) point of view the "training" is the part where you observe some data and decide which rules to apply, and the "testing" is when you run a program which applies these rules to obtain a predicted label. As you correctly understood, the testing part should be applied to a test set made of unseen instances. The instances in the test set should also be manually labelled (preferably before performing the testing in order to avoid any bias), so that you can evaluate your method (i.e. calculate the performance).
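For instance, the evaluation step might be sketched like this (the labels here are hypothetical, purely to illustrate comparing predictions against manual annotations):

```python
# Sketch of evaluating a rule-based labeller on a manually annotated test set.
# `predicted` would come from applying the rules; `gold` from manual labelling.
predicted = ["I", "I", "I", "F"]
gold      = ["I", "I", "F", "F"]  # hypothetical manual labels

correct = sum(p == g for p, g in zip(predicted, gold))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 3 of 4 correct -> 0.75
```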

Technically you're not using any ML approach here, since there is no part where you automatically train a model. However, heuristics can be useful; in particular, they are sometimes used as a baseline to compare ML models against.


[addition following comment]

I think most common pre-processing approaches require converting the text to lower case, but a word, taken in a different context, can have a different weight.

This is true for a lot of tasks in NLP (Natural Language Processing) but not all of them. For example for tasks related to capturing an author's writing style (stylometry) one wouldn't usually preprocess text this way. The choice of the representation of the text as features depends on the task so the choice is part of the design, there's no universal method.

how can one train a model which can 'learn' to consider upper-case words and punctuation important?

In traditional ML (i.e. statistical ML, as opposed to Deep Learning), this question relates to feature engineering, i.e. finding the best way to represent an instance (with features) for the task at hand. If you think specific features make sense for your task, you simply add them: for instance, a boolean feature which is true if the instance contains at least one upper-case word, a numeric feature which represents the number of punctuation signs in the instance, etc.
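As a sketch of such feature engineering with pandas (the column names and regular expressions are illustrative assumptions, not the only reasonable choices):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["FREEDOM!!! I don't need to go to school anymore",
                            "What are u thinking?"]})

# Boolean feature: contains at least one all-uppercase word (two letters or more)
df["has_upper_word"] = df["Text"].str.contains(r"\b[A-Z]{2,}\b")

# Numeric feature: number of punctuation signs
df["n_punct"] = df["Text"].str.count(r"[!?.,;:]")

print(df)
```

These columns can then be fed to any standard classifier alongside (or instead of) bag-of-words features.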

Recent ML packages offer standard ways to represent text instances as features, which is often very convenient, but it's important to keep in mind that it's not the only way. Additionally, nowadays Deep Learning methods offer ways to bypass feature engineering, so there's a bit of a tendency to forget about it; imho it remains an important part of the design, if only for understanding how the model works.

Erwan