I currently have a DataFrame with a shape of (16280, 13). I want to assign values to specific rows in a single column. I was originally doing so with:

for idx, row in enumerate(df.to_dict('records')):
    instances = row['instances']
    labels = row['labels'].split('|')

    for instance in instances:
        if instance not in relevant_labels:
            labels = ['O' if instance in l else l for l in labels]

        df.iloc[idx]['labels'] = '|'.join(labels)

But this kept raising the SettingWithCopyWarning because of the last line. I tried changing it to df.loc[idx, 'labels'] = '|'.join(labels), which no longer raises the warning but caused errors in later parts of my code.

I noticed that the sizes of the DataFrames were (16280, 13) when using iloc and (16751, 13) when using loc.

How can I prevent the warning from printing and get the same functionality as using iloc?

Sean

2 Answers

There are several things we can improve here.

First, avoid looping over a dataframe as far as possible and use the tools pandas provides instead. If a loop is unavoidable, iterating over the rows is better done with the .iterrows() method than with .to_dict(). Keep in mind that when using iterrows, you should not modify the dataframe you are iterating over.
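
As a rough sketch of the loop-free assignment, the new values can be built in a plain Python list and written back with a single column assignment, which sidesteps both chained indexing and the loc/iloc question. The sample data, relevant_labels and the helper name mask_labels are made up for illustration; only the masking rule is taken from the loop in the question.

```python
import pandas as pd

# Hypothetical sample data; the real frame has 13 columns.
df = pd.DataFrame(
    {"instances": [["cat", "dog"], ["car"]],
     "labels": ["B-cat|B-dog|O", "B-car|O"]},
    index=[10, 20],  # deliberately not 0..n-1
)
relevant_labels = {"cat"}  # assumed set of instances to keep


def mask_labels(instances, labels):
    """Replace every label that mentions a non-relevant instance with 'O'."""
    parts = labels.split("|")
    for instance in instances:
        if instance not in relevant_labels:
            parts = ["O" if instance in part else part for part in parts]
    return "|".join(parts)


# Build all new values first, then assign the whole column in one step.
df["labels"] = [mask_labels(i, l) for i, l in zip(df["instances"], df["labels"])]
print(df["labels"].tolist())  # ['B-cat|O|O', 'O|O']
```

Because the column is replaced as a whole, no row is ever addressed by label or by position, so the index can have any values.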

Next, the use of iloc and loc. loc selects by key name (like a dictionary), whereas iloc selects by integer position (like an array). Here idx is a position, not the name of a key, so df.loc[idx, 'labels'] will misbehave whenever a row's key differs from its position. You might be tempted to combine the two as df.iloc[idx, :].loc['labels'], but that is still chained indexing and triggers the same warning.

To illustrate the difference between loc and iloc:

import pandas as pd

df_example = pd.DataFrame({"a": [1, 2, 3, 4],
                           "b": ['a', 'b', 'a', 'b']},
                          index=[0, 1, 3, 5])

# 0 is both the first key and position 0: loc and iloc give the same row
print(df_example.loc[0] == df_example.iloc[0])
# 1 is both the second key and position 1: loc and iloc give the same row
print(df_example.loc[1] == df_example.iloc[1])
try:
    # 2 is not a key, so loc raises a KeyError
    print(df_example.loc[2] == df_example.iloc[2])
except KeyError:
    pass
# 3 is the third key but position 3 is the fourth row: loc and iloc give different rows
print(df_example.loc[3] == df_example.iloc[3])
try:
    # 5 is the last key but there is no position 5, so iloc raises an IndexError
    print(df_example.loc[5] == df_example.iloc[5])
except IndexError:
    pass

Remember that chained indexing can return a copy of your data instead of a view (see the pandas documentation on returning a view versus a copy). That's why both df.iloc[idx]['labels'] and df.iloc[idx, :].loc['labels'] trigger the warning. If labels is your i-th column (counting from 0), df.iloc[idx, i] is a single indexing call and won't trigger the warning.

Zelemist
  • Thanks for the answer. However, I'm still getting the warning. :( Also, I was using `iterrows` but switched to `to_dict('records')` because I heard the latter is much more efficient than the former. Is it still recommended to use `iterrows`? – Sean Nov 10 '22 at 07:37
  • Ok, I was wrong about chaining loc and iloc; it still raises the warning, as I double-checked in the doc. I will edit my answer on that. In general, try to avoid iterrows() or to_dict(); the performance difference is minimal, and I prefer iterrows() since you don't have to call enumerate. Try refactoring your code with apply if performance is an issue – Zelemist Nov 10 '22 at 08:11
  • I guess since your operations on labels are independent of the previous rows, you can easily vectorize them. You could use `explode` to transform your instances and labels lists into rows and `isin` to check whether the values in the instances column are in relevant_labels. – Zelemist Nov 10 '22 at 08:27
  • I think adding `pd.options.mode.chained_assignment = None` after importing pandas will resolve your problem. – sadegh arefizadeh Nov 14 '22 at 06:58
  • I'm not sure I would advise any beginner to use this option. Setting it to None suppresses the warning without fixing the cause. It's great for performance (fewer checks are made) and fine if you know exactly what you're doing with chained assignment, but chained assignment leads to unpredictable behaviour – Zelemist Nov 15 '22 at 08:47

Please note that in your case, SettingWithCopyWarning is a valid warning, because the chained assignment is not working as expected. df.iloc[idx] returns a copy of the row instead of a view into the original object. Therefore, df.iloc[idx]['labels'] = '|'.join(labels) modifies a copy of the row rather than the row of the original df. This seems to happen when the dataframe has mixed data types.

Regarding the different results from .loc and .iloc: your row labels differ from the row integer locations (probably due to a train/test split). .loc selects rows (and/or columns) by label, while .iloc selects them by integer location. When you assign to a row label that does not exist, .loc cannot find it among the existing rows, so it creates a new row.

Please find the examples after the solutions.

Solutions

Basic idea: You should avoid chained assignments and use the correct labels/integer locations.

Solution 1: reset_index and .loc

If you don't need to keep the row index, one solution is to call reset_index(drop=True) before your loop and then use your df.loc[idx, 'labels'] = '|'.join(labels).

import pandas as pd

df = pd.DataFrame({'instances': ["a", "b", "c", "d"],
                   'labels': [1, 2, 3, 4]},
                   index=[0, 2, 4, 5])
df

    instances   labels
0           a        1
2           b        2
4           c        3
5           d        4

df = df.reset_index(drop=True)
df

    instances   labels
0           a        1
1           b        2
2           c        3
3           d        4

This makes the dataframe's row labels the same as the row integer locations, so .loc[n, 'labels'] refers to the same cell as .iloc[n, col_idx], where col_idx is the integer location of the 'labels' column (.iloc does not accept column names).

Solution 2: Use column integer locations of 'labels' and .iloc

Example: Update labels of the 4th row to 100

col_idx = df.columns.get_loc("labels")  # get the column integer locations of 'labels'
df.iloc[3, col_idx] = 100
df

    instances   labels
0           a        1
2           b        2
4           c        3
5           d      100
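
Applied to the loop from the question, Solution 2 might look like the sketch below. enumerate yields each row's integer position, which is exactly what .iloc expects, so the row labels can stay as they are. The sample frame and relevant_labels are assumptions made for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical sample data with row labels that differ from positions.
df = pd.DataFrame(
    {"instances": [["cat", "dog"], ["car"]],
     "labels": ["B-cat|B-dog|O", "B-car|O"]},
    index=[3, 7],
)
relevant_labels = {"cat"}  # assumed set of instances to keep

# Integer location of the 'labels' column, looked up once.
col_idx = df.columns.get_loc("labels")

for pos, row in enumerate(df.to_dict("records")):
    labels = row["labels"].split("|")
    for instance in row["instances"]:
        if instance not in relevant_labels:
            labels = ["O" if instance in l else l for l in labels]
    # Single indexing call with two integer positions: no chained assignment.
    df.iloc[pos, col_idx] = "|".join(labels)

print(df.shape)  # (2, 2)
```

The shape is unchanged after the loop because .iloc never creates rows; an out-of-range position raises an IndexError instead.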

More Examples

Example of Valid SettingWithCopyWarning

import pandas as pd

df = pd.DataFrame({'instances': ["a", "b", "c", "d"],
                   'labels': [1, 2, 3, 4]},
                   index=[0, 2, 4, 5])
df

    instances   labels
0           a        1
2           b        2
4           c        3
5           d        4

Assume I want to update the labels of first row to 100.

df.iloc[0]['labels'] = 100
df

It returned the warning and failed to update the value.

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:1056: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cacher_needs_updating = self._check_is_chained_assignment_possible()

    instances   labels
0           a        1
2           b        2
4           c        3
5           d        4

If all columns have the same data type (e.g. all str or all int), the chained iloc assignment will work and won't raise SettingWithCopyWarning. Apparently, pandas handles mixed-type and single-type dataframes differently when it comes to chained assignments. See this post, which points to this GitHub issue.

You can also read this post or the pandas documentation to gain a better understanding of chained assignment.

Example of Additional Row by .loc

df

    instances   labels
0           a        1
2           b        2
4           c        3
5           d        4

The row labels in our example are (0, 2, 4, 5), while the row integer locations are (0, 1, 2, 3). When you assign with .loc to a label that does not exist, it creates a new row.

df.loc[1, 'labels'] = 100
df

    instances   labels
0           a        1
2           b        2
4           c        3
5           d        4
1         NaN      100
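
This is most likely how (16280, 13) grew to (16751, 13) in the question: every position that was not also a row label added a row. One way to check for the mismatch before using positions with .loc is sketched below on the same toy frame; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"instances": ["a", "b", "c", "d"],
                   "labels": [1, 2, 3, 4]},
                  index=[0, 2, 4, 5])

# True only when the row labels are exactly 0, 1, ..., len(df) - 1.
labels_match_positions = df.index.equals(pd.RangeIndex(len(df)))
print(labels_match_positions)  # False

# Positions that are not row labels; assigning to them via .loc would add rows.
missing = [pos for pos in range(len(df)) if pos not in df.index]
print(missing)  # [1, 3]

# After reset_index(drop=True), labels and positions coincide again.
df_reset = df.reset_index(drop=True)
print(df_reset.index.equals(pd.RangeIndex(len(df_reset))))  # True
```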
wavingtide