I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.
Given the following example MultiIndex dataframe:
import pandas as pd
import numpy as np
first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']
indices = [first, second, third]
index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])
df = pd.DataFrame(np.random.randint(10, size=(len(first)*len(second)*len(third), 4)), index=index, columns=['Val1','Val2',' Val3', 'Val4'])
Goal: I would like to retain a specific level=1 index (such as 'a') if the value of column 'Val2' corresponding to index value 1 in level=2 is greater than 5 for that level=1 index. Therefore, if this criteria is not met (i.e. column 'Val2' is less than or equal to 5 for index 1 in level=2), then the corresponding level=1 index would be removed from the dataframe. If all level=1 indices do not meet the criteria for a given level=0 index, then that level=0 index would also be removed. My previous post contains my expected output (I can add it here, but I wanted this post to be as succinct as possible for clarity).
Here is my current solution, the performance of which I'm sure can be improved:
grouped = df.groupby(level=0)
output = pd.concat([grouped.get_group(key).groupby(level=1).filter(lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2']>5).any()) for key, group in grouped])
This does produce my desired output, but for a dataframe with 100,000's of rows, the performance is rather poor. Is there something obvious I am missing here to better utilize the under-the-hood optimization of pandas?