testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId)
I want to select the PassengerId column and make an RDD of it, but .select is not working. It says: 'RDD' object has no attribute 'select'
'RDD' object has no attribute 'select'
This means that test is in fact an RDD and not a DataFrame (which you are assuming it to be). Either convert it to a DataFrame and then apply select, or do a map operation over the RDD directly.
Please let me know if you need any help around this.
Assuming you have an RDD in which each row has the form (passenger_ID, passenger_name), you can do rdd.map(lambda x: x[0]). This works for a basic RDD.
If you use Spark's SQLContext to work with a DataFrame instead, there are functions such as select to pick columns by name.
If your RDD happens to be in the form of a dictionary, this is how it can be done using PySpark:
Define the fields you want to keep in here:
field_list = []
Create a function to keep specific keys within a dict input
def f(x):
    d = {}
    for k in x:
        if k in field_list:
            d[k] = x[k]
    return d
And just map after that, with x being an RDD row
rdd_subset = rdd.map(lambda x: f(x))
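Because the function passed to rdd.map is ordinary Python, it can be checked on a plain list before running it on Spark. The sketch below does that with the built-in map; the keys in field_list and the sample rows are hypothetical, and f is written as a dict comprehension that should behave the same as the loop version above:

```python
# Hypothetical keys to keep; replace with the fields you need.
field_list = ["PassengerId", "Name"]

def f(x):
    # Keep only the keys of dict x that are listed in field_list.
    return {k: x[k] for k in x if k in field_list}

# Made-up rows standing in for the elements of the RDD.
rows = [
    {"PassengerId": 1, "Name": "Braund", "Age": 22},
    {"PassengerId": 2, "Name": "Cumings", "Age": 38},
]

# Built-in map applies f to each row, as rdd.map would per element.
subset = list(map(f, rows))
print(subset)
```

On a real RDD of dicts, the same function can be passed directly, as in rdd.map(f).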