How to select particular column in Spark(pyspark)?

Question

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId)

I want to select the PassengerId column and make an RDD from it, but .select is not working. It fails with: 'RDD' object has no attribute 'select'

score 4 · Answer 1 · edited Oct 20 '16 at 09:24

Assuming test is a DataFrame, you could try the following:

testPassengerId = test.select('PassengerId').rdd

This selects the PassengerId column and converts it into an RDD of Row objects.
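To make this concrete, here is a minimal sketch, assuming test is a DataFrame with a PassengerId column. The helper names passenger_id and passenger_id_rdd are made up for illustration. Since .rdd yields Row objects rather than bare values, a further map is needed to unwrap them; in Spark 2.x and later a PySpark DataFrame has no map method of its own, so the map has to come after .rdd.

```python
def passenger_id(row):
    """Return the PassengerId field of a Row (or any object with that attribute)."""
    return row.PassengerId


def passenger_id_rdd(df):
    """Select the PassengerId column of a DataFrame and return an RDD of plain values.

    Assumes df is a pyspark.sql.DataFrame that has a PassengerId column.
    """
    # select() returns a one-column DataFrame; .rdd exposes it as an RDD of Row objects,
    # and map() unwraps each Row into the bare value.
    return df.select("PassengerId").rdd.map(passenger_id)
```

With a real DataFrame this would be used as testPassengerId = passenger_id_rdd(test).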

answered Oct 20 '16 at 02:25 by user25409 · edited Oct 20 '16 at 09:24 by Stereo

score 3 · Answer 2 · answered May 18 '16 at 09:52

'RDD' object has no attribute 'select'

This means that test is actually an RDD, not a DataFrame as you are assuming. Either convert it to a DataFrame and then apply select, or do a map operation over the RDD.

Please let me know if you need any help around this.
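The conversion route could look roughly like this. It is only a sketch: it assumes each RDD record is a tuple, that spark is an existing SparkSession (a SQLContext exposes the same createDataFrame method), and that you know the column names. The function name column_via_dataframe is hypothetical.

```python
def column_via_dataframe(spark, rdd, column_names, wanted):
    """Convert an RDD of tuples to a DataFrame, then select one column by name.

    spark        -- an existing SparkSession (assumed to be created elsewhere)
    rdd          -- an RDD whose records are tuples of equal length
    column_names -- list of names, one per tuple position
    wanted       -- the column to keep
    """
    # createDataFrame accepts an RDD plus a list of column names as the schema.
    df = spark.createDataFrame(rdd, column_names)
    return df.select(wanted)
```

For example, column_via_dataframe(spark, test, ["PassengerId", "Name"], "PassengerId") would give a one-column DataFrame, on which .rdd can then be called if an RDD is what you need.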

score 3 · Answer 3 · answered May 18 '16 at 11:11

Assuming you have an RDD in which each row is of the form (passenger_ID, passenger_name), you can do rdd.map(lambda x: x[0]) to keep only the first field. This works for a basic RDD.

If you use Spark's SQLContext and work with DataFrames instead, you can select a column by name with select.

answered May 18 '16 at 11:11 by wabbit

score 0 · Answer 4 · edited Nov 27 '17 at 16:26

If each row of your RDD is a dictionary, this is how it can be done using PySpark:

Define the fields you want to keep here:

field_list = []

Create a function that keeps only those keys of a dict input:

def f(x):
    d = {}
    for k in x:
        if k in field_list:
            d[k] = x[k]
    return d

Then map it over the RDD, with x being one RDD row:

rdd_subset = rdd.map(lambda x: f(x))
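The same idea can be written more compactly with a dict comprehension. This is a sketch, not a replacement for the code above: the name keep_fields, the field names, and the sample record are invented for illustration, and a set is used for the field list so that membership checks stay cheap.

```python
# Hypothetical field names, used only for illustration.
field_list = {"PassengerId", "Name"}


def keep_fields(record, fields=field_list):
    """Return a copy of a dict record containing only the requested keys."""
    return {key: value for key, value in record.items() if key in fields}


# Local check on a plain dict with made-up values; no Spark is needed for this part.
row = {"PassengerId": 892, "Name": "Kelly", "Age": 34.5}
print(keep_fields(row))
```

On an RDD of dicts this would be applied as rdd_subset = rdd.map(keep_fields).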
