Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At present, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
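As a rough illustration of the data-flow style described above, here is a minimal word-count sketch in Pig Latin. The input and output paths (`input.txt`, `wordcount_out`) and the relation names are hypothetical and chosen only for this example. `TOKENIZE`, `FLATTEN`, and `COUNT` are built-in Pig operators and functions.

```pig
-- Hypothetical input: a plain-text file with one line of text per record.
lines   = LOAD 'input.txt' AS (line:chararray);

-- Split each line into words; FLATTEN turns the bag of tokens into one row per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together, then count the rows in each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Hypothetical output directory.
STORE counts INTO 'wordcount_out';
```

Each statement names an intermediate relation, so the script reads as a sequence of transformations. Pig's compiler decides how to turn that sequence into parallel jobs.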

Official Website: https://pig.apache.org/

7 questions
2 votes, 0 answers

Pig Rank function not generating rank in output

I am facing this bizarre issue while using Apache Pig rank utility. I am executing the following code: email_id_ranked = rank email_id; store email_id_ranked into '/tmp/'; So, basically I am trying to get the following…
Ankit
  • 406
  • 2
  • 8
2 votes, 1 answer

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns patientid, featureid and featurevalue (all int). I group the relation based on featureid and want to calculate the max feature value…
Akshay Gupta
  • 131
  • 3
2 votes, 2 answers

Unable to parse XML in Pig

I have an XML file that has this structure (not exactly a tree, though)
sc3339
  • 21
  • 3
1 vote, 0 answers

Pig is not able to read the complete data

I am trying to load a huge dataset of around 3.4 TB with approximately 1.4 million files in Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not getting loaded completely, and the program is terminating with…
shanky_thebearer
  • 373
  • 1
  • 3
  • 11
1 vote, 0 answers

Hadoop/Pig Aggregate Data

I am working on a project with two data sets: a time vs. speed data set (let's call it traffic) and a time vs. weather data set (called weather). I am looking to find a correlation between these two sets using Pig. However, the traffic data set has…
BigDataDude
  • 111
  • 1
  • 2
  • 6
0 votes, 1 answer

Extract company names/job titles from free text

I have a complete Hadoop platform with HDFS, MR, Hive, PIG, Hbase, etc., Python, R, Java. All data sets have a large size. The data set A, describing the jobs of people working in a company, is composed of the following fields: Id Person: a unique…
user17241
  • 151
  • 1
  • 7
0 votes, 1 answer

Convert date into number - Apache PIG

Imagine that I have a field called date in the format "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this: Data_ID = FOREACH File GENERATE…
João_testeSW
  • 179
  • 2
  • 3
  • 13