PySpark: Difference between revisions
From charlesreid1
Revision as of 00:56, 27 September 2017
Installing
Mac
Ensure you have the following software installed:
- Python 3.x distribution
- Jupyter notebook
- Java 8 JDK (link: https://www.java.com/en/download/faq/java_mac.xml)
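The prerequisites above can be checked with a short script before installing Spark. This is only a sketch: the function name `check_prerequisites` is made up for illustration, and finding `java` on the path does not confirm that it is a Java 8 JDK.

```python
import shutil
import sys


def check_prerequisites(commands=("java", "jupyter")):
    """Return a dict mapping each prerequisite to True (found) or False."""
    # The interpreter running this script stands in for "Python 3.x".
    found = {"python3": sys.version_info[0] >= 3}
    for name in commands:
        # shutil.which returns None when the command is not on the PATH.
        found[name] = shutil.which(name) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("{:<10} {}".format(name, "found" if ok else "MISSING"))
```

Run it with the Python 3 interpreter you intend to use with Spark; any line marked MISSING points at software that still needs to be installed.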
Installing with Homebrew
Install Apache Spark using Homebrew:
$ brew install apache-spark
This should put pyspark on your path:
$ which pyspark
/usr/local/bin/pyspark
Installing Manually
To get the latest version, download the Spark source from this page: http://spark.apache.org/downloads.html
Spark is built with sbt, the Scala build tool, so install it using Homebrew:
$ brew install sbt
Now unpack the Spark source archive, enter the resulting directory, and build Spark:
$ sbt assembly
Ensure Spark was built correctly by running this command from the same directory:
$ bin/pyspark
Linux
Ensure you have the following software installed:
- Python 3.x distribution
- Jupyter notebook
- Java 8 JDK (link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
Download the Spark source from this page: http://spark.apache.org/downloads.html
Next, add the sbt (Scala build tool) repository to apt and install sbt with apt-get (see https://stackoverflow.com/questions/35529913/how-to-install-sbt-on-ubuntu-debian-with-apt-get):
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
Now unpack the Spark source archive, enter the resulting directory, and build Spark:
$ sbt assembly
Ensure Spark was built correctly by running this command from the same directory:
$ bin/pyspark
Testing Out PySpark
Test the installation by running the pyspark command. It starts what looks like an ordinary Python interpreter, with Spark log output and a Spark splash banner:
$ pyspark
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/26 17:53:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/26 17:53:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/09/26 17:53:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/09/26 17:53:17 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Python version 2.7.10 (default, Feb 7 2017 00:08:15)
SparkSession available as 'spark'.
>>>
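Note that the session above reports Python 2.7.10 even though Python 3.x is listed as a prerequisite. The pyspark launcher uses whatever `python` is on the path unless the `PYSPARK_PYTHON` environment variable is set. A minimal sketch of building such an environment follows; the helper name `pyspark_env` and the `python3` executable name are assumptions for illustration.

```python
import os


def pyspark_env(python_exe="python3", base_env=None):
    """Return a copy of the environment with PYSPARK_PYTHON set."""
    # Copy so the current process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # PYSPARK_PYTHON tells the pyspark launcher which interpreter to start.
    env["PYSPARK_PYTHON"] = python_exe
    return env


# Example (requires pyspark on the PATH, so it is left commented out):
# import subprocess
# subprocess.call(["pyspark"], env=pyspark_env())
```

The same effect can be had by exporting `PYSPARK_PYTHON` in the shell before running pyspark.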
Confirm that the shell is working by checking that the sc variable holds a Spark context:
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
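Inspecting sc only shows that a context object exists. To confirm that Spark can actually run a job, a small computation can be pushed through it. The following is a sketch; the function name `smoke_test` is hypothetical, and it expects to be handed the `sc` object that the pyspark shell provides.

```python
def smoke_test(sc):
    """Run a tiny job on the given SparkContext and return the sum of 1..100."""
    # parallelize distributes a local collection as an RDD.
    rdd = sc.parallelize(range(1, 101))
    # sum() is an action, so it forces Spark to execute the job.
    return rdd.sum()


# Inside the pyspark shell, where sc is already defined:
# >>> smoke_test(sc)
# 5050
```

If the call returns 5050, the local Spark installation is able to schedule and run work.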