Hive is an all-in-one project management tool developed to “help teams move faster” regardless of how they work. Features are created based on users’ requests and are updated weekly, making Hive the world’s first democratic software platform. It’s best known for its capabilities in project management, time management, team collaboration, automation, and an array of integrations with third-party software. Hive is free for solo users, with premium versions available to teams and enterprises.
Capabilities | |
---|---|
Segment | |
Deployment | Cloud / SaaS / Web-Based, Mobile Android, Mobile iPad, Mobile iPhone |
Support | 24/7 (Live rep), Chat, Email/Help Desk, FAQs/Forum, Knowledge Base, Phone Support |
Training | Documentation |
Languages | English |
Schema on HDFS files of any format. Easy to download the data. A complete tool, similar to database tools like Toad.
Performance: sometimes it is very difficult to run queries. The GUI could be improved with more user-friendly options.
Data processing for regulatory reporting ...maintain lineage
I was a top fan of Impala for a while until I reached a series of limitations that were impossible to overcome. I work a lot with arrays and just the fact of being able to use array_contains in Impala made me switch to Hive. Also, we are moving fast in the direction of self-made macros for Hive that let us do complex queries without lateral view explodes.
Session creation takes a while, and speed is quite slow compared to Impala.
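For readers unfamiliar with the two approaches this review contrasts, the following is a minimal Python sketch (not HiveQL) of their semantics: an `array_contains`-style filter tests membership on the array column directly, while a `LATERAL VIEW explode`-style query first produces one row per array element and then filters. The rows, the `tags` column, and both function names are hypothetical and chosen only for illustration.

```python
# Hypothetical rows: each has an id and an array-typed column "tags".
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["c"]},
    {"id": 3, "tags": ["b", "b"]},
]


def filter_array_contains(rows, value):
    """Roughly: SELECT id FROM t WHERE array_contains(tags, value)."""
    return [r["id"] for r in rows if value in r["tags"]]


def filter_via_explode(rows, value):
    """Roughly: SELECT DISTINCT id FROM t
    LATERAL VIEW explode(tags) x AS tag WHERE tag = value."""
    # explode: one (id, tag) pair per array element
    exploded = [(r["id"], tag) for r in rows for tag in r["tags"]]
    matching = [id_ for id_, tag in exploded if tag == value]
    # DISTINCT, keeping first-seen order
    return list(dict.fromkeys(matching))


contains_ids = filter_array_contains(rows, "b")
explode_ids = filter_via_explode(rows, "b")
```

Both functions select the same ids here; the explode variant needs an extra de-duplication step because a row with repeated elements is expanded into several rows.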
Complex data analysis with tables that have several billion rows by partition
It is highly flexible in its configuration. There are many options for loading data: directly from the Linux file system or from HDFS. You can create external and managed tables. One fun feature is that you can run bash commands from Hive as well.
It cannot be used for streaming data. Error logging can be improved so that error tracking and resolution can be more efficient.
It is used to transform and process big data datasets in batches. It can handle terabytes of data. The predicate pushdown feature has greatly improved query performance, and the developer doesn't need to think about it anymore.
Hive is great for handling logs in big data projects. We are using it in our project, and it is great for joins and grouping, which are very difficult and tricky in MapReduce. It has a lot of UDF packages and it is very easy to add new UDFs. We were also using bucketing and clustering to optimize queries. The concept of external tables, and the way we can still manipulate the data even when the table is deleted from Hive, is really amazing. Lots of connectors are available on the market for different software.
What I dislike is the latency and the way it saves data. When inserting data I have to wait a long time even for a few records. The compiler's execution planning is very immature, as it does not do proper query optimization. The community is working fast to overcome this, but I think it will take time for Hive to be
We are using Hive mainly for saving our logs. It helps us keep track of which records were inserted, which records failed, and what the relationships between them are. We are using Tableau for analyzing the data.
The progression of features, speed, etc. gives me the strategic confidence I need in the SQL-on-Hadoop space.
At this point, everything is on point & in theory it is great in Hive 1.2.
Deriving value from masses of unstructured & structured data.
The ability to view HDFS data in a relational format and easily query it through HiveQL
The fact that it uses MapReduce whether you query a pre-existing table or perform a complex query. Tez helps with this issue. Also, the inability to delete/update data is a real issue and forces other services to be used, e.g. HBase.
The ability to use Hive on HUE is perfect. We are building a platform for data scientists (prefer GUI to shell) to perform analysis so removing the need for command line is excellent.
It is very simple to use because you feel like you are using plain SQL to query data. When I started I didn't have any experience with Hive, and in about one week I was able to query big data and do some analysis. In a month I was able to administer data and create my own databases with the useful data.
Not many functions are implemented in Hive. There are very useful window functions, but not enough of them. It's not that simple to modify data inside a table.
Analyzing user experience and user behavior in the application or web client every day, every hour, or even every minute, etc.
Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets. Most BI and SQL developer tools can connect to Hive as easily as to any other database.
Unable to cancel a running query. Query tuning is difficult compared to an RDBMS.
We had a requirement to scan a large dataset for our prediction algorithm. Initially we used an RDBMS, but the performance was very slow and users were not happy with it. We replaced the RDBMS with Hive and saw a drastic improvement in performance.
HiveQL is much like SQL and really easy to learn.
Doesn't work well if you want low-latency queries.
Performance for 1 TB of data.
If you know SQL you will be able to get Hive really quickly. Lots of the same functionality but not exactly SQL. Easy to create tables and start writing queries allowing you to dive deeper into your data.
As with all Hadoop tools, there are lots of knobs to tweak. It takes a good bit of time to optimize and finely tune your Hive install.
Putting structure on unstructured data. Once we chose Hive to accomplish the aforementioned task, we were able to bring our data to our data scientists quickly. An easier degree of acceptance of the Big Data idea.
Easy-to-use interface; multiple clients (CLIs); easy to debug issues with the help of fully descriptive logs; the product is constantly being improved to meet all DB developer requirements; can be accessed from multiple applications; access through Knox for additional security; no indexing; multiple file formats; the Tez architecture.
Authentication gaps; issues when routing through ZooKeeper; not as mature a tool as regular database tools.
The BI team is helping all enterprise users ingest and access data from Hadoop; most of the users are well versed in standard SQL tools; to make Hadoop an enterprise-wide solution, we are training all users on Hive.
Hive has a simple and intuitive interface and gets the job done.
So far Hive has met and exceeded all my expectations.
Working on a Hadoop system to determine recruiters that are spamming members too much.
Its performance using distributed computation.
Limited options for query performance optimization
It is very good for OLAP related tasks
Leverage SQL skills to perform operations on data stored in Hadoop.
Works on the MapReduce algorithm, so the retrieval of data is a little slow.
Allowed business users to query data using SQL skills.
The best thing about Hive is that anyone who is familiar with SQL can take advantage of Hive's ability to run MapReduce jobs. Newer versions of Hive are getting better at supporting windowing functions and ironing out inconsistencies. So far the documentation is good enough for getting me through my tasks, and there is still ongoing support for this product, which is a pretty good sign to me.
Older versions of Hive suck. There are lots of limitations that will force you to write HiveQL queries that are not straightforward and are potentially even inefficient. For example, no support for window functions and no non-equality comparisons on joins can make your life very difficult, so you will need to fall back to some wacky full joins or self joins to accomplish the same task.
We are using Hive as a data warehouse. One of the benefits of Hive is that it can break your SQL queries into a series of MapReduce jobs, so it's supposed to speed up your queries if given enough compute nodes.
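As a rough illustration of what "breaking a SQL query into MapReduce jobs" means, here is a minimal single-process Python sketch of how a `SELECT status, COUNT(*) ... GROUP BY status` style query could be expressed as map, shuffle, and reduce phases. It is not how Hive's planner is implemented; the `records` data, the `status` column, and the function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical log records; "status" stands in for the GROUP BY column.
records = [
    {"status": "ok"},
    {"status": "failed"},
    {"status": "ok"},
]


def map_phase(records):
    # Emit one (key, 1) pair per input row.
    return [(r["status"], 1) for r in records]


def shuffle_phase(pairs):
    # Group values by key, as the framework would do between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the values for each key to get the per-group count.
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(records)))
```

On a cluster, the map and reduce steps run in parallel on different nodes over different slices of the data, which is where the speed-up the reviewer mentions would come from.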
Hive is the best out there for answering ad-hoc queries in a parallel paradigm. It works very well with the Hadoop ecosystem (mainly, it integrates perfectly with HDFS). It is also easy to use, as it implements most SQL functions.
Needs more optimization for complex queries (like caching, auto-partitioning, etc.) to reduce query latency. Tuning the Hive parameters is really challenging for users, and the default settings don't work with large queries. Hive is perfect if 90-95% of the queries are read-only; it is not suitable for applications with heavy updates.
Get quick insights from big data when the customers' data doesn't fit on one machine. It helps a lot with data preparation (i.e. creating temporary tables) that can be consumed by other machine learning solutions like Spark to build machine learning models that add more business value.
For all its processing power, Pig requires programmers to learn something on top of SQL. It requires learning and mastering something new. Hive statements are remarkably similar to SQL and despite the limitations of Hive Query Language (HQL) in terms of the commands that it understands, it is still very useful. Hive provides an excellent open source implementation of MapReduce. It works well when it comes to processing data stored in a distributed manner, unlike SQL which requires strict adherence to schemas while storing data.
Despite the working differences, once you enter the Hive world from SQL, similarity in language ensures smooth transition but it is important to note the differences in constructs and syntax, else you’re in for frustrating times.
Data extraction, processing, and analysis. It's fast.
Stable product; Easy to use; Multiple computation engines - Tez, MR; Almost all SQL capabilities;
Delete support is still not there even though they are nearly there.
Primary Querying engine for Data Analytics
Provides quick results based on a Hadoop database, with an easy-to-use interface and simple setup steps.
Some quirks with HiveQL may require referencing the documentation, but there is a lot of similarity with other SQL based languages.
Data analytics, making vast amounts of data available for general BI uses
The best part is being able to use a familiar syntax.
Doesn't support all MySQL use cases (understandably).
Ad-hoc queries on ETL'd production data.